Classification of Page to the aspect of Crawl Web Forum and URL Navigation

International Journal of Computer Trends and Technology (IJCTT)          
 
© 2015 by IJCTT Journal
Volume-20 Number-1
Year of Publication : 2015
Authors : Yerragunta Kartheek, T.Sunitha Rani

MLA

Yerragunta Kartheek, T.Sunitha Rani "Classification of Page to the aspect of Crawl Web Forum and URL Navigation". International Journal of Computer Trends and Technology (IJCTT) V20(1):7-11, Feb 2015. ISSN:2231-2803. www.ijcttjournal.org. Published by Seventh Sense Research Group.

Abstract -
The Web has several technical meanings, but it is widely regarded as the backbone of today's information technology world. Web-based data is crawled across the surface of the network we call the Internet. Both the semantic and the syntactic Web have their own rules and implementation procedures, which follow established protocols, but they also raise problems such as classifying pages and identifying URL patterns. In this paper, we present a page classification approach based on metadata and page descriptions. Considering the millions of web forums and their pros and cons, we implement pattern-matching-based navigation that leads the user to the URL of the corresponding information. To classify URL navigation for information retrieval, also called data mining (what is data to one person may be information to another, and vice versa), we implement regular-expression and semantic pattern matching for data processing based on the URL type and the description in the meta tag.
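As a rough illustration of the idea in the abstract, the sketch below assigns a forum URL to a type by regular-expression matching and reads the page's meta description. It is not the authors' implementation: the three URL types follow the index/thread/page-flipping terms in the keywords, while the patterns, the `classify_url` and `classify_page` helpers, and the example URLs are all assumptions chosen for illustration. Real forums would need patterns learned per site.

```python
import re
from html.parser import HTMLParser

# Hypothetical patterns, checked in order; page-flipping is tested first
# because a paginated thread URL would otherwise match the thread pattern.
URL_PATTERNS = [
    ("page-flipping", re.compile(r"[?&]page=\d+")),
    ("thread", re.compile(r"/(?:thread|topic|showthread)\b")),
    ("index", re.compile(r"/(?:forum|board|forumdisplay)\b")),
]


def classify_url(url):
    """Return the first URL type whose pattern matches, else 'other'."""
    for label, pattern in URL_PATTERNS:
        if pattern.search(url):
            return label
    return "other"


class MetaDescriptionParser(HTMLParser):
    """Collects the content of <meta name="description" ...>, if present."""

    def __init__(self):
        super().__init__()
        self.description = ""

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "description":
                self.description = attr.get("content") or ""


def classify_page(url, html):
    """Pair the URL type with the page's meta description (may be empty)."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    return classify_url(url), parser.description
```

For example, `classify_page("http://example.com/showthread.php?t=10", html)` would return the type `"thread"` together with whatever description the page declares. A fuller classifier could also match keywords in that description when the URL alone is ambiguous.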

Keywords
EIT path, forum crawling, ITF regex, page classification, page type, URL pattern learning, URL type.