Web Crawler: Extracting the Web Data

International Journal of Computer Trends and Technology (IJCTT)          
 
© 2014 by IJCTT Journal
Volume-13 Number-3
Year of Publication: 2014
Authors: Mini Singh Ahuja, Dr Jatinder Singh Bal, Varnica
DOI: 10.14445/22312803/IJCTT-V13P128

MLA

Mini Singh Ahuja, Dr Jatinder Singh Bal, Varnica. "Web Crawler: Extracting the Web Data". International Journal of Computer Trends and Technology (IJCTT) V13(3):132-137, July 2014. ISSN: 2231-2803. www.ijcttjournal.org. Published by Seventh Sense Research Group.

Abstract
Internet usage has grown considerably in recent years. Users locate resources by following hypertext links from page to page. This growth has led to the development of web crawlers, the programs that full-text search engines rely on to help users navigate the web. Crawled data can also support further research, for example finding missing links or detecting communities in complex networks. In this paper we review web crawlers: their architecture, their types, and the challenges that arise when search engines use them.
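
To make the idea of following hypertext links concrete, here is a minimal sketch of a blind (breadth-first) traversal, one of the strategies named in the keywords. It is an illustration only and is not taken from the paper: the function name `crawl_breadth_first`, the `max_pages` limit, and the in-memory `toy_web` link graph are all assumptions, and a real crawler would fetch and parse pages over the network instead of reading a dictionary.

```python
from collections import deque


def crawl_breadth_first(link_graph, seed, max_pages=100):
    """Return pages reachable from `seed` in breadth-first visiting order.

    `link_graph` is a hypothetical stand-in for the web: it maps each
    page URL to the list of URLs that the page links to.
    """
    frontier = deque([seed])  # URLs discovered but not yet visited
    seen = {seed}             # URLs already queued, to avoid revisits
    visited = []              # URLs in the order they were visited

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        # Queue every outgoing link that has not been seen before.
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Toy link graph with made-up URLs (assumption for illustration).
toy_web = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/"],
    "b.example/": ["c.example/", "a.example/1"],
}

order = crawl_breadth_first(toy_web, "a.example/")
print(order)
```

A best-first heuristic crawler would differ mainly in the frontier: instead of a first-in, first-out queue it would keep a priority queue ordered by some relevance score for each URL.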

Keywords
web crawler, blind traversal algorithms, best-first heuristic algorithms