Research Article | Open Access
Volume 3 | Issue 3 | Year 2012 | Article Id. IJCTT-V3I3P105 | DOI: https://doi.org/10.14445/22312803/IJCTT-V3I3P105
Extended CurlCrawler: A focused and path-oriented framework for crawling the web with thumb
Dr Ela Kumar, Ashok Kumar
Citation:
Dr Ela Kumar, Ashok Kumar, "Extended CurlCrawler: A focused and path-oriented framework for crawling the web with thumb," International Journal of Computer Trends and Technology (IJCTT), vol. 3, no. 3, pp. 327-335, 2012. Crossref, https://doi.org/10.14445/22312803/IJCTT-V3I3P105
Abstract
Information has always played a vital role, evolving from collections held locally in churches and books to the World Wide Web (WWW), now the largest and most up-to-date repository of information available to everyone, everywhere, at any time [1]. The web is a major arena of engineering effort, yet it has evolved without any grand design or blueprint, and an age has arrived in which information itself is an instrument, a tool that can be used to solve many problems. The biggest challenge posed by the Internet is its ever-growing size: the WWW hosts a seemingly endless pool of information, and it is difficult to identify and reach the desired information among the large set of web pages returned by a search engine while keeping noise low. As the Internet continues to grow, the problem grows with it. Crawlers can retrieve data much more quickly and in greater depth than human searchers, so a poorly designed crawler can have a crippling impact on the performance of a site [7, 17]. Building a basic web crawler is not a difficult task, but choosing the right strategies and an effective architecture leads to a multi-agent framework and a highly featured crawler application [2, 3]. This paper presents an experimental effort to develop and implement an extended framework and architecture that makes search engines more efficient by exploiting the local resource utilization features of the programming environment. The work reports implementation experience with a focused and path-oriented approach that provides a cross-featured, human-powered framework for search engines. In addition to cURL programming, personalization of information, caching, and graphical perception, the main features of this framework are that it is cross-platform, cross-architecture, focused, path-oriented, and human-powered.
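The focused, path-oriented, cURL-based crawling idea described above can be illustrated with a minimal sketch. This is not the paper's Extended CurlCrawler implementation; it is an assumed example using Python's pycurl binding to libcurl, and the seed URL, topic terms, and helper names (fetch, is_relevant, extract_links, crawl) are hypothetical choices introduced only for illustration.

# Hedged sketch of a focused, path-oriented crawl loop built on libcurl via pycurl.
# SEED_URL, TOPIC_TERMS and the helper functions are assumptions for this example,
# not names taken from the Extended CurlCrawler framework.
from io import BytesIO
from urllib.parse import urljoin
import re

import pycurl

SEED_URL = "http://example.com/"      # assumed starting point
TOPIC_TERMS = {"crawler", "search"}   # assumed topic vocabulary
MAX_PAGES = 20

def fetch(url):
    """Download a page with libcurl and return its body as text."""
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEDATA, buf)
    c.setopt(pycurl.FOLLOWLOCATION, True)
    c.setopt(pycurl.TIMEOUT, 10)
    c.perform()
    c.close()
    return buf.getvalue().decode("utf-8", errors="replace")

def is_relevant(text):
    """Crude topical filter: keep pages that mention any topic term."""
    lowered = text.lower()
    return any(term in lowered for term in TOPIC_TERMS)

def extract_links(base_url, text):
    """Pull href values and resolve them against the current page URL."""
    return [urljoin(base_url, href)
            for href in re.findall(r'href=["\'](.*?)["\']', text)]

def crawl(seed):
    frontier, seen, kept = [seed], set(), []
    while frontier and len(seen) < MAX_PAGES:
        url = frontier.pop(0)          # FIFO frontier: pages are visited path by path
        if url in seen:
            continue
        seen.add(url)
        try:
            page = fetch(url)
        except pycurl.error:
            continue
        if is_relevant(page):          # focused step: expand only on-topic pages
            kept.append(url)
            frontier.extend(extract_links(url, page))
    return kept

if __name__ == "__main__":
    for url in crawl(SEED_URL):
        print(url)

In this sketch the FIFO frontier gives a breadth-first, path-oriented visiting order, while the relevance check makes the crawl focused by expanding only pages that match the topic vocabulary; a real framework would add politeness delays, caching, and persistence.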
Keywords
Topical, SOAP, Interacting Agent, WSDL, Thumb, Whois, CachedDatabase, IECapture, Searchcon, Main_spider, UDDI.
References
[1]. Segev, Elad (2010). Google and the Digital Divide: The Biases of Online Knowledge, Oxford: Chandos Publishing.
[2]. Vaughan, L. & Thelwall, M. (2004). Search engine coverage bias: evidence and possible causes. Information Processing & Management, 40(4), 693-707.
[3]. Gandal, Neil (2001). "The dynamics of competition in the internet search engine market". International Journal of Industrial Organization 19 (7): 1103–1117.
[4]. Kobayashi, M. and Takeda, K. (2000). "Information retrieval on the web". ACM Computing Surveys (ACM Press).
[5]. Steve Lawrence; C. Lee Giles (1999). "Accessibility of information on the web". Nature 400 (6740): 107–9.
[6]. Zeinalipour-Yazti, D. and Dikaiakos, M. D. (2002). Design and implementation of a distributed crawler and filtering processor. In Proceedings of the Fifth Next Generation Information Technologies and Systems (NGITS).
[7]. Cho, Junghoo, "Crawling the Web: Discovery and Maintenance of Large-Scale Web Data", Ph.D. dissertation, Department of Computer Science, Stanford University, November 2001.
[8]. Shkapenyuk, V. and Suel, T. (2002). Design and implementation of a high performance distributed web crawler. In Proceedings of the 18th International Conference on Data Engineering (ICDE), pages 357-368, San Jose, California. IEEE CS Press.
[9]. Edwards, J., McCurley, K. S., and Tomlin, J. A. (2001). "An adaptive model for optimizing performance of an incremental web crawler". In Proceedings of the Tenth Conference on World Wide Web (Hong Kong: Elsevier Science).
[10]. Shestakov, Denis (2008). Search Interfaces on the Web: Querying and Characterizing. TUCS Doctoral Dissertations 104, University of Turku.
[11]. Chakrabarti, S., van den Berg, M., and Dom, B. (1999). Focused crawling: a new approach to topic-specific web resource discovery. Computer Networks, 31(11–16):1623–1640.
[12]. Gray, N. A. B. (2005). "Performance of Java Middleware - Java RMI, JAX-RPC, and CORBA". University of Wollongong. pp. 31–39. Retrieved January 11, 2011.
[13]. Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh International World Wide Web Conference, pages 107-117, April 1998.
[14]. Shestakov, Denis (2008). Search Interfaces on the Web: Querying and Characterizing. TUCS Doctoral Dissertations 104, University of Turku.
[15]. Z. Smith. The Truth About the Web: Crawling towards Eternity. Web Techniques Magazine, 2(5), May 1997.
[16]. Ela Kumar et al., (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (4), 2011, 1700-1705.
[17]. Pant, Gautam; Srinivasan, Padmini; Menczer, Filippo (2004). "Crawling the Web". In Levene, Mark; Poulovassilis, Alexandra. Web Dynamics: Adapting to Change in Content, Size, Topology and Use. Springer. pp. 153–178. ISBN 9783540406761.
[18]. Cho, J.; Garcia-Molina, H.; Page, L. (April 1998). "Efficient Crawling Through URL Ordering". Seventh International World-Wide Web Conference. Brisbane, Australia.