A Comparative Study of Hidden Web Crawlers

Sonali Gupta; Komal Kumar Bhatia

doi:10.14445/22312803/IJCTT-V12P122

Volume 12 | Number 1 | Year 2014 | Article Id. IJCTT-V12P122 | DOI : https://doi.org/10.14445/22312803/IJCTT-V12P122

Citation:

Sonali Gupta, Komal Kumar Bhatia, "A Comparative Study of Hidden Web Crawlers," International Journal of Computer Trends and Technology (IJCTT), vol. 12, no. 1, pp. 111-118, 2014. https://doi.org/10.14445/22312803/IJCTT-V12P122

Abstract

A large amount of data on the WWW remains inaccessible to the crawlers of Web search engines because it is exposed only on demand, when users fill out and submit forms. The Hidden Web refers to the collection of Web data that a crawler can access only by interacting with a Web-based search form, not simply by traversing hyperlinks. Research on the Hidden Web emerged almost a decade ago, with the main line of work exploring ways to access the content of online databases that are usually hidden behind search forms. Efforts in the area mainly focus on designing Hidden Web crawlers that learn forms and fill them with meaningful values. This paper gives an insight into the various Hidden Web crawlers developed for this purpose, noting the advantages and shortcomings of the techniques employed in each.
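
As a rough illustration of what "interacting with a search form" means for a crawler, the following Python sketch parses a page, finds the text field of its first form, and builds one query URL per candidate value. It is a minimal sketch under stated assumptions, not a technique taken from any of the surveyed crawlers: the sample page, the example.org URL, the candidate values, and the names SearchFormParser and build_query_urls are all hypothetical, and only GET forms with a plain text input are handled.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class SearchFormParser(HTMLParser):
    """Collect the action, method, and named text inputs of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None      # stays None if the page has no form
        self.method = "get"
        self.fields = []        # names of text inputs inside the form
        self._in_form = False
        self._done = False      # True once the first form has been closed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form:
            # An <input> without a type attribute defaults to a text field.
            if (attrs.get("type") or "text") == "text" and attrs.get("name"):
                self.fields.append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_query_urls(page_url, html, candidate_values):
    """Return one submission URL per candidate value (hypothetical helper)."""
    parser = SearchFormParser()
    parser.feed(html)
    if parser.action is None or parser.method != "get" or not parser.fields:
        return []  # no form, a non-GET form, or no text field to fill
    target = urljoin(page_url, parser.action)
    field = parser.fields[0]
    return [f"{target}?{urlencode({field: value})}" for value in candidate_values]


# Hypothetical page: a hyperlink crawler would follow only /about.html,
# while the content behind /search is reachable only by submitting the form.
SAMPLE_PAGE = """
<html><body>
  <a href="/about.html">About</a>
  <form action="/search" method="get">
    <input type="text" name="q">
    <input type="submit" value="Go">
  </form>
</body></html>
"""

urls = build_query_urls(
    "http://example.org/index.html", SAMPLE_PAGE, ["hidden web", "crawler"]
)
for url in urls:
    print(url)
```

Choosing which values to submit is the hard part that the sketch leaves out; here the candidate list is simply supplied by hand.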

Keywords

WWW, Surface Web, Hidden Web, Deep Web, Crawler, search form, Surfacing, Virtual Integration.
