Web data extraction using the approach of segmentation and parsing

International Journal of Computer Trends and Technology (IJCTT)          
 
© - September Issue 2013 by IJCTT Journal
Volume-4 Issue-9
Year of Publication: 2013
Authors: P. Singam, Prof. P. Pardhi

MLA

P. Singam, Prof. P. Pardhi. "Web data extraction using the approach of segmentation and parsing." International Journal of Computer Trends and Technology (IJCTT), V4(9):3200-3206, September Issue 2013. ISSN 2231-2803. www.ijcttjournal.org. Published by Seventh Sense Research Group.

Abstract: Given a set of URLs, automatically extracting the data from the corresponding result pages is very important for many applications, such as data integration, that need to cooperate with multiple web databases. In this paper we present a method that extracts the data of interest from the identified data regions, filters out the unwanted data records, and finally puts the extracted data into a table or exports it to CSV files. The extraction procedure includes segmentation of contiguous as well as non-contiguous data regions, filtration of noise, and application of parsers. The result is improved efficiency and better control over the extraction procedure, which our experimental results confirm.
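
The pipeline outlined in the abstract (locate data regions, pull out records, filter noise, export to CSV) can be illustrated with a minimal sketch. This is not the authors' implementation: the sample page, the `region` and `record` class names, the `title` and `price` fields, and the helper names `RecordParser`, `extract_records`, and `to_csv` are all assumptions made for illustration. It uses only the Python standard library and assumes records are flat (no nested `div` elements inside a record).

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical result page: two non-contiguous data regions separated by noise.
SAMPLE_PAGE = """
<div class="region">
  <div class="record"><span class="title">Book A</span><span class="price">10.50</span></div>
  <div class="record"><span class="title">Book B</span><span class="price">7.25</span></div>
</div>
<div class="ad">Sponsored link</div>
<div class="region">
  <div class="record"><span class="title">Book C</span><span class="price">3.00</span></div>
  <div class="record"><span class="title">Advertisement</span></div>
</div>
"""

# Assumed schema: every valid record carries both fields.
FIELDS = ("title", "price")


class RecordParser(HTMLParser):
    """Collects one dict per <div class="record">, keyed by the span's class."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record being filled, or None outside a record
        self._field = None    # field name of the currently open <span>, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "record":
            self._current = {}
        elif tag == "span" and self._current is not None and css_class in FIELDS:
            self._field = css_class

    def handle_data(self, data):
        # Text outside a known field (ads, whitespace) is ignored.
        if self._field is not None:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "div" and self._current is not None:
            # Closing a record div; region and ad divs never set _current.
            self.records.append(self._current)
            self._current = None


def extract_records(html):
    """Parse the page, then drop records that lack any expected field."""
    parser = RecordParser()
    parser.feed(html)
    parser.close()
    return [r for r in parser.records if all(r.get(f) for f in FIELDS)]


def to_csv(records):
    """Serialize the filtered records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(extract_records(SAMPLE_PAGE)))
```

In this sketch the filtration step is a simple completeness check, so the record that has a title but no price is treated as noise and discarded, while records from both regions end up in the same CSV output.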

 


Keywords: Data region, Data extraction, DOM structure, Harvesting, Web data.