Extraction of Unstructured Data Records and Discovering New Attributes from the Web Documents

International Journal of Computer Trends and Technology (IJCTT)          
© 2014 by IJCTT Journal
Volume-17 Number-3
Year of Publication : 2014
Authors : Padmapriya.G, Dr.M.Hemalatha
DOI :  10.14445/22312803/IJCTT-V17P124


Padmapriya.G, Dr.M.Hemalatha. "Extraction of Unstructured Data Records and Discovering New Attributes from the Web Documents". International Journal of Computer Trends and Technology (IJCTT) V17(3):125-132, Nov 2014. ISSN:2231-2803. www.ijcttjournal.org. Published by Seventh Sense Research Group.

Abstract -
Information extraction is the automatic extraction of structured information from online databases. Its main goal is to extract the correct portions of text from documents accurately. The Web contains numerous lists of objects, such as conference programs and comment lists in blogs. Lists of objects are extracted from the Web through record extraction, which discovers a set of Web page segments. To extract data records, a method called tag path clustering is proposed. This method captures a list of objects more robustly, based on a holistic analysis of a Web page. Its main focus is how distinct tag paths appear repeatedly in the document. The occurrence patterns of a pair of tag paths, called visual signals, are compared to estimate how likely it is that the two tag paths represent the same list of objects. A similarity measure then captures how closely the tag paths appear and interleave. Based on this similarity measure, tag paths are clustered to extract the sets of tag paths that form the structure of the data records. A Bayesian learning framework is proposed to adapt information extraction knowledge previously learned from a source Web site to a new, unseen site and to discover previously unseen attributes. Bayesian learning techniques improved with expectation maximization are used to find new training data for learning the wrapper for the new, unseen site. This method effectively extracts attributes from the new, unseen Web site. Experimental results show that the framework achieves promising performance.
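
The tag path idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names TagPathCollector, similarity, cluster_tag_paths and SAMPLE_HTML are hypothetical, the similarity score (alternation rate times occurrence-count balance) is a simplified stand-in for the measure used in the paper, and simple single-link grouping with an assumed threshold stands in for the clustering step.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag; they are recorded but not pushed.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagPathCollector(HTMLParser):
    """Record, for every tag path, the positions at which it occurs."""

    def __init__(self):
        super().__init__()
        self.stack = []                       # currently open tags
        self.counter = 0                      # running index of start tags
        self.occurrences = defaultdict(list)  # tag path -> occurrence positions

    def handle_starttag(self, tag, attrs):
        path = "/".join(self.stack + [tag])
        self.occurrences[path].append(self.counter)
        self.counter += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate unclosed tags: pop until the matching tag has been removed.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def similarity(pos_a, pos_b):
    """Illustrative score in [0, 1]: high when two paths interleave evenly."""
    merged = sorted([(p, 0) for p in pos_a] + [(p, 1) for p in pos_b])
    labels = [label for _, label in merged]
    if len(labels) < 2:
        return 0.0
    switches = sum(1 for x, y in zip(labels, labels[1:]) if x != y)
    alternation = switches / (len(labels) - 1)
    balance = min(len(pos_a), len(pos_b)) / max(len(pos_a), len(pos_b))
    return alternation * balance


def cluster_tag_paths(html, threshold=0.8, min_occurrences=2):
    """Group repeating tag paths whose occurrence patterns are similar."""
    collector = TagPathCollector()
    collector.feed(html)
    signals = {path: pos for path, pos in collector.occurrences.items()
               if len(pos) >= min_occurrences}
    paths = sorted(signals)
    cluster_of = {path: i for i, path in enumerate(paths)}
    for i, a in enumerate(paths):
        for b in paths[i + 1:]:
            if similarity(signals[a], signals[b]) >= threshold:
                old, new = cluster_of[b], cluster_of[a]
                if old != new:
                    # Merge the two groups (single-link grouping).
                    for p in paths:
                        if cluster_of[p] == old:
                            cluster_of[p] = new
    groups = defaultdict(list)
    for path in paths:
        groups[cluster_of[path]].append(path)
    return sorted(groups.values())


SAMPLE_HTML = """
<html><body>
<h1>Program</h1>
<ul>
<li><b>Talk 1</b><span>Author 1</span></li>
<li><b>Talk 2</b><span>Author 2</span></li>
<li><b>Talk 3</b><span>Author 3</span></li>
</ul>
<p>footer</p>
</body></html>
"""

if __name__ == "__main__":
    for cluster in cluster_tag_paths(SAMPLE_HTML):
        print(cluster)
```

On the sample page, the three tag paths under the list (li, li/b and li/span) occur in strict alternation and end up in one group, which corresponds to the structure of one data record; tag paths that occur only once, such as the heading and footer, are filtered out before grouping.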

Keywords -
Information extraction, data record extraction, clustering, wrapper adaptation