Optimized Content Extraction from web pages using Composite Approaches

Sheba Gaikwad; G. Naveen Sundar

doi:https://doi.org/10.14445/22312803/IJCTT-V4I3P149

Volume 4 | Issue 3 | Year 2013 | Article Id. IJCTT-V4I3P149 | DOI : https://doi.org/10.14445/22312803/IJCTT-V4I3P149

Citation:

Sheba Gaikwad, G. Naveen Sundar, "Optimized Content Extraction from web pages using Composite Approaches," International Journal of Computer Trends and Technology (IJCTT), vol. 4, no. 3, pp. 450-453, 2013. Crossref, https://doi.org/10.14445/22312803/IJCTT-V4I3P149

Abstract

The volume of information available on the web today is enormous, and it brings correspondingly greater challenges. Content extraction identifies the main content of a web page and removes the surrounding clutter. The main difficulties in extracting content are the newer architecture of web pages and the diversity of their structure. This work proposes a hybrid model that operates on the Document Object Model (DOM) tree of an HTML document to extract its content accurately. The model combines techniques such as statistical feature extraction and formatting characteristics. Content type identification is used alongside this collective approach to cope with the variety of web pages and to achieve higher extraction accuracy.
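
To make the idea of a statistical feature computed over a DOM tree concrete, the following is a minimal sketch and not the authors' model. It assumes one simple feature, link density (the share of a block's text that sits inside anchor tags), and it omits the formatting characteristics and content type identification mentioned above. The names `BLOCK_TAGS`, `TreeBuilder`, `block_score` and `extract_main_content`, the scoring formula, and the sample page are all hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "article", "section", "main", "td"}  # assumed candidate blocks
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # elements with no end tag
SKIP_TAGS = {"script", "style"}                           # never counted as content


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # mix of Node objects and text strings, in document order


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree from HTML markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(text)


def text_stats(node, inside_link=False):
    """Return (total_chars, link_chars) for the subtree rooted at node."""
    if node.tag in SKIP_TAGS:
        return 0, 0
    inside_link = inside_link or node.tag == "a"
    total = link = 0
    for child in node.children:
        if isinstance(child, str):
            total += len(child)
            link += len(child) if inside_link else 0
        else:
            sub_total, sub_link = text_stats(child, inside_link)
            total += sub_total
            link += sub_link
    return total, link


def block_score(node):
    """Non-link text length, discounted by link density (an assumed heuristic)."""
    total, link = text_stats(node)
    if total == 0:
        return 0.0
    return (total - link) * (1 - link / total)


def iter_nodes(node):
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from iter_nodes(child)


def node_text(node):
    if node.tag in SKIP_TAGS:
        return ""
    parts = [c if isinstance(c, str) else node_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def extract_main_content(html):
    """Return the text of the highest-scoring block element, or '' if none exists."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    blocks = [n for n in iter_nodes(builder.root) if n.tag in BLOCK_TAGS]
    if not blocks:
        return ""
    return node_text(max(blocks, key=block_score))


# Hypothetical sample page: a navigation bar, a story block, and a footer.
SAMPLE = """
<html><body>
<div><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About us</a></div>
<div><p>Content extraction identifies the main content of a page.</p>
<p>It operates on the DOM tree of the document.</p></div>
<div><a href="/terms">Terms</a> Copyright notice</div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_content(SAMPLE))
```

In this sketch a navigation bar made up entirely of links scores zero, while a block of plain paragraphs scores its full text length, so the story block is selected. A real page would need more features than link density alone, which is the motivation for combining several approaches.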

Keywords

Data mining, Information Extraction, Content extraction, HTML, Open source intelligence, Information filtering.
