Template Extraction from Heterogeneous Web Pages Using Text Clustering

T.L.N.Divya; G.Loshma; Dr. Nagaratna P Hegde

doi:10.14445/22312803/IJCTT-V3I3P118

Volume 3 | Issue 3 | Year 2012 | Article Id. IJCTT-V3I3P118 | DOI : https://doi.org/10.14445/22312803/IJCTT-V3I3P118

Citation :

T.L.N.Divya, G.Loshma, Dr. Nagaratna P Hegde, "Template Extraction from Heterogeneous Web Pages Using Text Clustering," International Journal of Computer Trends and Technology (IJCTT), vol. 3, no. 3, pp. 424-429, 2012. Crossref, https://doi.org/10.14445/22312803/IJCTT-V3I3P118

Abstract

Nowadays, most information is stored in text databases, which consist of large collections of documents drawn from heterogeneous web pages. We extract templates from these heterogeneous pages by measuring the similarity of the underlying template structures in the documents and clustering the web documents on that basis, so that templates are extracted from the resulting clusters. Algorithms previously used to measure the similarity between web pages include RTDM, Text-Hash and Text-Max, but these algorithms consume considerable time and space. In this paper we use the WaveK-Means algorithm to find the similarity between web pages. It requires less time and space than RTDM, Text-Hash and Text-Max. Our experimental results on real-life data sets confirm the effectiveness and robustness of our algorithm.
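
The general idea of grouping pages by the similarity of their underlying template structure can be illustrated with a minimal sketch. It is not the WaveK-Means algorithm of this paper, nor RTDM, Text-Hash or Text-Max, whose details are not given in the abstract. As a stand-in, it assumes that a page's structure can be summarised by its set of root-to-element tag paths, that similarity is the Jaccard coefficient of two such sets, and that a simple greedy pass with a threshold of 0.5 forms the clusters. The names `tag_paths`, `jaccard` and `cluster_pages`, the threshold value and the sample pages are all hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths found in one page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural fingerprint (set of tag paths) of a page."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.5):
    """Greedy clustering: a page joins the first cluster whose
    representative (its first member) is at least `threshold` similar."""
    clusters = []  # list of (representative_paths, member_indices)
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for representative, members in clusters:
            if jaccard(representative, paths) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((paths, [index]))
    return [members for _, members in clusters]


# Hypothetical sample pages: two share a layout, the third does not.
pages = [
    "<html><body><div><h1>News A</h1><p>text</p></div></body></html>",
    "<html><body><div><h1>News B</h1><p>other</p></div></body></html>",
    "<html><body><table><tr><td>1</td></tr></table></body></html>",
]
clusters = cluster_pages(pages)
print(clusters)
```

In this sketch the two pages that share a `div`/`h1`/`p` layout fall into one cluster and the table page into another, regardless of their text content. A real system would replace both the fingerprint and the clustering step with the methods the paper evaluates.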

Keywords

Template Extraction, RTDM, Text-Hash, Text-Max, WaveK-Means, Clustering.
