A Systematic Review on Web Page Classification

© 2020 by IJCTT Journal
Volume-68 Issue-4
Year of Publication : 2020
Authors : Ajose-Ismail, B.M, Osanyin Q.A
DOI : 10.14445/22312803/IJCTT-V68I4P115

How to Cite?

Ajose-Ismail, B.M, Osanyin Q.A, "A Systematic Review on Web Page Classification," International Journal of Computer Trends and Technology, vol. 68, no. 4, pp. 81-86, 2020. Crossref, https://doi.org/10.14445/22312803/IJCTT-V68I4P115

The number of digital documents on the World Wide Web is growing exponentially, and web pages and blogs have become common sources of news about current events. Aggregating and categorizing information from these sources is therefore a daunting task. Accurate classification of such documents into their respective categories brings several benefits, among them tools that help people find, filter and analyze digital information on the web. Classification accuracy depends on the quality of the training dataset, which in turn depends on the preprocessing techniques applied. The existing literature on web page classification indicates that better document representation techniques reduce training and testing time and improve the accuracy, precision and recall of the classifier. In this paper, we give an overview of web page classification with an in-depth study of the web classification process. We also draw attention to the need for an adequate document representation technique, since it helps capture the semantics of a document and helps reduce the problem of high dimensionality.

Bag-of-words model, Classification, Machine learning, Document representation, TF-IDF, Web page classification, LDA, Word2Vec.
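
To make the abstract's point about document representation concrete, the following minimal Python sketch builds a bag-of-words representation and a TF-IDF weighting for a toy corpus. It is an illustration only, not the procedure used in the paper: the three example "pages", the function names, and the choice of idf = log(N / df) (one common TF-IDF variant among several) are all assumptions made for this example.

```python
import math
import re
from collections import Counter

# Toy corpus; these three "pages" are invented for illustration only.
PAGES = [
    "football match score football league",
    "election vote parliament election",
    "football league vote",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs):
    """Return the sorted vocabulary and one raw term-count vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]


def tf_idf(docs):
    """Weight raw term counts by idf = log(N / df) (an assumed, common variant)."""
    vocab, vectors = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for v in vectors if v[i] > 0) for i in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return vocab, [[tf * w for tf, w in zip(v, idf)] for v in vectors]


vocab, weights = tf_idf(PAGES)
print(vocab)
for row in weights:
    print([round(x, 2) for x in row])
```

Each vector has one entry per vocabulary term, so its length grows with the vocabulary of the corpus. This is the high-dimensionality problem the abstract refers to, and it is one motivation for denser representations such as LDA topics or Word2Vec embeddings.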
