A Model to Detect Keyword Stuffing Spam on Webpages

Bodunde Odunola Akinyemi

doi:10.14445/22312803/IJCTT-V71I3P103

Research Article | Open Access

Volume 71 | Issue 3 | Year 2023 | Article Id. IJCTT-V71I3P103 | DOI : https://doi.org/10.14445/22312803/IJCTT-V71I3P103

Received	Revised	Accepted	Published
28 Jan 2023	04 Mar 2023	16 Mar 2023	28 Mar 2023

Citation :

Bodunde Odunola Akinyemi, "A Model to Detect Keyword Stuffing Spam on Webpages," International Journal of Computer Trends and Technology (IJCTT), vol. 71, no. 3, pp. 14-20, 2023. Crossref, https://doi.org/10.14445/22312803/IJCTT-V71I3P103

Abstract

The visibility and success of a well-designed website depend heavily on its use of keywords. Search engines rely on keyword analysis to match web pages to search queries and to rank websites. Keyword stuffing, however, is a form of spam that undermines the relevance of content, so it is imperative that web pages are optimised with appropriate keywords. This study developed a spam detection model to address keyword stuffing on webpages. The model integrates three content-analysis detection techniques: compression ratio, average length, and keyword density. The proposed approach was implemented in Python. To evaluate the model's performance, twenty webpages were selected, and the contents of five of them were altered by adding more keywords than usual. The model was run on each webpage before and after the keywords were altered. The findings showed that, before and after manipulation, the average proportion of identified keywords on the five edited sites ranged from 2% to 3%. The page-density analysis showed that the average page density ranged from 3% to 5%. The study concluded that a keyword stuffing evaluation and detection model for webpages is needed to prevent online users from being misled and to increase trust between users and search engines.
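
As a rough illustration of the three content-analysis measures named in the abstract, the following Python sketch computes them for a block of page text. It is not the study's implementation. The function names, the tokenizer, the use of zlib for compression, the reading of "average length" as average word length, and the definition of keyword density as the percentage of word tokens equal to a single-word keyword are all assumptions made for this example. The abstract gives no decision thresholds, so none are applied here.

```python
import re
import zlib


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def compression_ratio(text):
    """Uncompressed size divided by zlib-compressed size, both in bytes.

    Highly repetitive text compresses well, so it yields a larger ratio.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


def average_word_length(text):
    """Mean number of characters per word token (assumed meaning of 'average length')."""
    words = tokenize(text)
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


def keyword_density(text, keyword):
    """Percentage of word tokens that equal the given single-word keyword."""
    words = tokenize(text)
    if not words:
        return 0.0
    return 100.0 * words.count(keyword.lower()) / len(words)


def content_features(text, keyword):
    """Collect the three measures in one dictionary for a page's text."""
    return {
        "compression_ratio": compression_ratio(text),
        "average_word_length": average_word_length(text),
        "keyword_density": keyword_density(text, keyword),
    }
```

Calling `content_features(page_text, "shoes")` on a page's extracted text before and after extra keywords are added would show how each measure shifts. Multi-word keywords would need a phrase-matching step that this sketch omits.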

Keywords

Content-based, Keyword density, Keyword Stuffing, Spam, Webpages.
