Classification of Spam Categorization on Hindi Documents using Bayesian Classifier

Mr. Ishaan Tamhankar; Dr. Ashysh Chaturvedi

doi:10.14445/22312803/IJCTT-V66P102


Volume 66 | Number 1 | Year 2018 | Article Id. IJCTT-V66P102 | DOI : https://doi.org/10.14445/22312803/IJCTT-V66P102

Citation :

Mr. Ishaan Tamhankar, Dr. Ashysh Chaturvedi, "Classification of Spam Categorization on Hindi Documents using Bayesian Classifier," International Journal of Computer Trends and Technology (IJCTT), vol. 66, no. 1, pp. 8-13, 2018. Crossref, https://doi.org/10.14445/22312803/IJCTT-V66P102

Abstract

In today's e-world, most transactions and business take place through e-mail. E-mail has become a powerful communication tool because it saves time, paper, and cost. However, because of social networking sites and advertisers, many e-mails contain unwanted information, known as spam. Spam e-mails may contain text in any language [3]. Some documents on the web are written in Indian languages and may be spam e-mails. Because India has many languages, identifying spam e-mail is challenging owing to linguistic variation and language barriers. Having reviewed many research papers on e-mail spam categorization, I found that classifiers are available for other Indian languages, but no document classifier is available for Hindi. This research therefore focuses on a document classifier for Hindi spam e-mail categorization.

Keywords

Hindi Language, Naïve Bayes (NB), Document Categorization, Support Vector Machines (SVM), K-Nearest Neighbors (K-NN).
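
To make the Naïve Bayes approach named in the title and keywords concrete, the following is a minimal sketch of a multinomial Naïve Bayes spam filter with Laplace smoothing. It is not the authors' implementation: the class name `NaiveBayesSpamFilter`, the whitespace tokenizer, the smoothing value, and the four Hindi training messages are all hypothetical and invented purely for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: plain whitespace splitting. Real Hindi text would likely
    # also need punctuation handling, stop-word removal, and stemming.
    return text.split()


class NaiveBayesSpamFilter:
    """Hypothetical multinomial Naive Bayes classifier (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing constant
        self.word_counts = defaultdict(Counter)   # label -> token frequencies
        self.doc_counts = Counter()               # label -> number of documents
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokenize(text):
                if tok in self.vocab:              # ignore words never seen in training
                    score += math.log((counts[tok] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training set, made up for this sketch (not data from the paper).
train_docs = [
    "मुफ्त इनाम जीतें अभी क्लिक करें",
    "लॉटरी जीतें मुफ्त पैसा",
    "कल बैठक सुबह दस बजे है",
    "कृपया रिपोर्ट कल भेजें",
]
train_labels = ["spam", "spam", "ham", "ham"]

spam_filter = NaiveBayesSpamFilter().fit(train_docs, train_labels)
print(spam_filter.predict("मुफ्त लॉटरी इनाम"))
print(spam_filter.predict("कल बैठक है"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing constant keeps a word seen in only one class from driving the other class's probability to zero.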

References

[1] Lin SH, Chen MC, Ho JM, Huang YM. ACIRD: Intelligent Internet document organization and retrieval. IEEE Transactions on Knowledge and Data Engineering. 2002; 14(3):599–614. https://doi.org/10.1109/TKDE.2002.1000345
[2] Lee LH, Isa D. Automatically computed document dependent weighting factor facility for Naïve Bayes classification. Expert Systems with Applications. 2010; 37(12):8471–8. https://doi.org/10.1016/j.eswa.2010.05.030
[3] Zhang H. The Optimality of Naive Bayes. Barr V, Markov Z, editors. FLAIRS Conference; AAAI Press; 2004.
[4] Patil JJ, Bogiri N. Automatic text categorization Marathi documents. International Journal of Advance Research in Computer Science and Management Studies. 2015; 3(3):280–7. https://doi.org/10.1109/icesa.2015.7503438
[5] Patil M, Game P. Comparison of Marathi text classifiers. ACEEE International Journal on Information Technology. 2014; 4(1):11–22.
[6] Mandal AK, Sen R. Supervised learning method for Bangla web document categorization. International Journal of Artificial Intelligence and Applications. 2014; 5(5):93–105. https://doi.org/10.5121/ijaia.2014.5508
[7] Murthy VG, Vardhan BV, Sarangam K, Reddy PVP. A comparative study on term weighting methods for automated Telugu text categorization with effective classifiers. International Journal of Data Mining and Knowledge Management Process. 2013; 3(6):95. https://doi.org/10.5121/ijdkp.2013.3606
[8] Swamy MN, Hanumanthappa M. Indian language text representation and categorization using supervised learning algorithm. International Journal of Data Mining Techniques and Applications. 2013; 2:251–7.
[9] Naseeb N, Gupta V. Domain based classification of Punjabi text documents using ontology and hybrid based approach. Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing COLING; 2012. p. 109–122.
[10] Rajan K, Ramalingam V, Ganesan M, Palanivel S, Palaniappan B. Automatic classification of Tamil documents using vector space model and artificial neural network. Expert Systems with Applications. 2009; 36(8):10914–8. https://doi.org/10.1016/j.eswa.2009.02.010
[11] Raghuveer K, Murthy KN. Text categorization in Indian languages using machine learning approaches. IICAI; 2007. p. 1864–83.
[12] Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. 2002; 10:79–86.
[13] Rogati M, Yang Y. High-performing feature selection for text classification. Proceedings of the 11th International Conference on Information and Knowledge Management; 2002. p. 659–61. https://doi.org/10.1145/584792.584911
[14] Forman G. An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research. 2003; 3:1289–305.
[15] Tan S, Zhang J. An empirical study of sentiment analysis for Chinese documents. Expert Systems with Applications. 2008; 34(4):2622–9. https://doi.org/10.1016/j.eswa.2007.05.028
[16] Prabowo R, Thelwall M. Sentiment analysis: A combined approach. Journal of Informetrics. 2009; 3(2):143–57. https://doi.org/10.1016/j.joi.2009.01.003
[17] Alsaleem S. Automated Arabic text categorization using SVM and NB. International Arab Journal of e-Technology. 2011; 2(2):124–8.
[18] El Kourdi M, Bensaid A, Rachidi TE. Automatic Arabic document categorization based on the Naïve Bayes algorithm. Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages, Association for Computational Linguistics; 2004. p. 51–8. https://doi.org/10.3115/1621804.1621819
[19] Hadni M, Lachkar A, Ouatik SA. A new and efficient stemming technique for Arabic text categorization. 2012 International Conference on Multimedia Computing and Systems (ICMCS); 2012. p. 791–6. https://doi.org/10.1109/ICMCS.2012.6320308
[20] Harrag F, El-Qawasmah E, Al-Salman AMS. Stemming as a feature reduction technique for Arabic text categorization. 2011 10th International Symposium on Programming and Systems (ISPS); 2011. p. 128–33.
[21] Halder T, Karforma S, Mandal R. A novel data hiding approach by pixel-value-difference steganography and optimal adjustment to secure e-governance documents. Indian Journal of Science and Technology. 2015 Jul; 8(16):1–7. https://doi.org/10.17485/ijst/2015/v8i16/51269
[22] Prakash KB. Mining issues in traditional Indian web documents. Indian Journal of Science and Technology. 2015 Nov; 8(32):1–11.
[23] Antipov KV, Vinokur AI, Simakov SP, Isakov YV, Kazakova AY. Digitization of Russian parish registers of the 18-20th centuries as the contribution to the cultural foundation of historical documents. Indian Journal of Science and Technology. 2015 Dec; 8(10):1–10. https://doi.org/10.17485/ijst/2015/v8is(10)/87462
[24] Posonia AM, Jyothi VL. Context-based classification of XML documents in feature clustering. Indian Journal of Science and Technology. 2014 Jan; 7(9):1–4.
[25] Karthika S, Sairam N. A naïve Bayesian classifier for educational qualification. Indian Journal of Science and Technology. 2015 Jul; 8(16):1–5. https://doi.org/10.17485/ijst/2015/v8i16/62055
[26] Sarangi PK, Ahmed P, Ravulakollu KK. Naïve Bayes classifier with LU factorization for recognition of handwritten Odia numerals. Indian Journal of Science and Technology. 2014 Jan; 7(1):1–4.