Text Classification using Bi-Gram Alphabet Document Vector Representation

International Journal of Computer Trends and Technology (IJCTT)          
© 2018 by IJCTT Journal
Volume-60 Number-2
Year of Publication : 2018
Authors : Fatma Elghannam
DOI :  10.14445/22312803/IJCTT-V60P114


Fatma Elghannam "Text Classification using Bi-Gram Alphabet Document Vector Representation". International Journal of Computer Trends and Technology (IJCTT) V60(2):91-98 June 2018. ISSN:2231-2803. www.ijcttjournal.org. Published by Seventh Sense Research Group.

Text classification (TC) is the process of assigning text documents to appropriate categories based on their content. The high dimensionality of the feature space is a primary challenge in TC. The most common approach to TC is the bag-of-words (BOW) model, which is limited because the number of features keeps growing as the vocabulary grows. Many investigators have addressed dimensionality by applying careful preprocessing techniques that include a complex morphological phase, particularly for highly inflectional languages such as Arabic. In the present study, the term frequency of bi-gram alphabet terms is used to construct the document vector. A main contribution of the bi-gram alphabet approach is that the feature terms are standard and independent of document content; this helps to reduce the high dimensionality associated with increasing data volume. In addition, the classification process performs well on both Arabic and English collections without requiring morphological preprocessing. The proposed approach achieved high accuracy and outperformed other Arabic TC systems.
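As an illustration of the document representation described above, the following is a minimal sketch rather than the implementation used in the study. It assumes a 26-letter lowercase English alphabet, counts bi-grams only within words, and uses raw counts as the term frequency; the function name `bigram_alphabet_vector` and these choices are hypothetical. The point it shows is that the feature space is fixed by the alphabet (26 × 26 = 676 terms here) and does not grow with the vocabulary of the collection.

```python
from itertools import product
import string


def bigram_alphabet_vector(text, alphabet=string.ascii_lowercase):
    """Return raw bi-gram counts over a fixed alphabet-pair feature space.

    Assumptions (not taken from the paper): lowercase English alphabet,
    bi-grams counted within words only, no weighting or normalization.
    """
    # Fixed feature space: every ordered pair of alphabet letters.
    index = {a + b: i for i, (a, b) in enumerate(product(alphabet, repeat=2))}
    vector = [0] * len(index)

    # Any character outside the alphabet is treated as a word boundary.
    cleaned = "".join(ch if ch in alphabet else " " for ch in text.lower())

    for word in cleaned.split():
        # Slide a window of two adjacent letters across the word.
        for a, b in zip(word, word[1:]):
            vector[index[a + b]] += 1
    return vector


vec = bigram_alphabet_vector("Text classification")
print(len(vec))  # 676 features, regardless of the document's vocabulary
```

Passing a different `alphabet` string (for example, the Arabic letters) changes only the size of the fixed feature space; the resulting vectors could then be given to any standard classifier, such as a support vector machine.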

Keywords : Text classification, Arabic document, bi-gram alphabet, feature selection, support vector machine.