Different Type of Feature Selection for Text Classification

M.Ramya; J.Alwin Pinakas

doi:https://doi.org/10.14445/22312803/IJCTT-V10P118

Volume 10 | Number 1 | Year 2014 | Article Id. IJCTT-V10P118 | DOI : https://doi.org/10.14445/22312803/IJCTT-V10P118

Citation :

M.Ramya, J.Alwin Pinakas, "Different Type of Feature Selection for Text Classification," International Journal of Computer Trends and Technology (IJCTT), vol. 10, no. 1, pp. 102-107, 2014. Crossref, https://doi.org/10.14445/22312803/IJCTT-V10P118

Abstract

Text categorization is the task of deciding whether a document belongs to one of a set of pre-specified classes of documents. Automatic classification schemes can greatly facilitate the process of categorization. Categorization of documents is challenging, as the number of discriminating words can be very large. Many existing algorithms simply do not work with this many features. For most text categorization tasks, there are many irrelevant and many relevant features. The main objective is to propose a text classification method based on feature selection and preprocessing, thereby reducing the dimensionality of the feature vector and increasing the classification accuracy. In the proposed method, text preprocessing methods are applied to different datasets, and feature vectors are then extracted for each new document using various feature weighting methods to enhance the text classification accuracy. The classifier is then trained with the Naive Bayesian (NB) and K-nearest neighbor (KNN) algorithms; for KNN, the prediction is made according to the category distribution among the k nearest neighbors. Experimental results show that the methods are favorable in terms of their effectiveness and efficiency when compared with other methods.
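
The pipeline outlined in the abstract (preprocessing, feature selection, feature weighting, then KNN voting) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy corpus, the tiny stop-word list, document frequency as the selection criterion, TF-IDF as the weighting scheme, cosine similarity as the distance measure, and all function names. Only the KNN step is shown; the NB classifier is omitted.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOPWORDS = {"the", "a", "is", "of", "and", "to", "in"}

def tokenize(text):
    """Preprocessing: lowercase, strip punctuation, drop stop words."""
    tokens = (t.strip(".,;:!?()\"'") for t in text.lower().split())
    return [t for t in tokens if t and t not in STOPWORDS]

def select_features(docs, top_n):
    """Feature selection: keep the top_n terms by document frequency."""
    df = Counter(term for doc in docs for term in set(doc))
    ranked = sorted(df, key=lambda t: (-df[t], t))  # ties broken alphabetically
    return ranked[:top_n], df

def tfidf_vector(doc, vocab, df, n_docs):
    """Feature weighting: TF-IDF over the selected vocabulary (smoothed IDF)."""
    tf = Counter(doc)
    return [tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in vocab]

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_predict(query_vec, train_vecs, train_labels, k=3):
    """Majority vote over the categories of the k most similar documents."""
    sims = sorted(
        zip((cosine(query_vec, v) for v in train_vecs), train_labels),
        key=lambda pair: -pair[0],
    )
    votes = Counter(label for _, label in sims[:k])
    return votes.most_common(1)[0][0]

# Toy labelled corpus (invented for illustration only).
TRAIN = [
    ("the match ended with a late goal", "sport"),
    ("the team won the football match", "sport"),
    ("the striker scored a goal for the team", "sport"),
    ("the bank raised interest rates", "finance"),
    ("stock market and bank shares fell", "finance"),
    ("investors watch interest rates and the stock market", "finance"),
]

train_docs = [tokenize(text) for text, _ in TRAIN]
train_labels = [label for _, label in TRAIN]
vocab, df = select_features(train_docs, top_n=8)
train_vecs = [tfidf_vector(d, vocab, df, len(train_docs)) for d in train_docs]

def classify(text, k=3):
    """Run a new document through the same preprocessing and weighting, then KNN."""
    vec = tfidf_vector(tokenize(text), vocab, df, len(train_docs))
    return knn_predict(vec, train_vecs, train_labels, k)

print(classify("the team scored a late goal in the match"))
print(classify("bank shares and interest rates"))
```

Reducing the vocabulary to `top_n` terms before weighting is what shrinks the feature vector; a different selection criterion would only require replacing `select_features`.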

Keywords

Feature selection, K-Nearest Neighbor, Naïve Bayesian, Text classification.
