Semi Supervised Document Classification Model Using Artificial Neural Networks

 
International Journal of Computer Trends and Technology (IJCTT)          
 
© 2016 by IJCTT Journal
Volume-34 Number-1
Year of Publication : 2016
Authors : Dr.M.Karthikeyan
DOI : 10.14445/22312803/IJCTT-V34P109

MLA

Dr.M.Karthikeyan "Semi Supervised Document Classification Model Using Artificial Neural Networks". International Journal of Computer Trends and Technology (IJCTT) V34(1):52-58, April 2016. ISSN:2231-2803. www.ijcttjournal.org. Published by Seventh Sense Research Group.

Abstract -
Automatic document classification is of paramount importance to knowledge management in the information age. Document classification is a text data mining and organization technique that automatically groups related documents into clusters. Most common document classification techniques are based on the statistical analysis of a term, either a word or a phrase. Statistical analysis of term frequency captures the importance of a term within a document only. However, two terms can have the same frequency in their documents while one contributes more to the meaning of its sentences than the other. To address this problem, the proposed system uses an interactive text clustering methodology: a semi-supervised document classification method based on neural networks. The proposed method has two main phases: a pre-processing phase and a classification phase. In the pre-processing phase, distinct words are identified and their frequencies of occurrence in the document corpus are calculated. These distinct words, together with their frequencies of occurrence, form a document vector. In the classification phase, the back propagation algorithm classifies documents using the feature vector of distinct words. The system's efficiency is evaluated by implementing the method and testing the clustering results with the DBSCAN and k-means clustering algorithms. Experiments show that the proposed document clustering method performs with an average efficiency of 92% across various document categories.
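The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. The toy corpus, the regular-expression tokenizer, the use of raw word counts as the document vector, the single hidden layer of four sigmoid units, the learning rate, and the epoch count are all assumptions chosen for illustration. The sketch covers only supervised training on labelled documents; the interactive, semi-supervised part of the method is not described in the abstract and is not modelled here.

```python
import math
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Pre-processing: collect the distinct words of the corpus in sorted order."""
    return sorted({word for doc in documents for word in tokenize(doc)})


def document_vector(text, vocabulary):
    """Pre-processing: one raw occurrence count per distinct word."""
    counts = Counter(tokenize(text))
    return [float(counts[word]) for word in vocabulary]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class BackpropNetwork:
    """One hidden layer, sigmoid units, trained by online back propagation."""

    def __init__(self, n_inputs, n_hidden, n_outputs, seed=0):
        rng = random.Random(seed)
        # One weight row per unit; the last entry of each row is the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_output = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                         for _ in range(n_outputs)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
                  for row in self.w_hidden]
        output = [sigmoid(sum(w * v for w, v in zip(row, hidden + [1.0])))
                  for row in self.w_output]
        return hidden, output

    def train_step(self, x, target, learning_rate):
        hidden, output = self.forward(x)
        # Output deltas: error times the sigmoid derivative o * (1 - o).
        d_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, output)]
        # Hidden deltas: propagate the output deltas back through the
        # output weights, computed before those weights are updated.
        d_hid = [h * (1.0 - h) * sum(d * row[j] for d, row in zip(d_out, self.w_output))
                 for j, h in enumerate(hidden)]
        for row, d in zip(self.w_output, d_out):
            for j, v in enumerate(hidden + [1.0]):
                row[j] += learning_rate * d * v
        for row, d in zip(self.w_hidden, d_hid):
            for j, v in enumerate(x + [1.0]):
                row[j] += learning_rate * d * v

    def predict(self, x):
        _, output = self.forward(x)
        return output.index(max(output))


# Hypothetical labelled corpus: 0 = sport, 1 = finance.
corpus = [
    ("the match ended with a late goal", 0),
    ("the team won the football match", 0),
    ("the striker scored a goal for the team", 0),
    ("the bank raised the interest rate", 1),
    ("the market and the bank cut the rate", 1),
    ("interest in the stock market rose", 1),
]

vocabulary = build_vocabulary(text for text, _ in corpus)
net = BackpropNetwork(n_inputs=len(vocabulary), n_hidden=4, n_outputs=2)

for _ in range(2000):  # assumed epoch count
    for text, label in corpus:
        target = [1.0 if k == label else 0.0 for k in range(2)]
        net.train_step(document_vector(text, vocabulary), target, learning_rate=0.5)

# Classify a new document; words outside the vocabulary are ignored.
predicted = net.predict(document_vector("the team scored a late goal", vocabulary))
```

Raw counts keep the sketch close to the abstract's wording ("distinct words with their frequency of occurrences"). A real corpus would likely need stop-word removal and some form of length normalization, neither of which is shown here.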

Keywords
Artificial Neural Network (ANN), Self Organizing Map (SOM), Back Propagation Network (BPN), Term frequency, Tokenization, Structural filtering.