Mining Text Data using different Text Clustering Techniques

International Journal of Computer Trends and Technology (IJCTT)
© 2017 by IJCTT Journal
Volume-43 Number-2
Year of Publication : 2017
Authors : Ratna S. Patil, Prof. B. S. Chordia
DOI :  10.14445/22312803/IJCTT-V43P113


Ratna S. Patil, Prof. B. S. Chordia, "Mining Text Data using different Text Clustering Techniques", International Journal of Computer Trends and Technology (IJCTT), V43(2):87-93, January 2017. ISSN: 2231-2803. Published by Seventh Sense Research Group.

Abstract -
Text mining, also referred to as text data mining, is knowledge discovery from textual databases. Organizing text is a natural human practice and a crucial task for today's vast databases. Clustering does this by assessing the similarity between texts and organizing them accordingly, grouping similar texts together and separating those on different topics. Clusters provide a comprehensive logical structure that supports exploration, search and interpretation of current text documents, as well as the organization of future ones. Side information of different kinds is often available along with the text documents and may be embedded in them. However, the relative importance of this side information may be difficult to estimate. In such cases, it can be risky to include side information in the mining process, because it can either improve the quality of the representation for mining or add noise to it. Therefore, to maximize the advantage of using this side information, to minimize the time complexity of the clustering process and to remove impurity from the clusters, partition-based text clustering techniques such as the k-means and k-Windows algorithms are used. Experimental results show that the k-Windows clustering technique gives better results than the k-means clustering technique, and that side information can be used effectively for mining the data.
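
As an illustration of partition-based clustering driven by text similarity, the following is a minimal sketch of a k-means-style loop over bag-of-words vectors with cosine similarity. It is not the authors' implementation: it uses no side information and does not implement k-Windows, and the function names, the `seed_indices` parameter and the toy documents are assumptions made for this example only.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a text into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Centroid of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: weight / len(vectors) for term, weight in total.items()}


def kmeans_texts(texts, seed_indices, max_iter=20):
    """Assign each text to its most similar centroid until labels stop changing.

    seed_indices (hypothetical parameter) picks the initial centroids by hand
    so that the sketch stays deterministic.
    """
    vectors = [vectorize(t) for t in texts]
    centroids = [dict(vectors[i]) for i in seed_indices]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda c: cosine(vec, centroids[c]))
            for vec in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = mean_vector(members)
    return labels


# Toy documents (made up for this example): two finance texts, two sports texts.
docs = [
    "stock market prices fall",
    "market prices rise on stock news",
    "football team wins match",
    "team loses football match",
]
labels = kmeans_texts(docs, seed_indices=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

The two finance texts share terms such as "market" and "prices" and end up in one cluster, while the two sports texts form the other. A realistic pipeline would typically add term weighting and a principled choice of initial centroids.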

Keywords -
Clustering algorithms, Text Mining, Data Mining.