Giving Structure to Unstructured Text Data by Employing Classification

Ngetich Ngor Gogo; Matthias Daniel; Alabo Gift.

doi:10.14445/22312803/IJCTT-V69I2P104

Research Article | Open Access

Volume 69 | Issue 2 | Year 2021 | Article Id. IJCTT-V69I2P104 | DOI : https://doi.org/10.14445/22312803/IJCTT-V69I2P104

Received	Revised	Accepted
15 Dec 2020	27 Jan 2021	30 Jan 2021

Citation:

Ngetich Ngor Gogo, Matthias Daniel, Alabo Gift, "Giving Structure to Unstructured Text Data by Employing Classification," International Journal of Computer Trends and Technology (IJCTT), vol. 69, no. 2, pp. 22-28, 2021. Crossref, https://doi.org/10.14445/22312803/IJCTT-V69I2P104

Abstract

Readily available, well-managed information is essential, yet a large volume of it remains inaccessible, locked up in text documents (unstructured data). Governments, individuals, and corporate organizations could apply this information in the economy to improve quality of life and to develop better working systems. It is therefore necessary to extract this information and give it a structure that supports adequate management, storage, and access when required. The aim of this research is to implement a classification algorithm as a technique for giving structure to unstructured data (text documents). The Multinomial Naïve Bayes classifier was deployed to classify the unstructured data and thereby give it structure. The work involves two major phases: a pre-processing phase (tokenization, stemming, and stop word removal) followed by a classification phase. The results show that the system built performed well and can be used to classify text documents for proper and easy management, storage, and accessibility.
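The two phases named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the stop word list, the crude suffix-stripping stemmer (a stand-in for a real stemmer such as Porter's), the add-one smoothing, the class name `MultinomialNB`, and the four training documents are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop word list (an assumption; real lists are far longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on", "for", "by"}

def stem(token):
    """Crude suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Phase 1: tokenization, stop word removal, then stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

class MultinomialNB:
    """Phase 2: multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.term_counts = defaultdict(Counter)  # term frequencies per class
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(preprocess(doc))
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, doc):
        # Terms never seen in training are ignored.
        tokens = [t for t in preprocess(doc) if t in self.vocab]
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.term_counts[label].values())
            score = math.log(n / n_docs)  # log prior
            for t in tokens:              # smoothed log likelihoods
                count = self.term_counts[label][t]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training data, invented purely for this sketch.
train_docs = [
    "The government announced new budget spending",
    "Parliament passed the tax bill",
    "The team scored a late winning goal",
    "Players trained for the football match",
]
train_labels = ["politics", "politics", "sports", "sports"]

model = MultinomialNB().fit(train_docs, train_labels)
print(model.predict("The striker scored two goals in the match"))
```

Scores are accumulated in log space so that products of many small probabilities do not underflow, and add-one smoothing keeps a term that is absent from one class from driving that class's probability to zero.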

Keywords

Structure, Unstructured data, Classification, Multinomial Naïve Bayes classifier, Algorithm, Pre-processing
