Afaan Oromo News Text Categorization using Decision Tree Classifier and Support Vector Machine: A Machine Learning Approach

Abstract -
Afaan Oromo is one of the major African languages, widely spoken in most parts of Ethiopia and in parts of neighboring countries such as Kenya and Somalia. It is the language of the Oromo people, the largest ethnic group in Ethiopia, who make up 25.5% of the total population. Large collections of Afaan Oromo documents are available on the web, in addition to hard-copy documents in libraries and documentation centers. As the number of documents grows, identifying the documents relevant to a specific topic becomes a challenging task. A text categorization mechanism is therefore required for finding, filtering, and managing the rapidly growing body of online information. Text categorization is an important application of machine learning in the field of document information retrieval. The objective of this research is to investigate the application of machine learning techniques to the automatic categorization of Afaan Oromo news text. Two machine learning techniques, namely the Decision Tree Classifier and the Support Vector Machine, are used to categorize Afaan Oromo news texts. Annotated news texts are used to train the classifiers on six news categories: sport, business, politics, health, agriculture, and education. To design the Afaan Oromo news text categorization system, different techniques and tools are used for preprocessing, document clustering, and classifier model building. The Afaan Oromo documents are preprocessed using tokenization, stemming, and stop-word removal. A total of 824 news texts were used in this research. To obtain good results, the texts were prepared and preprocessed, and stop words were removed from the collection. 10-fold cross-validation was used for testing. The results of this research indicate that such classifiers are applicable to the automatic classification of Afaan Oromo news texts.
On the six-category data set, the best results obtained by the Decision Tree Classifier and the Support Vector Machine are 96.58% and 84.93%, respectively. This research indicates that the Decision Tree Classifier is better suited to the automatic categorization of Afaan Oromo news text.
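
The preprocessing and evaluation steps named above (tokenization, stop-word removal, stemming, and 10-fold cross-validation) can be illustrated with a minimal Python sketch. It is not the implementation used in this research: the stop-word list, the suffix list, and all function names are assumptions chosen for illustration, and a practical Afaan Oromo stemmer would need a far richer rule set.

```python
import random
import re

# ASSUMPTION: a tiny illustrative stop-word list, not the list used in the study.
STOP_WORDS = {"fi", "kan", "akka", "irraa", "keessa"}

# ASSUMPTION: a few plural suffixes for illustration only; not a real stemmer.
SUFFIXES = ("oota", "wwan", "lee")

# Words are runs of letters, optionally joined by an apostrophe (as in "ta'e").
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def stem(token):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def k_fold_indices(n_docs, k=10, seed=0):
    """Return k (train, test) pairs of document indices for cross-validation."""
    indices = list(range(n_docs))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes to the same fold, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, test))
    return splits
```

With 824 documents, `k_fold_indices(824)` yields ten splits in which each document is tested exactly once; a classifier would be trained on each `train` part and scored on the matching `test` part, with the fold scores then averaged.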

Keywords: Afaan Oromo, Text Categorization, Classification, Classifier.