Advancing Low-Resource African Language Technologies: Morphological Feature Integration for Kiswahili Question Answering

Collins S. Wanjala; Lilian Wanzare; Calvins Otieno

doi:10.14445/22312803/IJCTT-V74I5P106

Research Article | Open Access

Volume 74 | Issue 5 | Year 2026 | Article ID: IJCTT-V74I5P106 | DOI: https://doi.org/10.14445/22312803/IJCTT-V74I5P106

Received: 21 Mar 2026 | Revised: 25 Apr 2026 | Accepted: 16 May 2026 | Published: 31 May 2026

Citation:

Collins S. Wanjala, Lilian Wanzare, Calvins Otieno, "Advancing Low-Resource African Language Technologies: Morphological Feature Integration for Kiswahili Question Answering," International Journal of Computer Trends and Technology (IJCTT), vol. 74, no. 5, pp. 50-61, 2026. Crossref, https://doi.org/10.14445/22312803/IJCTT-V74I5P106

Abstract

Low-resource African languages remain critically underrepresented in natural language processing despite serving hundreds of millions of speakers across diverse linguistic communities. This paper examines whether explicit morphological feature integration can overcome the limitations of transformer models for Kiswahili, an agglutinative Bantu language spoken by over 100 million people across East and Central Africa. Kiswahili's agglutinative morphology poses a fundamental challenge for subword tokenization algorithms, which split words without regard to morpheme boundaries and leave the model to learn grammatical patterns implicitly from a small amount of data. As a baseline, vanilla XLM-RoBERTa was evaluated on the KenSwQuAD question answering dataset, achieving 20.05% F1 and 17.80% Exact Match on the validation set. This weak baseline highlights the limitations of standard multilingual methods for morphologically complex low-resource languages. The study then extended XLM-RoBERTa into a morphologically enhanced architecture with explicit representations of 17 Kiswahili morphemes, encoded as multi-hot vectors and introduced through learned projection layers, while the pre-trained encoder was kept frozen to preserve its multilingual knowledge. The optimized model achieved 72.40% F1 and 62.91% Exact Match, improvements of 52.35 and 45.11 percentage points over the baseline, respectively. Ablation studies indicated that the improvements were due to the integration of morphological features rather than to increased model capacity. This work demonstrates that explicit linguistic knowledge integration enables competitive performance even with severely limited training data, providing a reproducible framework for morphologically rich, under-resourced African languages and challenging prevailing assumptions about the universal applicability of purely data-driven approaches.
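
The multi-hot encoding and learned projection mentioned in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the feature indices, the toy hidden size, the random weights, and the choice to fuse by element-wise addition are all assumptions made for illustration. Only the count of 17 morphological features and the idea of a frozen encoder come from the abstract.

```python
import random

NUM_MORPH_FEATURES = 17  # feature count stated in the abstract
HIDDEN_SIZE = 8          # toy value (assumption); a real encoder is far wider


def multi_hot(active_ids, size=NUM_MORPH_FEATURES):
    """Return a 0/1 vector with ones at the given feature indices."""
    vec = [0.0] * size
    for i in active_ids:
        if not 0 <= i < size:
            raise ValueError(f"feature index {i} out of range")
        vec[i] = 1.0
    return vec


def project(vec, weight, bias):
    """Linear projection: out[j] = bias[j] + sum_i vec[i] * weight[i][j]."""
    return [
        bias[j] + sum(v * row[j] for v, row in zip(vec, weight))
        for j in range(len(bias))
    ]


def fuse(frozen_embedding, morph_vec, weight, bias):
    """Add projected morphological features to a token embedding.

    Additive fusion is an assumption; the abstract does not say how the
    projected features are combined with the encoder representation.
    """
    projected = project(morph_vec, weight, bias)
    return [e + p for e, p in zip(frozen_embedding, projected)]


rng = random.Random(0)

# Trainable projection parameters (randomly initialised here).
weight = [
    [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN_SIZE)]
    for _ in range(NUM_MORPH_FEATURES)
]
bias = [0.0] * HIDDEN_SIZE

# Stand-in for one token's output from the frozen encoder (never updated).
token_embedding = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_SIZE)]

# Hypothetical indices of the morphological features active on this token.
morph_vec = multi_hot([0, 5, 16])
fused = fuse(token_embedding, morph_vec, weight, bias)
```

In this sketch only `weight` and `bias` would be updated during training, which mirrors the abstract's point that the pre-trained encoder stays frozen while the projection layers are learned.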

Keywords

African Languages, Agglutinative Morphology, KenSwQuAD Dataset, Low-Resource Natural Language Processing, Question Answering Systems, Transformer Models.
