Research Article | Open Access
Volume 74 | Issue 5 | Year 2026 | Article Id. IJCTT-V74I5P106 | DOI: https://doi.org/10.14445/22312803/IJCTT-V74I5P106
Advancing Low-Resource African Language Technologies: Morphological Feature Integration for Kiswahili Question Answering
Collins S. Wanjala, Lilian Wanzare, Calvins Otieno
| Received | Revised | Accepted | Published |
|---|---|---|---|
| 21 Mar 2026 | 25 Apr 2026 | 16 May 2026 | 31 May 2026 |
Citation:
Collins S. Wanjala, Lilian Wanzare, Calvins Otieno, "Advancing Low-Resource African Language Technologies: Morphological Feature Integration for Kiswahili Question Answering," International Journal of Computer Trends and Technology (IJCTT), vol. 74, no. 5, pp. 50-61, 2026. Crossref, https://doi.org/10.14445/22312803/IJCTT-V74I5P106
Abstract
Low-resource African languages remain critically underrepresented in natural language processing despite serving hundreds of millions of speakers across diverse linguistic communities. This paper examines whether explicit morphological feature integration can overcome the limitations of transformer models for Kiswahili, an agglutinative Bantu language spoken by over 100 million people across East and Central Africa. Kiswahili's agglutinative structure poses a fundamental challenge for subword tokenization algorithms, which split words without regard to morpheme boundaries and leave the model to learn grammatical patterns implicitly from a small amount of data. A vanilla XLM-RoBERTa baseline evaluated on the KenSwQuAD question answering dataset achieved 20.05% F1 and 17.80% Exact Match on the validation set. This weak baseline highlights the significant limitations of conventional multilingual methods on morphologically complex low-resource languages. The study then extended XLM-RoBERTa into a morphologically enhanced architecture that adds explicit representations of 17 Kiswahili morphemes, encoded as multi-hot vectors and introduced through learned projection layers, while keeping the pre-trained encoder frozen to preserve its multilingual knowledge. The optimized model reached 72.40% F1 and 62.91% Exact Match, improvements of 52.35 and 45.11 percentage points over the baseline, respectively. Ablation studies indicated that the gains stem from the integration of morphological features rather than from added model capacity. This work demonstrates that integrating explicit linguistic knowledge enables competitive performance even with severely limited training data. It provides a reproducible framework for morphologically rich, under-resourced African languages and challenges the prevailing assumption that purely data-driven approaches apply universally.
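The feature-integration idea described in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. The 17 morpheme labels below are a hypothetical inventory chosen for illustration (the abstract does not list the actual morphemes), the projection weights are random stand-ins for learned parameters, and combining the projected features with the frozen encoder's token embedding by element-wise addition is an assumption, since the abstract does not state how the two are combined.

```python
import random

# Hypothetical inventory of 17 morpheme labels (illustration only;
# the paper's actual inventory is not given in the abstract).
MORPHEMES = [
    "SM_ni", "SM_u", "SM_a", "SM_tu", "SM_m", "SM_wa",   # subject markers
    "TAM_na", "TAM_li", "TAM_ta", "TAM_me",              # tense/aspect markers
    "OM_ni", "OM_ku", "OM_m",                            # object markers
    "NEG_ha",                                            # negation prefix
    "EXT_applicative", "EXT_passive", "EXT_causative",   # verbal extensions
]
INDEX = {label: i for i, label in enumerate(MORPHEMES)}


def multi_hot(tags):
    """Encode the morphemes present in one token as a 17-dim 0/1 vector."""
    vec = [0.0] * len(MORPHEMES)
    for tag in tags:
        vec[INDEX[tag]] = 1.0
    return vec


def make_projection(in_dim, out_dim, seed=0):
    """Random weights standing in for a learned projection layer (no bias)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(weights, vec):
    """Linear map from morpheme space to the encoder's hidden size."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def fuse(token_embedding, morph_vec, weights):
    """Add projected morphological features to a frozen token embedding.

    Addition is an assumed fusion rule; concatenation or gating would be
    equally plausible readings of the abstract.
    """
    projected = project(weights, morph_vec)
    return [e + p for e, p in zip(token_embedding, projected)]


if __name__ == "__main__":
    hidden_size = 4  # toy size; a real encoder's hidden size is far larger
    weights = make_projection(len(MORPHEMES), hidden_size)
    # "anasoma" (he/she is reading) = a- (subject) + -na- (present) + -som- + -a
    features = multi_hot(["SM_a", "TAM_na"])
    frozen_embedding = [0.5, -0.5, 1.0, 0.0]  # placeholder encoder output
    print(fuse(frozen_embedding, features, weights))
```

In this sketch only the projection weights would be trained, which mirrors the abstract's point that the pre-trained encoder stays frozen while the morphological pathway is learned. A token with no recognized morphemes yields an all-zero vector, so its embedding passes through unchanged.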
Keywords
African Languages, Agglutinative Morphology, KenSwQuAD Dataset, Low-Resource Natural Language Processing, Question Answering Systems, Transformer Models.