Oral in Affinity Workshop: Global South AI
Empowering NLP for African Low-Resource Languages: Leveraging Llama-2 Model for Swahili and Kenyan Dialects
Rancy Chepchirchir
Keywords: [ Kenyan dialects ] [ NLP ] [ Llama 2 ] [ Swahili ] [ Low-resource languages ]
This research focuses on improving language modeling for African low-resource languages using Meta's recently released Llama-2 model. The primary objective is to address existing challenges in natural language processing (NLP) for Swahili and other underserved Kenyan dialects. Modern neural language models demand large, data-rich training corpora, but the scarcity of linguistic data for low-resource languages such as Swahili complicates the modeling process. This study responds by curating new datasets and linguistic resources to narrow that gap. It introduces an unannotated Swahili dataset built through extensive preprocessing of raw text, together with a Swahili syllabic alphabet and a dedicated Swahili word-analogy dataset. These resources strengthen language modeling and also support downstream NLP tasks, including part-of-speech tagging, sentiment analysis, and machine translation. The study thus demonstrates the practical value of accurate language modeling for resource-constrained languages: it presents speech-to-text and question-answering systems that chart new directions for NLP applications in Swahili, and it highlights the potential of these resources to advance digital inclusion, information access, and the development of NLP methods tailored to underserved African languages.
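The abstract mentions that the unannotated Swahili dataset was built by preprocessing raw text, but does not describe the pipeline. As a hedged illustration only (the function `preprocess_corpus`, its cleaning rules, and the minimum-token threshold are hypothetical, not taken from the paper), a minimal cleaning pass over raw lines might look like:

```python
import re
import unicodedata

def preprocess_corpus(raw_lines, min_tokens=3):
    """Turn raw text lines into a deduplicated, unannotated corpus:
    Unicode-normalize, collapse whitespace, lowercase, and drop
    very short lines and exact duplicates."""
    seen = set()
    cleaned = []
    for line in raw_lines:
        # Normalize Unicode and collapse runs of whitespace.
        text = unicodedata.normalize("NFC", line)
        text = re.sub(r"\s+", " ", text).strip().lower()
        # Keep only lines with enough tokens that were not seen before.
        if len(text.split()) >= min_tokens and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

raw = [
    "Habari  za asubuhi!",
    "habari za asubuhi!",   # duplicate after normalization
    "Ndiyo",                # too short, dropped
    "Lugha ya Kiswahili inakua kila siku.",
]
print(preprocess_corpus(raw))
# → ['habari za asubuhi!', 'lugha ya kiswahili inakua kila siku.']
```

A real pipeline for Llama-2 fine-tuning would add language identification, tokenization, and filtering of non-Swahili or boilerplate text, but the deduplicate-and-normalize core above is a common starting point.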