Poster in Affinity Event: Black in AI
Evaluating Multilingual Dense Embedding Models and a Sparse Model for Information Retrieval in Yoruba: A Comparative Study
Adejumobi Joshua · Anthony Soronnadi · Olubayo Adekanmbi
Selecting an appropriate embedding model is critical for information retrieval, particularly with respect to the language of the documents. Many multilingual dense embedding models are available today, and their performance varies across languages. Although most of these models are pretrained primarily on high-resource languages, several also incorporate significant data from low-resource languages. This study assesses the performance of four open-source multilingual embedding models (BGE-M3, ML-E5-large, LaBSE, and E5-mistral-7b) on information retrieval in Yoruba, a low-resource language. As a baseline, we also evaluated the BM25 sparse retrieval model, which achieved a Mean Reciprocal Rank (MRR) of 0.40. Our results indicate that BGE-M3 is the top performer with an MRR of 0.62, with ML-E5-large close behind at 0.59. E5-mistral-7b and LaBSE follow, with MRRs of 0.43 and 0.18, respectively. Three of the four dense models therefore outperform the BM25 baseline, while LaBSE falls well below it. This analysis highlights the comparative effectiveness of dense versus sparse retrieval models in processing queries in a low-resource language like Yoruba.
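Since every result above is reported as MRR, a minimal sketch of how that metric is commonly computed may help. It assumes each query has exactly one relevant document and that a retriever returns a ranked list of document IDs. The function names and toy data are hypothetical illustrations, not the study's evaluation code or data.

```python
from typing import Hashable, Iterable, Sequence, Tuple


def reciprocal_rank(ranked_ids: Sequence[Hashable], relevant_id: Hashable) -> float:
    """Return 1/rank of the relevant document, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    runs: Iterable[Tuple[Sequence[Hashable], Hashable]]
) -> float:
    """Average the reciprocal rank over (ranked_ids, relevant_id) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


# Toy data (hypothetical): three queries, one relevant document each.
toy_runs = [
    (["d3", "d1", "d7"], "d3"),  # relevant document at rank 1 -> 1.0
    (["d2", "d5", "d9"], "d5"),  # relevant document at rank 2 -> 0.5
    (["d4", "d6", "d8"], "d0"),  # relevant document not retrieved -> 0.0
]
toy_mrr = mean_reciprocal_rank(toy_runs)  # (1.0 + 0.5 + 0.0) / 3
print(f"MRR = {toy_mrr:.2f}")
```

Under this definition, an MRR of 0.62 means the first relevant document tends to appear near the top of the ranking, whereas 0.18 means it typically appears much lower or is missed altogether.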