Poster in Workshop: Algorithmic Fairness through the lens of Metrics and Evaluation
Multilingual Hallucination Gaps in Large Language Models
ClĂ©a Chataigner · Afaf Taik · Golnoosh Farnadi
Keywords: [ NLP ] [ Evaluation Metrics and Techniques ]
Sat 14 Dec 9 a.m. PST — 5:30 p.m. PST
Large language models (LLMs) are increasingly used as alternatives to traditional search engines because of their capacity to generate human-like text. However, this shift is concerning, as LLMs often produce hallucinations: misleading or false information that appears highly credible. In this study, we explore the phenomenon of hallucinations across multiple languages, focusing on what we call multilingual hallucination gaps. These gaps reflect differences in the frequency of hallucinated answers depending on the language of the prompt. To measure hallucinations, we use the FactScore metric and extend its framework to a multilingual setting. We conduct experiments with LLMs from the LLaMA, Qwen, and Aya families, generating biographies in 19 languages and comparing the outputs against Wikipedia pages. Our results reveal significant variations in hallucination rates, especially between high- and low-resource languages, raising concerns about the factual reliability of LLMs across languages.
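As a rough illustration of the evaluation described above, the following is a minimal sketch of a FactScore-style computation and a simple per-language gap measure. The helpers `extract_atomic_facts` and `is_supported` are hypothetical stand-ins: the actual FactScore pipeline decomposes generations into atomic facts with an LLM and verifies each fact against retrieved Wikipedia passages, rather than using the naive splitting and substring matching shown here.

```python
# Sketch of a FactScore-style evaluation (hypothetical helpers, not the
# authors' implementation). FactScore is the fraction of atomic facts in
# a generation that are supported by a knowledge source (e.g., Wikipedia).

from statistics import mean


def extract_atomic_facts(biography: str) -> list[str]:
    # Hypothetical stand-in: treat each sentence as one atomic fact.
    # The real pipeline decomposes sentences into finer-grained claims.
    return [s.strip() for s in biography.split(".") if s.strip()]


def is_supported(fact: str, wiki_page: str) -> bool:
    # Hypothetical stand-in: naive substring check against the reference
    # page; the real metric uses retrieval plus an LM-based judge.
    return fact.lower() in wiki_page.lower()


def factscore(biography: str, wiki_page: str) -> float:
    # Fraction of atomic facts supported by the knowledge source.
    facts = extract_atomic_facts(biography)
    if not facts:
        return 0.0
    return mean(is_supported(f, wiki_page) for f in facts)


def hallucination_gap(scores_by_language: dict[str, float]) -> float:
    # One simple gap measure: the spread between the best- and
    # worst-scoring languages for the same model.
    return max(scores_by_language.values()) - min(scores_by_language.values())


if __name__ == "__main__":
    wiki = ("Ada Lovelace was an English mathematician. "
            "She worked on the Analytical Engine.")
    bio = ("Ada Lovelace was an English mathematician. "
           "She invented the telephone.")
    # One of the two "facts" is supported, so the score is 0.50.
    print(f"FactScore: {factscore(bio, wiki):.2f}")
```

Under this sketch, running the same evaluation over biographies generated in each of the 19 languages and passing the per-language averages to `hallucination_gap` would surface the kind of high- versus low-resource disparity the abstract reports.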