Oral in Affinity Workshop: Global South AI
Linguistic Colonialism in the Age of Large Language Models: A Need for Diverse and Inclusive Regional Language Considerations
Sundaraparipurnan Narayanan
Keywords: [ Datasets ] [ benchmark ] [ LLM ]
LLMs are contributing to a growing problem of linguistic colonialism, underscoring the need for socially responsible approaches to safeguard low-resource and regional languages. The current tendency to prioritize English-centric models, training data, and evaluation benchmark datasets threatens language equality and the preservation of linguistic diversity. Large Language Models (LLMs) such as Llama 2 and Falcon are predominantly trained on English text. The Llama 2 and Falcon models provide explicit information about the language composition of their training datasets, with English accounting for 89% and 100% of the data, respectively. The technical report for GPT-4, however, does not explicitly state the language(s) of its training data. The evaluation and testing of these models are primarily conducted in English, constraining their practical relevance to languages other than English. The benchmarks used for Llama 2 on commonsense reasoning, world knowledge, and reading comprehension were in English; similarly, the benchmarks used for GPT-4, including the Uniform Bar Exam, LSAT, and GRE, were in English. The Llama 2 paper acknowledges that, because the model was trained predominantly on English, its applicability to other languages may be compromised. The GPT-4 technical report, in contrast, notes that GPT-4 surpassed GPT-3.5's English results on MMLU in 24 of 26 non-English languages (using machine-translated benchmark datasets), without providing detailed information about the sample sizes used for the evaluation.

This study proposes a shift in perspective toward developing language models and associated benchmark datasets designed with inclusive regional language considerations. The study proposes to create separate MMLU (Massive Multitask Language Understanding) validation sets for nine languages examined in the GPT-4 evaluation: Italian, Spanish, French, German, Russian, Arabic, Bengali, Urdu, and Marathi. In addition, the research proposes to build validation sets for six additional languages: Tamil, Bahasa, Hindi, Kannada, Gujarati, and Portuguese. Regional questions and considerations will be developed for each language, rather than translated versions of the English MMLU.
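As a concrete illustration of what a natively authored, MMLU-style regional validation set might involve, the minimal Python sketch below defines an item schema and an accuracy computation. All names here (RegionalItem, score_choice, evaluate) are hypothetical stand-ins, not the study's implementation; the scoring function is a placeholder for whatever model-specific mechanism an evaluation would actually use.

```python
# Minimal sketch of an MMLU-style regional validation item and the
# accuracy computation over it. All names are hypothetical illustrations.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RegionalItem:
    """One four-option multiple-choice question, authored natively in the
    target language rather than machine-translated from the English MMLU."""
    language: str        # e.g. "ta" for Tamil
    subject: str         # MMLU-style subject, e.g. "regional_history"
    question: str
    choices: List[str]   # exactly four answer options
    answer: int          # index of the correct choice (0-3)


def evaluate(items: List[RegionalItem],
             score_choice: Callable[[str, str], float]) -> float:
    """Accuracy of a model over a validation set. score_choice(question,
    choice) stands in for any model-specific scoring function, e.g. the
    log-likelihood an LLM assigns to a candidate answer."""
    correct = 0
    for item in items:
        scores = [score_choice(item.question, c) for c in item.choices]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == item.answer)
    return correct / len(items) if items else 0.0
```

Under this sketch, a model-specific score_choice would plug into evaluate for each of the fifteen proposed language sets, allowing per-language accuracy to be compared directly rather than inferred from translated English benchmarks.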