
Despite the rapid advancements in large language model (LLM) capabilities, a significant “language gap” persists, leaving the vast majority of the world’s more than 7,000 languages underserved and creating critical disparities in AI safety. A new preprint paper by researchers primarily from Cohere and Cohere Labs meticulously outlines why this gap exists, how it is widening, and how profoundly it affects global AI safety, urging policymakers and the AI community to take concerted action.
The paper, titled “The Multilingual Divide and Its Impact on Global AI Safety,” authored by Aidan Peppin, Julia Kreutzer, Alice Schoenauer Sebag, Kelly Marchisio, Sara Hooker, and a team of collaborators, was released as a preprint on May 28, 2025. It serves as both an analysis and a call to action, highlighting that AI models predominantly reflect English and Western-centric viewpoints, thereby marginalizing other cultural perspectives and introducing safety vulnerabilities for non-English speakers.
Quoting philosopher Ludwig Wittgenstein, “The limits of my language mean the limits of my world,” the report underscores how current AI systems, by focusing primarily on a handful of globally dominant languages, are inadvertently limiting the worlds of billions. This isn’t just an issue of accessibility; it’s a critical safety concern that is often overlooked in mainstream AI safety initiatives.
The research delves into the multifaceted reasons behind the current state of multilingual AI. The language gap is not a simple oversight but a result of systemic issues:
- Resource disparities: High-quality text datasets, crucial for training capable LLMs, are overwhelmingly available for English and a few other high-resource languages, and instruction fine-tuning datasets are almost entirely English-focused. The paper illustrates this with data from Hugging Face and Wikipedia, showing a stark contrast in available resources across languages.
- The “low-resource double-bind”: Many regions with low-resource languages also face significant compute constraints, amplifying the challenge of developing and even evaluating LLMs for their specific linguistic needs.
- Socioeconomic and linguistic factors: Participation in AI research and dataset creation is skewed. Weaker economic incentives, a lack of institutional support, political instability, and high barriers to entry often hinder contributions from communities speaking less-resourced languages. The report cites examples from Cohere Labs’ own Aya initiative where geopolitical realities such as power outages and civil conflict directly impacted volunteer contributions from certain regions.
- Data quality limitations: Beyond quantity, the quality of available data for low-resource languages is often insufficient for robust model development. While data pruning techniques can improve performance, they might not generalize well across all languages.
- Lack of transparency: Many AI model developers do not clearly articulate the languages their models support or provide performance and safety evaluations across different languages. This makes fair comparisons and targeted improvements difficult.
This English-centric development means that even the underlying “concept space” of many LLMs is more aligned with English, potentially leading to biases against other cultural perspectives.
The report warns that the language gap risks creating a vicious cycle. Advanced techniques like using synthetic data (AI-generated data for training other AIs) and LLM-based evaluations inherently favor languages already well-supported, further marginalizing low-resource languages. This has several critical consequences:
- Widening cost of access: Using LLM-based technologies can be more expensive for speakers of non-English languages, particularly those written in non-Latin scripts, because their text often requires more tokens to encode the same information (as shown with ChatGPT in the paper; a token-counting sketch follows this list).
- Communities left behind: As LLMs become integral to economies and services, communities whose languages are not covered will face significant disadvantages, potentially worsening existing global inequities.
- Reduced cultural diversity: AI outputs often reflect Anglo-centric and North American viewpoints, failing to account for diverse cultural experiences, social histories, and local nuances.
- Compromised AI safety: This is a major focus of the paper. A lack of multilingual safety testing and mitigation means models can produce harmful, biased, or irrelevant outputs in non-English languages. The report cites research showing GPT-4 tends to produce more harmful content in low-resource languages. Furthermore, multilingual prompts can sometimes be used to subvert safety guardrails even in generally robust models, posing a risk to all users. Current AI safety efforts, including high-profile international initiatives, often lack explicit focus on multilingual contexts.
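To make the token-cost point concrete, here is a minimal sketch (not from the paper) that uses OpenAI’s tiktoken library to count the tokens a GPT-4-era tokenizer assigns to the same question in different languages. The sample sentences, their rough translations, and the choice of tokenizer are illustrative assumptions.

```python
# Minimal sketch (illustrative, not from the paper): counting how many tokens a
# GPT-4-era tokenizer needs for the same question in different languages.
# Requires the tiktoken package; the sample sentences are rough translations.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English":  "How do I renew my passport?",
    "Hindi":    "मैं अपना पासपोर्ट कैसे नवीनीकृत करूं?",
    "Japanese": "パスポートを更新するにはどうすればいいですか？",
}

for language, text in samples.items():
    n_tokens = len(enc.encode(text))
    # More tokens for the same request means a higher per-query cost and less
    # usable context window for speakers of that language.
    print(f"{language:8s} -> {n_tokens} tokens")
```

Non-Latin scripts typically fragment into many more tokens per word, and lower-resource scripts tend to fare even worse than the examples above, which is exactly the cost asymmetry the paper highlights.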
Lessons from Cohere Labs’ Aya initiative
The paper draws extensively on experience and lessons from Cohere Labs’ Aya initiative, a large-scale participatory machine learning research project involving over 3,000 collaborators from 119 countries. Aya aims to increase access to state-of-the-art AI for all languages. Its inaugural Aya 101 release significantly expanded language coverage in AI and was accompanied by a massive multilingual instruction fine-tuning dataset.
Key lessons highlighted from the Aya project include:
- Lesson #1: Data availability is a potent lever: Increasing data coverage by combining human-curated, synthetically generated, and translated data is more effective than solely prioritizing human annotations, especially for rare languages where volume from translation outweighs quality trade-offs.
- Lesson #2: Build evaluation sets alongside models: Creating trusted benchmarks is crucial. This involves:
- Language-parallel sets (like Global-MMLU, which translates questions across languages) for direct comparison, ideally with human post-edits to avoid “translationese.”
- Language-specific sets (like INCLUDE) to capture local nuances and cultural knowledge, addressing findings that benchmarks like MMLU have significant Western-centric biases.
- Safety evaluations that account for local context, distinguishing “global” harms from “local” ones, as done in the Aya Red-teaming dataset.
- Evaluations for generative use cases across modalities, like the Aya Vision Bench for open-ended Vision-Language Model (VLM) evaluation.
- Lesson #3: Collaborate cross-institutionally: Addressing the nuances of dialects and regional cultures requires involving local communities and multidisciplinary experts. Open science collaborations like Masakhane, NusaCrowd, and Aya itself are vital.
- Lesson #4: Focus on improving multilingual performance through technical innovation: Breakthroughs like multilingual preference training, model merging (which can build stronger, safer multilingual systems more cheaply than full fine-tuning; a minimal merging sketch follows this list), and safety context distillation (which significantly reduced harmful generations in Aya models) are critical.
- Lesson #5: Tackle harmful content as it evolves: Toxicity mitigation needs to be continuous and expand beyond English-centric approaches as language and its harmful uses change.
- Lesson #6: Access to technology matters: Development tools and data collection platforms must be accessible across various devices, operating systems, and internet connectivity levels to ensure broad global participation.
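The model-merging idea in Lesson #4 can be illustrated with its simplest variant: weighted averaging of the parameters of two fine-tuned checkpoints that share an architecture. The sketch below shows that general recipe in PyTorch; it is an assumption-laden illustration rather than the specific merging method used for the Aya models, and the checkpoint filenames are placeholders.

```python
# Minimal sketch of the simplest form of model merging: weighted averaging of
# parameter tensors from fine-tuned checkpoints that share one architecture.
# This illustrates the general idea only; it is not the Aya recipe, and the
# checkpoint filenames below are placeholders.
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Return a state dict whose tensors are the weighted average of the inputs."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        # Cast to float for averaging; real merging code would handle integer
        # buffers and mismatched keys more carefully.
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Hypothetical usage: blend a multilingual instruction-tuned checkpoint with a
# safety-tuned one instead of running another full fine-tuning pass.
sd_multilingual = torch.load("multilingual_sft.pt", map_location="cpu")
sd_safety = torch.load("safety_sft.pt", map_location="cpu")
merged = merge_state_dicts([sd_multilingual, sd_safety], weights=[0.6, 0.4])
torch.save(merged, "merged_model.pt")
```

Because merging only averages existing weights, it costs a fraction of a fine-tuning run, which is why the paper flags it as an attractive route to stronger, safer multilingual systems.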
Key barriers to multilingual AI safety
The paper articulates three overarching barriers that hinder efforts to close the language divide and the associated AI safety gaps:
- Building high-quality datasets and curating evaluations with fluent speakers is resource-intensive but critical.
- Access to compute power is uneven globally, often reinforcing the “low-resource double-bind.”
- Capturing not just language but also the nuances of culture and dialect is a complex but essential challenge, as languages are diverse and not monolithic.
To address these challenges and foster a more inclusive and safe AI ecosystem, the researchers offer specific recommendations for policymakers and the broader research community:
- Support multilingual dataset creation:
- Incentivize and facilitate open-access evaluation sets reflecting diverse use cases (generative, safety-relevant, multimodal), using both translation (language-parallel) and localized creation (language-specific).
- Fund long-term annotation efforts, especially for endangered languages, enabling diverse multilingual and multicultural experts to curate inclusive datasets.
- Support multilingual transparency from model providers:
- Encourage model providers to clearly report language coverage and performance for each model family in technical reports.
- Conduct analyses of language coverage within safety research, assessing the presence or absence of safety mitigations across languages in published reports.
- Support multilingual research and development:
- Ensure diverse languages are represented in training programs that expand skills for community engagement, data collection, and model training.
- Fund multilingual and non-English research aimed at closing the language gap.
- Enable greater access to compute resources for multilingual safety research, especially in underserved regions.