Tokenization Challenges Hinder Multilingual LLMs
A 2026 study covering 340+ Wikipedia languages traces the poor performance of low-resource language models to suboptimal tokenization.
Training a language model for a specific language can be a frustrating process. Despite meticulous data curation, careful architecture selection, and tuning, the model may still fail to perform as expected. The cause is often tokenization, the process of breaking text into units the model can work with. In 2023, Omar Kamali trained the first version of Sawalni.ma, a model for Moroccan Arabic and Amazigh, but its output quality fell short of what the quality of the training data should have supported. This led to a years-long investigation and the creation of Wikilangs, a collection of 1800+ NLP models across 340+ Wikipedia languages. The pattern repeated: every language that struggled did so first at the token boundary. Tokenization is the tax that low-resource languages cannot afford to pay, and they're being charged it on every token, in every layer, for every variant their speakers write. *Source: [huggingface](https://huggingface.co/blog/omarkamali/tokenization)*
More precisely, tokenization converts raw text into the sequence of numbers (token IDs) that a language model actually processes and generates. When the tokenizer's cuts land on meaningful units, the model can reconstruct meaning quickly. When the cuts are arbitrary, the model has to do extra work to figure out what it is looking at before it can begin to understand it. The result is a model that hallucinates structure, loses track of morphology mid-sentence, or fails on inputs that are trivially easy for a native speaker. A tokenizer can produce useful segmentation, but it can also produce fragments that correspond to no real word. Tokens are not only the input and output units; they are also the building blocks the model uses to construct meaning internally. If the tokens are not aligned with the language's structure, the model's understanding can be flawed. *Source: [huggingface](https://huggingface.co/blog/omarkamali/tokenization)*
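To make the idea of arbitrary cuts concrete, here is a minimal sketch of a greedy longest-match tokenizer with UTF-8 byte fallback. It is not the tokenizer used in the study; the three-entry vocabulary `VOCAB`, the ID scheme, and the function name `tokenize` are all hypothetical, chosen only to show how a vocabulary built around one language segments another.

```python
# Hypothetical English-centric vocabulary. IDs 0..255 are reserved for raw bytes.
VOCAB = {"un": 256, "break": 257, "able": 258}


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match segmentation; unmatched characters fall back to UTF-8 bytes."""
    ids = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # No vocabulary entry matches: emit one ID per byte of this character.
            ids.extend(text[i].encode("utf-8"))
            i += 1
    return ids


english = tokenize("unbreakable")  # three morpheme-aligned tokens
arabic = tokenize("كتاب")          # four letters, two bytes each
print(len(english), len(arabic))
```

Under these assumptions the English word maps to three tokens that each carry meaning, while the four-letter Arabic word becomes eight byte-level tokens, none of which corresponds to a letter, let alone a morpheme.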
Creating a custom tokenizer for a specific language can improve performance, but it is a local optimization, not a comprehensive solution. Custom tokenizers reduce fertility (the average number of tokens produced per word) and partially improve boundary coherence, but they do nothing for variant recovery, that is, for handling the different written variants speakers use. They also destroy cross-lingual alignment, making multilingual transfer difficult. The field has tried methods such as vocabulary expansion and bilingual tokenizer training, but none of them compose well. The dream of a single model handling multiple languages remains blocked at the tokenization layer: adding tokens for each language significantly increases model size and slows generation, because the output softmax is computed over the entire vocabulary at every step. *Source: [huggingface](https://huggingface.co/blog/omarkamali/tokenization)*
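Two of the quantities in this paragraph are simple arithmetic, sketched below. Fertility is computed here as tokens per whitespace-separated word, which is one common convention and an assumption about how the study measures it. The vocabulary cost counts only embedding parameters. The helper names, the toy tokenizers, and the figures (50,000 new tokens, hidden size 4096) are illustrative and do not come from the source.

```python
def fertility(text, tokenize):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(tokenize(word)) for word in words) / len(words)


def added_vocab_params(new_tokens, hidden_size, tied_embeddings=False):
    """Extra embedding parameters from growing the vocabulary by `new_tokens` entries."""
    # One input embedding matrix, plus a separate output projection unless weights are tied.
    matrices = 1 if tied_embeddings else 2
    return new_tokens * hidden_size * matrices


# Toy tokenizers (hypothetical): worst case is one token per character,
# best case is one token per word.
def char_level(word):
    return list(word)


def word_level(word):
    return [word]


sample = "tokenization is a tax"
print(fertility(sample, char_level))     # 18 characters over 4 words
print(fertility(sample, word_level))     # exactly one token per word
print(added_vocab_params(50_000, 4096))  # untied input and output embeddings
```

With these illustrative numbers, 50,000 added tokens cost about 410 million parameters when input and output embeddings are untied, and the output layer must also score 50,000 more candidates at every generation step.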
Key points
- Omar Kamali trained the first version of Sawalni.ma in 2023 for Moroccan Arabic and Amazigh.
- Wikilangs includes 1800+ NLP models across 340+ Wikipedia languages.
- Tokenization is the tax that low-resource languages cannot afford to pay.
- Tokenization converts raw text into numbers a language model actually works with.
- Custom tokenizers reduce fertility and partially improve boundary coherence.
- Adding tokens for each language increases the model size significantly.
- The dream of a single model handling multiple languages remains blocked at the tokenization layer.