Model Release
Hugging Face Releases Borealis, 5B Audio-Language Model for Russian and English
Hugging Face announced Borealis, a 5B-parameter open-source audio-language model trained on Russian and English data that achieves 20.88% WER on Russian benchmarks.
Hugging Face has released Borealis, an open-source audio-language model for Russian and English. The 5B-parameter model pairs a Whisper Large V3 audio encoder with a Qwen3-4B LLM backbone, connected by an adapter. According to the company, Borealis was trained from scratch and is an open take on Voxtral / Flamingo-audio. It is designed to summarize long recordings, answer questions about their content, and reason about tone and emotion.

The architecture follows a three-part recipe: a strong audio encoder, a strong LLM, and an adapter between them. Audio is fed in at 16 kHz, and the encoder output passes through a 4× downsampler and an MLP adapter before reaching the LLM. The company noted that the audio encoder is frozen to preserve ASR quality, while the adapter is trained to align the model to the target language.

Training data was assembled from several pools, including audiobooks and instructions generated with Gemini 2.5 Pro. The experiments addressed three questions: how much the language of the training data matters, what effect adding plain-text instructions has, and which ratios of languages and text data work best. Training data language had a significant impact on performance, with the English-only configuration reaching 20.88% WER on Russian benchmarks. Plain-text instructions had a non-linear effect: a 10–15% share of text helped, while a 25% share degraded performance. Webinar audio remained a persistent weak spot, with Borealis reaching 60% WER on webinars compared with 7.77% for plain Whisper.

For serving, the transformers integration processes the encoder and the LLM asynchronously, with the adapter implemented as a simple MLP. The company also detailed its augmentation pipeline and the vLLM plugin used to speed up inference.

*Source: [huggingface](https://huggingface.co/blog/AlexWortega/borealis)*
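To give a feel for what a 4× downsampler does to the audio sequence, here is a minimal pure-Python sketch. It rests on assumptions: the article does not say how Borealis implements the downsampler, so frame stacking (concatenating consecutive encoder frames) is used here only as one common way to do it. The 50 frames-per-second figure comes from Whisper's published design (1,500 encoder positions per 30-second window). The function names are hypothetical.

```python
from typing import List

SAMPLE_RATE_HZ = 16_000       # input sample rate stated in the article
ENCODER_FRAMES_PER_SEC = 50   # Whisper encoder: 1500 positions per 30 s window
DOWNSAMPLE = 4                # 4x downsampler stated in the article


def llm_audio_tokens(num_samples: int) -> int:
    """Approximate number of audio embeddings passed to the LLM."""
    seconds = num_samples / SAMPLE_RATE_HZ
    encoder_frames = int(seconds * ENCODER_FRAMES_PER_SEC)
    return encoder_frames // DOWNSAMPLE


def stack_frames(frames: List[List[float]], factor: int = DOWNSAMPLE) -> List[List[float]]:
    """Concatenate every `factor` consecutive frames into one wider frame.

    Assumption: frame stacking stands in for the real downsampler.
    Trailing frames that do not fill a complete group are dropped.
    """
    usable = len(frames) - len(frames) % factor
    return [
        [value for frame in frames[i:i + factor] for value in frame]
        for i in range(0, usable, factor)
    ]


if __name__ == "__main__":
    # A full 30-second window at 16 kHz.
    print(llm_audio_tokens(30 * SAMPLE_RATE_HZ))
    # Nine toy 2-dimensional frames become two 8-dimensional frames.
    toy = [[float(i), float(i) + 0.5] for i in range(9)]
    print(len(stack_frames(toy)), len(stack_frames(toy)[0]))
```

Under these assumptions, a 30-second window shrinks from 1,500 encoder frames to 375 embeddings, each four times wider, which an MLP adapter would then project to the LLM's hidden size.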
Key points
- Hugging Face released Borealis, a 5B-parameter open-source audio-language model trained on Russian and English data.
- Borealis achieved 20.88% WER on Russian benchmarks when trained with English-only data.
- The model uses a frozen Whisper Large V3 encoder and a Qwen3-4B LLM with an adapter between them.
- Adding plain-text instructions helped when text made up 10–15% of the training data but degraded performance at 25%.
- Borealis struggles with webinars, reaching 60% WER compared with 7.77% for plain Whisper.
- The model's adapter is a simple MLP with 31M parameters for the Whisper-large × Qwen3-4B setup.
- Hugging Face developed a vLLM plugin to integrate Borealis with the framework for faster inference.
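
The WER figures quoted above are word error rates: the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the reference length. The following is a minimal sketch of that calculation, not the evaluation code behind the reported numbers. The text normalization (lowercasing and whitespace splitting) is an assumption, and published scores depend on whichever normalization the benchmark applies.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 100% when a transcript contains many extra words.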