Model Release

NVIDIA Introduces Nemotron 3.5 ASR for Multilingual Speech Recognition

NVIDIA's Nemotron 3.5 ASR transcribes 40 language-locales in real time, with a 0.07-second delay after speech ends that ranks second among streaming ASR models benchmarked by Artificial Analysis.

NVIDIA has released Nemotron 3.5 ASR, a 600M-parameter speech-to-text model that transcribes 40 language-locales in real time. Punctuation and capitalization are built in, which streamlines the transcription process. The model succeeds the popular Nemotron 3 ASR, which was released earlier this year and is available on Hugging Face. It addresses several challenges in multilingual speech recognition by offering a single, unified model for diverse language support. Source: Hugging Face

Nemotron 3.5 ASR uses a Cache-Aware FastConformer-RNNT architecture that enables real-time streaming without redundant recomputation. The Cache-Aware FastConformer encoder caches internal states from previous frames, so incoming audio is processed without recomputing earlier context. This reduces compute and end-to-end latency without compromising accuracy. The RNNT decoder emits text frame by frame as audio streams in, which suits live transcription. The model also supports prompt-based language-ID conditioning, so the language can be either detected or specified. Source: Hugging Face
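The caching idea can be illustrated with a toy example. The sketch below is not NVIDIA's implementation: it stands in a single causal filter for an encoder layer, and the names `causal_filter_offline` and `CachedCausalFilter` are hypothetical. It only shows how keeping the last few input frames as a cache lets each new chunk be processed once, while producing the same output as processing the whole utterance at once.

```python
from collections import deque
from typing import List, Sequence


def causal_filter_offline(frames: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Reference path: filter the whole utterance at once with zero left-padding."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(frames))
    ]


class CachedCausalFilter:
    """Toy streaming layer (hypothetical) that caches the last k-1 input frames."""

    def __init__(self, kernel: Sequence[float]) -> None:
        self.kernel = list(kernel)
        # The cache starts as zeros, matching the offline zero left-padding.
        size = len(self.kernel) - 1
        self.cache = deque([0.0] * size, maxlen=size)

    def process_chunk(self, chunk: Sequence[float]) -> List[float]:
        k = len(self.kernel)
        # Only the cached left context plus the new frames are touched;
        # earlier outputs are never recomputed.
        window = list(self.cache) + list(chunk)
        out = [
            sum(self.kernel[j] * window[i + j] for j in range(k))
            for i in range(len(chunk))
        ]
        # Keep the most recent k-1 frames as context for the next chunk.
        self.cache.extend(chunk)
        return out


if __name__ == "__main__":
    frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    kernel = [0.2, 0.3, 0.5]

    layer = CachedCausalFilter(kernel)
    streamed: List[float] = []
    for start in range(0, len(frames), 2):  # feed audio two frames at a time
        streamed.extend(layer.process_chunk(frames[start:start + 2]))

    print(streamed == causal_filter_offline(frames, kernel))
```

The chunk size does not affect the result in this sketch, because the cache always supplies the left context that a causal layer would otherwise have to recompute from earlier audio.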

Nemotron 3.5 ASR was independently benchmarked by Artificial Analysis, where it ranks second in latency among all streaming ASR models, with a 0.07-second delay after speech ends. It sits in the 'most attractive quadrant' of the AA-WER Streaming Index vs. Time to Final Transcription leaderboard, indicating a strong balance between accuracy and latency. Source: Hugging Face

Key points

NVIDIA's Nemotron 3.5 ASR transcribes 40 language-locales in real time with punctuation and capitalization built in.
The model uses a Cache-Aware FastConformer-RNNT architecture that streams audio without redundant recomputation.
Nemotron 3.5 ASR ranks second in latency among all streaming ASR models benchmarked by Artificial Analysis, with a 0.07-second delay after speech ends.
Its Cache-Aware FastConformer encoder caches internal states from previous frames, which reduces compute and end-to-end latency.
Artificial Analysis places Nemotron 3.5 ASR in the 'most attractive quadrant' of its AA-WER Streaming Index vs. Time to Final Transcription leaderboard.

Source: Hugging Face

WRITTEN BY

Alex Lindgren

LLMs & Frontier Models

Alex covers large language models and their impact on society.
