NVIDIA has released Nemotron 3.5 ASR, a 600M-parameter speech-to-text model that transcribes 40 language-locales in real time. Its transcripts come with punctuation and capitalization built in, which streamlines downstream processing. It is the successor to the popular Nemotron 3 ASR model, which was released earlier this year and is available on Hugging Face. The new model targets the challenges of multilingual speech recognition by serving all supported language-locales from a single, unified model. Source: huggingface
Nemotron 3.5 ASR uses a Cache-Aware FastConformer-RNNT architecture built for real-time streaming. The Cache-Aware FastConformer encoder caches internal states from previous frames, so new audio is processed without redundantly recomputing what came before; this reduces compute and end-to-end latency without compromising accuracy. The RNNT decoder emits text frame by frame as audio streams in, which suits live transcription. The model also supports prompt-based language-ID conditioning, so the language can either be specified up front or detected automatically. Source: huggingface
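To make the caching idea concrete, here is a minimal toy sketch in plain Python. It is not NVIDIA's implementation and does not use the NeMo API: the filter taps in `KERNEL`, the `causal_conv` helper, and the `StreamingLayer` class are all hypothetical names chosen for illustration. It shows a single causal layer that keeps the last few input frames as a cache, so each incoming chunk is processed once and the streamed output matches what an offline pass over the full audio would produce.

```python
from typing import List, Sequence

# Hypothetical causal filter taps, newest frame first (illustration only).
KERNEL = [0.5, 0.3, 0.2]


def causal_conv(frames: Sequence[float], left_context: Sequence[float]) -> List[float]:
    """Causal convolution over `frames`.

    `left_context` holds the len(KERNEL) - 1 frames that preceded `frames`.
    """
    padded = list(left_context) + list(frames)
    k = len(KERNEL)
    out = []
    for i in range(len(frames)):
        window = padded[i:i + k]  # oldest..newest, ending at frames[i]
        out.append(sum(w * x for w, x in zip(KERNEL, reversed(window))))
    return out


class StreamingLayer:
    """Toy stand-in for one cache-aware layer: remembers only the frames it still needs."""

    def __init__(self) -> None:
        self.cache: List[float] = [0.0] * (len(KERNEL) - 1)

    def step(self, chunk: Sequence[float]) -> List[float]:
        out = causal_conv(chunk, self.cache)
        # Keep just the trailing frames needed as left context for the next chunk.
        keep = len(KERNEL) - 1
        tail = self.cache + list(chunk)
        self.cache = tail[len(tail) - keep:]
        return out


# Offline pass: the whole signal at once, with zero left context.
audio = [float(i) for i in range(10)]
offline = causal_conv(audio, [0.0] * (len(KERNEL) - 1))

# Streaming pass: chunks of 4 frames, each frame processed exactly once.
layer = StreamingLayer()
streamed: List[float] = []
for start in range(0, len(audio), 4):
    streamed += layer.step(audio[start:start + 4])

print(streamed == offline)
```

A real cache-aware encoder would cache activations for every convolution and attention layer rather than raw input frames, but the principle is the same: a bounded cache replaces re-running the model over audio it has already seen.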
Independent benchmarks from Artificial Analysis place Nemotron 3.5 ASR second in latency among all streaming ASR models, with a 0.07-second delay after speech ends. The model also sits in the 'most attractive quadrant' of the AA-WER Streaming Index vs. Time to Final Transcription leaderboard, indicating a strong balance between accuracy and latency. Source: huggingface