Model Release
NVIDIA Introduces Nemotron-Labs Diffusion Language Models
NVIDIA released Nemotron-Labs Diffusion models with 3B, 8B, and 14B parameters, offering faster inference speeds and new generation modes.
NVIDIA has introduced Nemotron-Labs Diffusion, a new family of diffusion language models (DLMs) designed to make text generation more efficient. Rather than emitting one token at a time, these models generate multiple tokens in parallel and iteratively refine them, so they need fewer forward passes than traditional autoregressive models to produce the same amount of text. The family includes text models at 3B, 8B, and 14B scales, along with an 8B vision-language model (VLM), all available under NVIDIA licenses.

The models support three generation modes: autoregressive, diffusion, and self-speculation. Diffusion mode achieves 2.6× higher tokens per forward pass (TPF) than autoregressive models, while self-speculation reaches 6.4× TPF with comparable accuracy.

NVIDIA also released training code through the Megatron Bridge framework, so developers can train and fine-tune the models. Deployment is supported in the main branch of SGLang, with inference support tracked through a GitHub issue. The models can be served in any of the three modes by changing a single line in the algorithm configuration.

*Source: [huggingface](https://huggingface.co/blog/nvidia/nemotron-labs-diffusion)*
Key points
- NVIDIA released Nemotron-Labs Diffusion models with 3B, 8B, and 14B parameters.
- Nemotron-Labs Diffusion models generate multiple tokens in parallel, offering faster inference than autoregressive models.
- The diffusion mode achieves 2.6× higher tokens per forward pass (TPF) than autoregressive models.
- Self-speculation mode reaches 6.4× TPF with comparable accuracy across evaluated tasks.
- NVIDIA released training code through the Megatron Bridge framework for these models.
- Deployment of the models is supported in the main branch of SGLang.
- The models can be served in three different ways by adjusting a single line in the algorithm configuration.
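
To make the TPF figures concrete, the sketch below converts a TPF multiplier into an approximate number of forward passes for a fixed output length. It assumes the autoregressive baseline produces exactly one token per forward pass, and the helper name `forward_passes_needed` and the 1,000-token output length are illustrative choices, not part of NVIDIA's release. TPF counts passes only; it does not by itself give wall-clock speedup, since the cost of a single pass can differ between modes.

```python
import math

# Assumption: plain autoregressive decoding yields one token per forward pass.
AR_TPF = 1.0
# Multipliers reported in the article, applied to the assumed baseline.
DIFFUSION_TPF = 2.6 * AR_TPF
SELF_SPECULATION_TPF = 6.4 * AR_TPF


def forward_passes_needed(num_tokens: int, tokens_per_forward: float) -> int:
    """Estimate forward passes required to emit num_tokens at a given average TPF."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if tokens_per_forward <= 0:
        raise ValueError("tokens_per_forward must be positive")
    # Round up: a partial pass still costs a full forward pass.
    return math.ceil(num_tokens / tokens_per_forward)


if __name__ == "__main__":
    output_length = 1000  # hypothetical response length in tokens
    for mode, tpf in [
        ("autoregressive", AR_TPF),
        ("diffusion", DIFFUSION_TPF),
        ("self-speculation", SELF_SPECULATION_TPF),
    ]:
        passes = forward_passes_needed(output_length, tpf)
        print(f"{mode}: about {passes} forward passes")
```

Under these assumptions, a 1,000-token response takes roughly 1,000 passes autoregressively, about 385 in diffusion mode, and about 157 with self-speculation.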