Google has released DiffusionGemma, an experimental open-source language model that generates text by starting with a block of 256 random tokens and refining them over several passes until readable text emerges. This approach, inspired by the diffusion models used for image generation, lets the model produce a whole block of text at once rather than word by word. According to Google, the model runs up to four times faster in single-user mode on dedicated GPUs than conventional language models. The speed gain is attributed to better utilization of the GPU's compute units, which stay busy during inference instead of waiting for data from memory.
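The refinement loop described above can be pictured with a toy sketch. This is not Google's algorithm: the function `toy_refine`, the tiny vocabulary, the pass count of 8, and the use of a known target sequence in place of a trained denoiser are all assumptions made purely for illustration. Only the block size of 256 comes from the article.

```python
import random

BLOCK_SIZE = 256  # block length reported in the article


def toy_refine(target, vocab, passes=8, seed=0):
    """Toy stand-in for diffusion-style decoding (hypothetical, not the real model).

    Starts from random tokens and, on each pass, commits a share of the
    still-unresolved positions. A real model would predict those tokens;
    here the known `target` plays the role of the denoiser's prediction.
    """
    rng = random.Random(seed)
    n = len(target)
    block = [rng.choice(vocab) for _ in range(n)]  # pure noise to start
    unresolved = list(range(n))
    rng.shuffle(unresolved)  # positions are resolved in random order
    history = [list(block)]
    for p in range(passes):
        remaining_passes = passes - p
        # Ceiling division so every position is resolved by the final pass.
        share = -(-len(unresolved) // remaining_passes)
        for _ in range(share):
            i = unresolved.pop()
            block[i] = target[i]
        history.append(list(block))
    return block, history


vocab = ["the", "model", "refines", "noise", "into", "text"]
target = [vocab[i % len(vocab)] for i in range(BLOCK_SIZE)]
final, history = toy_refine(target, vocab, passes=8)

# Number of positions matching the target after each pass.
correct_per_pass = [sum(a == b for a, b in zip(step, target)) for step in history]
print(correct_per_pass)
```

Every position in the block is updated within the same pass, which is the property that lets the hardware work on many tokens in parallel instead of emitting them one at a time.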
DiffusionGemma processes up to 256 tokens in parallel, which shifts the bottleneck from memory bandwidth to raw compute power. Nvidia reports that the model achieves about 700 tokens per second on a GeForce RTX 5090 and up to 800 tokens per second on the DGX Station. In Google's own benchmarks, DiffusionGemma runs about three and a half times faster than a same-size Gemma 4 model but scores slightly lower on accuracy tests. The model has 26 billion parameters in total, of which only 3.8 billion are activated per step thanks to a mixture-of-experts architecture.
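A rough back-of-envelope calculation shows why parallel blocks ease the memory-bandwidth bottleneck. The sketch below is a simplification under stated assumptions: it counts only how often the active weights must be read per generated token, ignores mixture-of-experts routing and caching effects, and uses a hypothetical figure of 16 refinement passes, since the article says only "several". The helper names are illustrative.

```python
def weight_sweeps_per_token(block_size, refinement_passes):
    """Sweeps over the active weights needed per generated token.

    Simplified model: each pass reads the active weights once and
    updates `block_size` tokens at the same time.
    """
    return refinement_passes / block_size


def active_fraction(active_params_b, total_params_b):
    """Share of parameters used per step in a mixture-of-experts model."""
    return active_params_b / total_params_b


# Word-by-word decoding: one sweep over the weights yields one token.
autoregressive = weight_sweeps_per_token(block_size=1, refinement_passes=1)

# Block decoding: 256 tokens per block (from the article),
# 16 passes (hypothetical value chosen for illustration).
block_parallel = weight_sweeps_per_token(block_size=256, refinement_passes=16)

print(autoregressive, block_parallel)
print(round(active_fraction(3.8, 26.0), 3))  # figures from the article
```

Under these assumptions the block approach reads the weights far less often per token, so memory traffic stops being the limiting factor and the arithmetic units do. The real ratio depends on the actual number of passes and on how the experts are routed.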
Google positions DiffusionGemma as a tool for researchers and developers experimenting with fast, local workflows, while recommending traditional Gemma 4 models for tasks where output quality is critical. The model is available on Hugging Face under an Apache 2.0 license and works with common inference libraries like Hugging Face Transformers and vLLM. It also supports fine-tuning through tools like Hackable Diffusion, Unsloth, and Nvidia NeMo Framework. Nvidia has optimized the model for RTX 5090 and 4090 GPUs, as well as Hopper and Blackwell server architectures.
Source: thedecoder