Model Release

DeepMind Releases DiffusionGemma, 4x Faster Text Generation Model

DeepMind announced DiffusionGemma, an experimental open model that generates text up to 4x faster on GPUs, with reported throughput above 1,000 tokens per second on an NVIDIA H100.


DeepMind has introduced DiffusionGemma, an experimental open model built for fast text generation. Released under an Apache 2.0 license, it is designed to deliver up to 4x faster inference on dedicated GPUs. That speed targets interactive local workflows where latency matters, such as inline editing and rapid iteration. The model uses a novel diffusion head to maximize generation speed, which makes it a fit for developers exploring fast, parallel text generation. According to DeepMind, DiffusionGemma is a Mixture of Experts (MoE) model with 26B total parameters, of which only 3.8B are active during inference. When quantized, this design lets it fit within the 18GB VRAM limits of high-end consumer GPUs. The model generates 256 tokens in parallel with each forward pass, and every token can attend to all the others, which is a significant advantage in non-linear domains like code infilling and mathematical graphs. Source: Google DeepMind
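The VRAM figure can be sanity-checked with rough arithmetic. The sketch below is illustrative only: the bit widths are assumptions (the article does not say which quantization format DeepMind has in mind), and it counts weight storage alone, ignoring the KV cache, activations, and runtime overhead. It uses the total parameter count because an MoE model typically keeps every expert resident in memory, even though only a subset is active for any given token.

```python
# Back-of-the-envelope weight memory for a 26B-parameter model.
# Assumption: the bit widths below are illustrative, not confirmed by DeepMind.

TOTAL_PARAMS = 26e9    # total parameters, from the article
VRAM_LIMIT_GB = 18     # VRAM budget cited in the article


def weight_memory_gb(total_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    bytes_total = total_params * bits_per_weight / 8
    return bytes_total / 1e9


for bits in (16, 8, 4):  # assumed precisions
    gb = weight_memory_gb(TOTAL_PARAMS, bits)
    fits = gb <= VRAM_LIMIT_GB
    print(f"{bits}-bit weights: {gb:.1f} GB, fits in {VRAM_LIMIT_GB} GB: {fits}")
```

Under these assumptions, only the 4-bit case (about 13 GB of weights) leaves room inside an 18GB budget, which is consistent with the article's statement that the model fits "when quantized".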

DiffusionGemma's approach to text generation differs from that of traditional autoregressive models. Most language models generate text sequentially, one token at a time, whereas DiffusionGemma drafts an entire 256-token paragraph simultaneously. This uses hardware more efficiently because the processor receives a larger chunk of work at once. The speedup is most beneficial for local and low-concurrency inference, and it may offer diminishing returns in high-QPS cloud serving environments. DeepMind emphasized that DiffusionGemma prioritizes speed and that its overall output quality is lower than that of the standard Gemma 4 model. For applications requiring maximum quality, the company recommends deploying the standard Gemma 4. Source: Google DeepMind
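The difference between sequential and block-parallel decoding can be illustrated by counting forward passes. This is a toy sketch, not DeepMind's method: the 256-token block size comes from the article, but `refinement_steps` is a hypothetical parameter. Diffusion-style decoders generally refine a block over several passes, and the article does not say how many DiffusionGemma uses.

```python
import math

BLOCK_SIZE = 256  # tokens produced per forward pass, from the article


def autoregressive_passes(num_tokens: int) -> int:
    """A sequential decoder runs one forward pass per generated token."""
    return num_tokens


def block_parallel_passes(num_tokens: int, refinement_steps: int,
                          block_size: int = BLOCK_SIZE) -> int:
    """Passes for a block-parallel decoder.

    refinement_steps is a hypothetical number of passes spent per block;
    the real value for DiffusionGemma is not given in the article.
    """
    blocks = math.ceil(num_tokens / block_size)
    return blocks * refinement_steps


tokens = 1024
print("sequential passes:", autoregressive_passes(tokens))
print("block-parallel passes:", block_parallel_passes(tokens, refinement_steps=32))
```

Pass counts are not the same as wall-clock speedup. Each parallel pass does more work than a single-token pass, so the ratio here should not be read as an explanation of the reported 4x figure. The sketch only shows why handing the GPU a whole block at once reduces the number of sequential steps.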

DiffusionGemma builds on earlier research in diffusion-based text generation. The AI community has explored the approach for years, but applying it to large models has proved difficult. DiffusionGemma addresses this by changing how the model uses hardware, enabling parallel decoding and efficient local inference. Because the architecture processes the entire paragraph while generating, DeepMind says it unlocks new patterns of model behavior, such as perfect markdown formatting and near real-time code generation. DeepMind also highlighted the model's compatibility with a range of development tools and hardware, including NVIDIA's consumer and enterprise systems. Source: Google DeepMind
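Why whole-paragraph visibility matters for tasks like code infilling can be shown with a small comparison of attention visibility. The sketch below is a simplified illustration, not DiffusionGemma's implementation: the `<MASK>` token, the helper names, and the tiny code snippet are all hypothetical.

```python
def causal_visible(position: int) -> list[int]:
    """Positions a token can attend to under a causal (left-to-right) mask."""
    return list(range(position + 1))


def full_visible(seq_len: int) -> list[int]:
    """Positions a token can attend to when every token sees the whole block."""
    return list(range(seq_len))


# Hypothetical infilling task: predict the token hidden behind <MASK>.
tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "<MASK>", "+", "b"]
gap = tokens.index("<MASK>")

causal_context = [tokens[i] for i in causal_visible(gap)]
full_context = [tokens[i] for i in full_visible(len(tokens))]

# The suffix "+ b" is the strongest hint that the gap should be "a".
print("causal mask sees the suffix:", "+" in causal_context)
print("full visibility sees the suffix:", "+" in full_context)
```

With a causal mask, the gap position can use only the text to its left. When every token can attend to all others, the text after the gap is also available, which is the property the article credits for advantages in code infilling.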

Key points

DiffusionGemma is an experimental open model that generates text up to 4x faster on GPUs.
DiffusionGemma operates as a 26B total Mixture of Experts (MoE) model, activating only 3.8B parameters during inference.
DiffusionGemma generates 256 tokens in parallel with each forward pass, allowing every token to attend to all others.
DiffusionGemma's speedup is designed for local and low-concurrency inference, with diminishing returns in high-QPS cloud serving.
DiffusionGemma's overall output quality is lower than standard Gemma 4, making Gemma 4 the recommended choice for maximum quality.
DiffusionGemma's architecture enables it to process the entire paragraph while generating, unlocking new patterns of model behavior.

Source: Google DeepMind

WRITTEN BY

Alex Lindgren

LLMs & Frontier Models

Alex covers large language models and their impact on society.