DeepMind has introduced DiffusionGemma, an experimental open model built for fast text generation. Released under an Apache 2.0 license, it is designed to deliver up to 4x faster inference on dedicated GPUs, which targets speed-critical, interactive local workflows such as in-line editing and rapid iteration. The model uses a novel diffusion head to maximize generation speed and is aimed at developers exploring fast, parallel text generation. According to DeepMind, DiffusionGemma is a Mixture of Experts (MoE) model with 26B total parameters, of which only 3.8B are active during inference. When quantized, it fits within an 18GB VRAM limit on high-end consumer GPUs. Each forward pass generates 256 tokens in parallel, and every token can attend to all the others, which DeepMind says gives significant advantages in non-linear domains such as code infilling and mathematical graphs. Source: DeepMind
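
The memory claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is only an estimate of weight storage: it assumes that all 26B parameters must sit in VRAM even though only 3.8B are active per pass (which is typical for MoE models without offloading), it ignores activations and runtime overhead, and the bit widths are illustrative assumptions, since the article does not say which quantization format is used.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9    # total MoE parameters, from the article
ACTIVE_PARAMS = 3.8e9  # parameters active per forward pass, from the article
VRAM_BUDGET_GB = 18    # VRAM figure quoted in the article

# Bit widths are illustrative assumptions, not the released quantization format.
for bits in (16, 8, 4):
    needed = weight_memory_gb(TOTAL_PARAMS, bits)
    verdict = "fits" if needed <= VRAM_BUDGET_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {needed:5.1f} GB -> {verdict} in {VRAM_BUDGET_GB} GB")

# Share of the weights that does work on any single forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active fraction per pass: {active_fraction:.1%}")
```

Under these assumptions, only the low-bit setting leaves headroom inside 18GB, which is consistent with the article's point that the model fits when quantized.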

DiffusionGemma's approach to text generation differs from that of traditional autoregressive models. Most language models generate text sequentially, one token at a time, whereas DiffusionGemma drafts an entire 256-token paragraph simultaneously. This uses hardware more efficiently because the processor receives a larger chunk of work at once. The speedup is most beneficial for local and low-concurrency inference and may offer diminishing returns in high-QPS cloud serving environments. DeepMind emphasized that DiffusionGemma prioritizes speed and that its overall output quality is lower than that of the standard Gemma 4 model. For applications requiring maximum quality, the company recommends deploying the standard Gemma 4.
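
To make the sequential-versus-parallel contrast concrete, the following sketch counts forward passes under each scheme. The `refinement_steps` parameter is a hypothetical stand-in for however many passes a diffusion-style model spends refining each block; the article does not give that number, so the values below are placeholders. Pass counts are also not wall-clock speedups, because each parallel pass does more work than a single-token pass.

```python
import math

BLOCK_SIZE = 256  # tokens drafted per forward pass, from the article


def autoregressive_passes(num_tokens: int) -> int:
    """One forward pass per generated token."""
    return num_tokens


def block_parallel_passes(num_tokens: int, refinement_steps: int) -> int:
    """One pass per refinement step for each 256-token block."""
    blocks = math.ceil(num_tokens / BLOCK_SIZE)
    return blocks * refinement_steps


num_tokens = 1024
for steps in (1, 8, 32):  # hypothetical refinement-step counts
    ar = autoregressive_passes(num_tokens)
    bp = block_parallel_passes(num_tokens, steps)
    print(f"steps={steps:>2}: autoregressive={ar}, block-parallel={bp}, ratio={ar / bp:.1f}x")
```

The gap between these pass-count ratios and the reported "up to 4x" figure is a reminder that real throughput depends on how much each pass costs, which is also why the benefit may shrink in high-QPS serving where batching already keeps the hardware busy.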

DiffusionGemma builds on earlier research in diffusion-based text generation. The AI community has explored the approach for years, but applying it to large models has been a challenge. DiffusionGemma addresses this by changing how the model uses hardware, enabling parallel decoding and efficient local inference. Because the model processes the entire paragraph while generating it, DeepMind says it unlocks new patterns of behavior, such as perfect markdown formatting and near real-time code generation. DeepMind also highlighted compatibility with a range of development tools and hardware, including NVIDIA's consumer and enterprise systems.
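
The idea of processing the entire paragraph at once can be illustrated with attention masks. This is a generic sketch of the difference between causal and full attention within a block, not DeepMind's implementation; the function names are made up for illustration, and the block is shrunk to 4 positions so the output is readable.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Position i may attend only to positions 0..i (autoregressive)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n: int) -> list[list[bool]]:
    """Every position may attend to every other position in the block."""
    return [[True] * n for _ in range(n)]


def visible_counts(mask: list[list[bool]]) -> list[int]:
    """Number of positions each token can see."""
    return [sum(row) for row in mask]


n = 4  # tiny block for display; the article's block is 256 tokens
print(visible_counts(causal_mask(n)))  # [1, 2, 3, 4]
print(visible_counts(full_mask(n)))    # [4, 4, 4, 4]
```

With the full mask, a token in the middle of the block can see what follows it, which is the property that matters for tasks such as code infilling or keeping markdown structure consistent across a paragraph.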