Google DeepMind has released DiffusionGemma, a new member of the Gemma 4 open model family. The model generates text in parallel rather than one token at a time, which makes it faster and more efficient on local hardware such as gaming GPUs. According to Ars Technica, DiffusionGemma can produce around 700 tokens per second on an RTX 5090 and more than 1,000 tokens per second on an Nvidia H100 AI accelerator.

The reported throughput is a fourfold increase over similar autoregressive models. The model uses a technique similar to the one used in image generation: it starts with placeholder tokens and iteratively refines them into the final output. This approach lets DiffusionGemma generate up to 256 tokens in parallel, which improves performance on tasks such as in-line editing and mathematical graphing.
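The refinement loop described above can be illustrated with a toy sketch. This is not DiffusionGemma's actual algorithm or API: `toy_denoiser` is a hypothetical stand-in that already knows the answer and assigns random confidences, and the rule of committing the most confident slots each pass is an assumption chosen only to show how a block of placeholders can be filled in fewer passes than there are tokens.

```python
import random

MASK = "<mask>"  # placeholder token; the name is an assumption for this sketch


def toy_denoiser(tokens, target, rng):
    """Hypothetical stand-in for a model: for every masked slot, propose
    a token and a confidence score. A real model would predict the token;
    here we simply read it from `target`."""
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def parallel_decode(target, steps=4, seed=0):
    """Start from all placeholders and refine the whole block each pass,
    committing the most confident proposals until no placeholder is left."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    # Number of slots to commit per pass (ceiling division, at least one).
    per_step = max(1, -(-len(target) // steps))
    history = [list(tokens)]
    while MASK in tokens:
        proposals = toy_denoiser(tokens, target, rng)
        # Rank masked positions by confidence, highest first.
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in best[:per_step]:
            tokens[i] = proposals[i][0]
        history.append(list(tokens))
    return tokens, history
```

Calling `parallel_decode("the quick brown fox jumps over the lazy dog".split(), steps=3)` fills nine slots in three passes, whereas a token-by-token decoder would need nine. The `history` list holds a snapshot of the block after each pass, which makes the gradual replacement of placeholders easy to inspect.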

The model is designed to work efficiently on local hardware, which often has lower memory bandwidth and idle time compared to cloud systems. Google has also experimented with diffusion models in its cloud-based Gemini systems, where the approach has drawbacks, including a higher error rate and wasted resources on short outputs. DiffusionGemma is available under the same Apache 2.0 license as other Gemma models and can be downloaded from Hugging Face.

Google has optimized the model for a variety of setups, including high-end RTX GPUs and enterprise systems like the H100 or DGX Spark platform.

Source: Ars Technica