Model Release

Google releases DiffusionGemma, a new open AI model with 4x speed boost

Google DeepMind has released DiffusionGemma, a new open AI model that generates text 4x faster than similar autoregressive models, according to a June 2026 report.

Photo: Pavel Danilyuk / Pexels

Google DeepMind has released DiffusionGemma, a new member of the Gemma 4 open model family. This model generates text in parallel rather than sequentially, making it faster and more efficient on local hardware like gaming GPUs. According to the report, DiffusionGemma can produce around 700 tokens per second on an RTX 5090 and 1,000+ tokens per second on an Nvidia H100 AI accelerator.
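To put those throughput figures in perspective, here is a rough back-of-the-envelope sketch. The tokens-per-second values are the ones reported above. The autoregressive baseline is not a measured number: it is derived by dividing the reported rate by the 4x speedup, and the function names are hypothetical, chosen only for this illustration.

```python
# Reported throughput figures (tokens per second) from the article.
# The H100 value is a lower bound, since the report says "1,000+".
REPORTED_TPS = {"RTX 5090": 700.0, "H100": 1000.0}
SPEEDUP = 4.0  # reported speedup over similar autoregressive models


def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady rate (ignores prompt processing)."""
    return num_tokens / tokens_per_second


def implied_baseline_tps(diffusion_tps: float, speedup: float = SPEEDUP) -> float:
    """Autoregressive rate implied by the speedup. An assumption, not a measurement."""
    return diffusion_tps / speedup


if __name__ == "__main__":
    for gpu, tps in REPORTED_TPS.items():
        base = implied_baseline_tps(tps)
        fast = seconds_to_generate(1400, tps)
        slow = seconds_to_generate(1400, base)
        print(f"{gpu}: 1,400 tokens in {fast:.1f}s vs {slow:.1f}s "
              f"at the implied {base:.0f} tok/s baseline")
```

Under these assumptions, a 1,400-token answer that takes about two seconds on an RTX 5090 would take roughly eight seconds at the implied autoregressive rate.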

This represents a fourfold increase in speed compared to similar autoregressive models. The model uses a technique similar to image generation, where it starts with placeholder tokens and iteratively refines them to produce the final output. This approach allows DiffusionGemma to generate up to 256 tokens in parallel, improving performance in tasks like in-line editing and mathematical graphing.
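The refinement loop described above can be illustrated with a toy sketch. This is not DiffusionGemma's actual algorithm or API: the `toy_denoiser` function is a hypothetical stand-in that reads its proposals from a fixed target sentence and assigns random confidences, where a real model would predict both from context. The `<mask>` placeholder name and the commit schedule are likewise assumptions made for the example.

```python
import random

MASK = "<mask>"  # placeholder token; the name is an assumption for this sketch


def toy_denoiser(tokens, target, rng):
    """Stand-in for a model: propose a token and a confidence for every masked slot.

    A real model would predict these from context. Here the proposals come from
    a fixed target sequence and the confidences are random.
    """
    return {i: (target[i], rng.random()) for i, t in enumerate(tokens) if t == MASK}


def diffusion_decode(target, steps=4, seed=0):
    """Start from all placeholders and commit the most confident proposals each step."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    history = [list(tokens)]
    for step in range(steps):
        proposals = toy_denoiser(tokens, target, rng)
        if not proposals:
            break  # nothing left to refine
        # Commit an even share of the remaining slots; the last step commits all of them.
        remaining_steps = steps - step
        k = -(-len(proposals) // remaining_steps)  # ceiling division
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = proposals[i][0]
        history.append(list(tokens))
    return tokens, history


if __name__ == "__main__":
    target = "the quick brown fox jumps over the lazy dog".split()
    final, history = diffusion_decode(target, steps=3)
    for n, state in enumerate(history):
        print(n, " ".join(state))
```

In this toy run, nine tokens are filled in over three refinement passes, with every open position scored in the same pass. An autoregressive decoder would need nine sequential steps for the same sentence, one per token.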

The model is designed to work efficiently on local hardware, which often has lower memory bandwidth and more idle time than cloud systems. Google has also experimented with diffusion models in cloud-based Gemini systems, but the approach has drawbacks, including a higher error rate and wasted resources on short outputs. DiffusionGemma is available under the same Apache 2.0 license as other Gemma models and can be downloaded from Hugging Face.

Google has optimized the model for a variety of setups, including high-end RTX GPUs and enterprise systems like the H100 or DGX Spark platform.

Key points

Google DeepMind released DiffusionGemma, a new open AI model that generates text 4x faster than similar autoregressive models.
DiffusionGemma can produce around 700 tokens per second on an RTX 5090 and 1,000+ tokens per second on an Nvidia H100 AI accelerator.
DiffusionGemma uses a technique similar to image generation, where it starts with placeholder tokens and iteratively refines them to produce the final output.
The model is designed to work efficiently on local hardware, which often has lower memory bandwidth and more idle time than cloud systems.
Google has optimized the model for a variety of setups, including high-end RTX GPUs and enterprise systems like the H100 or DGX Spark platform.

Source: Ars Technica

WRITTEN BY

Alex Lindgren

LLMs & Frontier Models

Alex covers large language models and their impact on society.