Model Release

DeepMind Unveils Gemma 4 12B Multimodal Model

DeepMind released Gemma 4 12B, a mid-sized model offering advanced reasoning and audio support, as downloads of Gemma 4 models pass 150 million.

DeepMind has launched Gemma 4 12B, a new multimodal model designed to deliver high-performance capabilities on consumer laptops. The model combines mobile-first efficiency with advanced reasoning, enabling agentic workflows directly on local hardware. It is the first mid-sized model to include native audio inputs, allowing developers to create applications that integrate visual and auditory data seamlessly. The release follows a surge in developer adoption, with Gemma 4 models now exceeding 150 million downloads globally.

Gemma 4 12B introduces a novel unified architecture by eliminating traditional multimodal encoders. Instead of using separate encoders for vision and audio, the model processes these inputs directly through its language model backbone. This approach reduces latency and memory usage, enabling the model to run efficiently on devices with as little as 16GB of VRAM. The model also supports Multi-Token Prediction (MTP) drafters, which help minimize latency during inference. These features allow developers to build complex applications without compromising performance or speed.
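The 16GB figure can be put in context with back-of-the-envelope arithmetic. The sketch below is an illustration, not anything published by DeepMind: it assumes roughly 12 billion parameters (inferred from the "12B" in the name), counts weights only, and uses a hypothetical set of storage precisions. Activations, KV cache, and runtime overhead are ignored.

```python
# Weight-only memory estimate. Assumptions (not from the source):
# ~12e9 parameters, and the precisions listed below are illustrative.
PARAMS = 12e9
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Return the storage needed for the weights, in GB (10**9 bytes)."""
    return params * bytes_per_param / 1e9


def fits(budget_gb: float) -> dict[str, bool]:
    """Report, per precision, whether the weights alone fit in the budget."""
    return {
        name: weight_memory_gb(PARAMS, size) <= budget_gb
        for name, size in BYTES_PER_PARAM.items()
    }


if __name__ == "__main__":
    for name, size in BYTES_PER_PARAM.items():
        print(f"{name}: {weight_memory_gb(PARAMS, size):.1f} GB")
    print(fits(16.0))
```

Under these assumptions, 16-bit weights alone would come to about 24 GB, so a 16GB budget would presumably depend on 8-bit or lower precision. The source does not say which precision the 16GB figure refers to.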

According to DeepMind, Gemma 4 12B is engineered to deliver benchmark performance comparable to the larger 26B MoE model, but with a significantly smaller memory footprint. The architecture is optimized for local execution, making the model accessible to developers who want to run advanced multimodal tasks on personal devices. DeepMind emphasized the model's open-source nature, releasing it under an Apache 2.0 license to foster broader innovation and integration across the developer ecosystem.

Source: Google DeepMind

Key points

DeepMind released Gemma 4 12B, a mid-sized model, as Gemma 4 models pass 150 million downloads.
Gemma 4 12B is the first mid-sized model to feature native audio inputs.
The model processes visual and audio inputs directly through its language model backbone.
Gemma 4 12B delivers benchmark performance nearing that of the 26B MoE model, but with less than half the memory footprint.
The model is designed to run locally on consumer laptops with 16GB of VRAM.
Gemma 4 12B is released under an Apache 2.0 license to support broader integration and innovation.

WRITTEN BY

Alex Lindgren

LLMs & Frontier Models

Alex covers large language models and their impact on society.
