DeepMind has launched Gemma 4 12B, a new multimodal model designed to deliver high performance on consumer laptops. The model combines mobile-first efficiency with advanced reasoning, enabling agentic workflows directly on local hardware. It is the first mid-sized model to include native audio input, allowing developers to build applications that combine visual and audio data. The release follows a surge in developer adoption, with Gemma 4 models now exceeding 150 million downloads globally.
Gemma 4 12B introduces a unified architecture that eliminates traditional multimodal encoders. Instead of routing vision and audio through separate encoders, the model processes these inputs directly through its language model backbone. This approach reduces latency and memory usage, enabling the model to run on devices with as little as 16 GB of VRAM. The model also supports Multi-Token Prediction (MTP) drafters, which further reduce latency during inference. Together, these features let developers build complex multimodal applications on local hardware without sacrificing speed.
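To illustrate why a drafter can cut latency, here is a minimal sketch of a generic greedy draft-and-verify loop. It is not DeepMind's implementation, and the announcement does not describe how Gemma's MTP drafters work internally. The names `speculative_decode`, `target`, and `drafter` are hypothetical, and both models are assumed to be plain functions that map a token list to the next token.

```python
from typing import Callable, List

Token = int
# Assumption: a "model" is any function returning the next token for a context.
NextToken = Callable[[List[Token]], Token]


def greedy_decode(model: NextToken, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: one call to the large model per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(model(out))
    return out


def speculative_decode(
    target: NextToken,
    drafter: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    draft_len: int = 4,
) -> List[Token]:
    """Greedy draft-and-verify: the cheap drafter proposes, the target checks."""
    out = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # 1. The drafter proposes up to draft_len tokens, one after another.
        k = min(draft_len, max_new_tokens - produced)
        draft: List[Token] = []
        for _ in range(k):
            draft.append(drafter(out + draft))

        # 2. The target checks each drafted position. A real system would do
        #    this in a single batched forward pass, which is where the latency
        #    saving comes from; this toy version calls it position by position.
        for i, proposed in enumerate(draft):
            expected = target(out + draft[:i])
            if expected != proposed:
                # Keep the agreed prefix, then take the target's own token.
                out.extend(draft[:i])
                out.append(expected)
                produced += i + 1
                break
        else:
            # Every drafted token matched the target.
            out.extend(draft)
            produced += k
    return out
```

Because every emitted token is either confirmed or replaced by the target model, the output matches what plain greedy decoding with the target alone would produce. The drafter only affects how many target passes are needed.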
According to DeepMind, Gemma 4 12B is engineered to deliver benchmark performance comparable to the larger 26B MoE model, but with a significantly smaller memory footprint. The architecture is optimized for local execution, making advanced multimodal tasks accessible to developers working on personal devices. DeepMind emphasized the model's openness, releasing it under an Apache 2.0 license to foster broader innovation and integration across the developer ecosystem.
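The footprint gap can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates weight memory only, as parameters times bits per parameter. The precisions shown are illustrative assumptions, since the announcement does not say which numeric format the 16 GB figure refers to, and the estimate ignores the KV cache, activations, and runtime overhead. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Parameter counts are taken from the model names in the article.
MODELS = {"12B": 12e9, "26B MoE": 26e9}

# Assumed precisions for illustration: 16-bit, 8-bit, and 4-bit weights.
for name, params in MODELS.items():
    for bits in (16, 8, 4):
        print(f"{name:8s} @ {bits:2d}-bit: ~{weight_memory_gb(params, bits):.0f} GB")
```

Under these assumptions, the 12B weights need roughly 24 GB at 16-bit, 12 GB at 8-bit, and 6 GB at 4-bit, versus 52, 26, and 13 GB for a 26B model. A mixture-of-experts model activates only some of its parameters per token, but all expert weights generally still have to be resident in memory, so its footprint scales with the total parameter count.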
Source: DeepMind