Google DeepMind has released Gemma 4 12B, an open AI model that brings multimodal capabilities to everyday laptops. The model processes text, images, and audio natively without separate encoders, which Google says reduces processing time, memory use, and latency. According to Google, it runs locally with just 16 GB of RAM and nearly matches the 26B model, which is roughly twice its size, across benchmarks. It is also the first mid-sized Gemma model with native audio processing.
Gemma 4 12B handles speech recognition, code generation, and video analysis. Per the Developer Guide, it can parse multi-minute video clips by analyzing frames and audio together. In one demo, it processed a roughly five-minute Google I/O keynote clip sampled at one frame per second, which came to 313 frames plus the audio track. On benchmarks such as GPQA Diamond, MMLU Pro, and DocVQA, Gemma 4 12B comes close to the 26B model and clearly outperforms the older Gemma 3 27B.
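The frame count in the keynote demo follows from simple sampling arithmetic: 313 frames at one frame per second corresponds to about 5 minutes and 13 seconds of video. The short Python sketch below illustrates that arithmetic, assuming uniform sampling that starts at the beginning of the clip. The function name `sample_timestamps` is hypothetical and not part of any Gemma tooling, and the source does not describe how the demo actually extracted its frames.

```python
def sample_timestamps(duration_s: float, fps: float = 1.0) -> list[float]:
    """Return the timestamps (in seconds) at which frames would be sampled.

    Assumption: frames are taken uniformly, starting at t = 0, and only
    frames that fall inside the clip are counted.
    """
    if duration_s <= 0 or fps <= 0:
        return []
    step = 1.0 / fps                    # seconds between sampled frames
    count = int(duration_s * fps)       # number of frames inside the clip
    return [i * step for i in range(count)]


# A clip of 5 min 13 s sampled at one frame per second.
clip_seconds = 5 * 60 + 13
timestamps = sample_timestamps(clip_seconds, fps=1.0)
print(len(timestamps))                  # number of sampled frames
```

Lowering `fps` cuts the number of frames the model has to process for a given clip, at the cost of coarser temporal detail.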
The model is available on Hugging Face, Ollama, LM Studio, and other platforms under the Apache 2.0 license, which permits commercial use. Source: The Decoder