Google DeepMind has released Gemma 4 12B Unified, the most interesting variant of its new family of open models. This isn’t just another local model: it is Google’s first multimodal model that eliminates dedicated vision and audio encoders entirely, projecting pixels and sound waves directly into the transformer’s embedding space.
A complete family under Apache 2.0
The Gemma 4 family, originally launched in March 2026, includes five sizes:
- E2B (2.3B effective parameters): optimized for mobile and edge devices
- E4B (4.5B): same mobile and edge target as E2B, with more capacity
- 12B Unified (11.95B total): encoder-free, multimodal, for consumer GPUs
- 26B A4B MoE (25.2B total, 3.8B active): mixture-of-experts (MoE), for fast inference
- 31B Dense (30.7B): for workstations
All five are released under the Apache 2.0 license, with no additional usage restrictions.
What does “encoder-free” mean?
Traditional multimodal models (LLaVA, BLIP, the earlier Gemma 4 variants) use separate encoders, typically a vision encoder (like SigLIP) and an audio encoder, that convert images and audio into tokens that are then fed to the LLM. Gemma 4 12B Unified eliminates those encoders entirely: images and audio are projected directly into the transformer’s embedding space through lightweight linear layers.
This brings three practical advantages:
- Lower memory usage: without having to load separate encoders, the model uses less VRAM
- Lower multimodal latency: no separate encoder pass before the decoder, so everything flows through a single decoder
- Unified fine-tuning: the entire model can be fine-tuned in one pass, without freezing encoders
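To make "projected directly through lightweight linear layers" concrete, here is a minimal sketch in plain Python. It is not Gemma's actual implementation: the patch size, embedding width, weight initialization, and every function name (`patchify`, `linear`, `project_image`) are assumptions chosen for illustration. It only shows the general idea of cutting an image into patches and mapping each flattened patch to a vector of the same width as the text embeddings, so both can share one sequence.

```python
import random

def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])  # append the C channel values
            patches.append(flat)
    return patches

def linear(x, weight, bias):
    """Compute y = W x + b for one vector; weight has one row per output dim."""
    return [
        sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
        for row, b in zip(weight, bias)
    ]

def project_image(image, patch, weight, bias):
    """Map every patch straight into the embedding space with one linear layer."""
    return [linear(p, weight, bias) for p in patchify(image, patch)]

# Toy sizes, all hypothetical: an 8x8 RGB image, 4x4 patches, 16-dim embeddings.
rng = random.Random(0)
H, W, C, PATCH, D_MODEL = 8, 8, 3, 4, 16
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
in_dim = PATCH * PATCH * C
weight = [[rng.gauss(0, 0.02) for _ in range(in_dim)] for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL

image_tokens = project_image(image, PATCH, weight, bias)
text_tokens = [[0.0] * D_MODEL for _ in range(5)]  # stand-in for embedded text
sequence = image_tokens + text_tokens  # one sequence for a single decoder
print(len(image_tokens), len(sequence), len(sequence[0]))
```

The point of the sketch is that there is no separate network between the pixels and the decoder: the only image-specific parameters are one weight matrix and one bias, which is why such a projection adds little memory and can be trained together with the rest of the model.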
256K token context window and hybrid attention
The model uses a hybrid attention mechanism that interleaves local (sliding-window) attention layers with full global attention layers. Because the sliding-window layers only need to keep a fixed-size window of keys and values, memory grows more slowly with context length, which can reach up to 256K tokens. The global layers use unified keys and values with Proportional RoPE (p-RoPE), an improvement over traditional RoPE for very long sequences.
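The memory effect of interleaving local and global layers can be sketched with a small calculation. The layer count, the ratio of local to global layers, and the window size below are hypothetical values picked for illustration; the article does not state the real ones. The sketch counts how many key/value positions must be cached per layer at full context, compared with a model where every layer is global.

```python
def visible_keys(query_pos, layer_is_global, window):
    """Key positions a causal query may attend to in one layer."""
    start = 0 if layer_is_global else max(0, query_pos - window + 1)
    return range(start, query_pos + 1)

def layer_pattern(num_layers, local_per_global):
    """True marks a global layer: one follows each run of local layers."""
    return [(i + 1) % (local_per_global + 1) == 0 for i in range(num_layers)]

def cached_positions(seq_len, pattern, window):
    """Key/value positions kept across all layers for seq_len tokens."""
    return sum(
        seq_len if is_global else min(seq_len, window)
        for is_global in pattern
    )

# Hypothetical configuration, NOT taken from the Gemma 4 specification.
NUM_LAYERS = 12
LOCAL_PER_GLOBAL = 5
WINDOW = 1024
SEQ_LEN = 256 * 1024  # the 256K-token context mentioned above

pattern = layer_pattern(NUM_LAYERS, LOCAL_PER_GLOBAL)
hybrid = cached_positions(SEQ_LEN, pattern, WINDOW)
full = SEQ_LEN * NUM_LAYERS  # every layer global
print(f"global layers: {pattern.count(True)} of {NUM_LAYERS}")
print(f"cached positions, hybrid vs all-global: {hybrid} vs {full}")
print(f"hybrid cache is {hybrid / full:.1%} of the all-global cache")
```

Under these assumed numbers, only the global layers pay the full cost of a long context; the local layers stay capped at the window size regardless of how long the sequence gets.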
Benchmark results
Gemma 4 12B Unified shows competitive numbers for its size:
| Benchmark | Gemma 4 12B | Gemma 3 27B (without thinking) |
|---|---|---|
| MMLU Pro | 77.2% | ~65% |
| AIME 2026 | 77.5% | — |
| LiveCodeBench v6 | 72.0% | — |
| GPQA Diamond | 78.8% | — |
| MMMU Pro | 69.1% | ~55% |
The improvement is substantial: on the two benchmarks where a Gemma 3 27B score is listed (MMLU Pro and MMMU Pro), it leads by roughly 12 and 14 points respectively, despite having less than half the total parameters.
Does it really run on consumer hardware?
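The million-dollar question. At FP16, the weights of an 11.95B-parameter model alone take ~24 GB of VRAM, out of reach for consumer GPUs like an RTX 3060 (12GB) or RX 570 (8GB). With 8-bit quantization the weights shrink to ~12GB, and with 4-bit quantization to ~6GB, which puts the model within reach of consumer hardware. These figures cover the weights only: the KV cache and activations need additional memory, so an 8-bit build will not fit comfortably on a 12GB card.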
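The arithmetic behind those figures is simple enough to check. The sketch below assumes decimal gigabytes (1 GB = 10^9 bytes) and counts only the raw weights; real quantized files (GGUF, GPTQ, AWQ) also store scales and metadata, so actual sizes will be somewhat larger. The function name `weight_gb` is just illustrative.

```python
PARAMS = 11.95e9  # total parameters, from the figure quoted above

def weight_gb(params, bits_per_weight):
    """Raw weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

# Weights only: KV cache, activations and quantization overhead are excluded.
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_gb(PARAMS, bits):.3f} GB")
```

Comparing these results against a card's VRAM gives an upper bound on what can fit, not a guarantee: long contexts in particular add a KV cache on top of the weights.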
Google’s official announcement states the model targets “consumer GPUs and workstations” and that the smaller models are designed for “efficient local execution on laptops and mobile devices.” In practice, the real experience will depend on the quantization format (GGUF, GPTQ, AWQ) and backend (llama.cpp, transformers, MLX).
There are unverified reports of the model running on an AMD RX 570 with 8GB for vision tasks. That is technically plausible with 4-bit quantization, but Google has not officially confirmed it.
Why it matters
Gemma 4 12B Unified represents a shift in Google’s strategy: moving from open models focused on researchers to practical models that any developer can run on their own machine. The combination of encoder-free architecture, Apache 2.0 license, and multimodal capability in a single 12B parameter model makes it a serious option for anyone who wants local AI without relying on the cloud.
The model is available on Hugging Face (google/gemma-4-12B) and has already accumulated over 435,000 downloads.