Gemma 4 12B Unified: Google's first encoder-free multimodal model that runs on your laptop

Google DeepMind has released Gemma 4 12B Unified, a 12B-parameter model with an encoder-free architecture that processes text, image, audio, and video without separate encoders. It runs on consumer hardware, is licensed under Apache 2.0, and marks a turning point for local AI.

Google DeepMind has released Gemma 4 12B Unified, the most interesting variant of its new family of open models. This isn’t just another local model — it’s Google’s first multimodal model that completely eliminates dedicated encoders for vision and audio, projecting pixels and sound waves directly into the transformer’s embedding space.

A complete family under Apache 2.0

The Gemma 4 family, originally launched in March 2026, includes five sizes:

E2B (2.3B effective parameters): optimized for mobile and edge devices
E4B (4.5B effective parameters): also for mobile and edge devices, with more capacity
12B Unified (11.95B total): encoder-free, multimodal, for consumer GPUs
26B A4B MoE (25.2B total, 3.8B active): mixture-of-experts, for fast inference
31B Dense (30.7B): for workstations

All five are released under the Apache 2.0 license, with no additional usage restrictions.

What does “encoder-free” mean?

Traditional multimodal models (LLaVA, BLIP, earlier Gemma 4) use separate encoders — a vision encoder (like SigLIP) and an audio encoder — that convert images and audio into tokens that are then fed to the LLM. Gemma 4 12B Unified eliminates those encoders entirely: images and audio are projected directly into the transformer’s embedding space through lightweight linear layers.

This brings three practical advantages:

Lower memory usage: without having to load separate encoders, the model uses less VRAM
Lower multimodal latency: no separate encoding stage before the decoder, so everything flows through a single model
Unified fine-tuning: the entire model can be fine-tuned in one pass, without freezing encoders
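
To make the idea concrete, here is a minimal Python sketch of a linear patch projection. It is not Gemma's actual code: the patch size, the embedding width, the random weights, and the helper names patchify and linear_project are all assumptions chosen for illustration. The sketch cuts a toy grayscale image into patches, maps each patch to an embedding with a single linear layer, and appends stand-in text embeddings so that a single decoder would see one sequence.

```python
import random


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches


def linear_project(vectors, weight, bias):
    """Apply y = W x + b to each vector; weight has d_model rows of d_in values."""
    return [[sum(w * x for w, x in zip(row, vec)) + b
             for row, b in zip(weight, bias)]
            for vec in vectors]


# Hypothetical toy sizes, far smaller than any real model.
PATCH, D_MODEL = 4, 8
rng = random.Random(0)

image = [[rng.random() for _ in range(8)] for _ in range(8)]  # 8 x 8 "pixels"
weight = [[rng.uniform(-0.1, 0.1) for _ in range(PATCH * PATCH)]
          for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL

# Raw pixels go straight to embeddings: no vision encoder in between.
image_tokens = linear_project(patchify(image, PATCH), weight, bias)

# Stand-in for three already-embedded text tokens (assumption).
text_tokens = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(3)]

# One mixed sequence that a single decoder would consume.
sequence = image_tokens + text_tokens
print(len(image_tokens), len(sequence), len(sequence[0]))
```

The only trainable piece on the image side is the projection matrix, which is why a design like this can be fine-tuned together with the rest of the model.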

256K token context window and hybrid attention

The model uses a hybrid attention mechanism that interleaves local attention windows (sliding window) with full global attention. This optimizes memory usage for long contexts, which can reach up to 256K tokens. The global layers use unified Keys and Values with Proportional RoPE (p-RoPE), an improvement over traditional RoPE for very long sequences.
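
The interleaving can be sketched with attention masks. The window size, the number of layers, and the ratio of one global layer in every three are assumptions picked for readability, since the article does not give the real values, and the function names are hypothetical. The sketch builds a causal mask per layer and counts how many query-key pairs each kind of layer allows.

```python
def causal_mask(seq_len, window=None):
    """mask[i][j] is True when query position i may attend to key position j.

    window=None means full causal (global) attention; otherwise each query
    sees only the most recent `window` positions, itself included.
    """
    return [[j <= i and (window is None or i - j < window)
             for j in range(seq_len)]
            for i in range(seq_len)]


def layer_masks(num_layers, seq_len, window, global_every):
    """Make every `global_every`-th layer global and the rest sliding-window."""
    return [causal_mask(seq_len,
                        None if (layer + 1) % global_every == 0 else window)
            for layer in range(num_layers)]


def attended_pairs(mask):
    """Count the (query, key) pairs a mask allows."""
    return sum(sum(row) for row in mask)


# Hypothetical toy configuration, not Gemma's real settings.
SEQ_LEN, WINDOW, GLOBAL_EVERY = 16, 4, 3
masks = layer_masks(6, SEQ_LEN, WINDOW, GLOBAL_EVERY)

local_pairs = attended_pairs(masks[0])   # first layer: sliding window
global_pairs = attended_pairs(masks[2])  # third layer: global
print(local_pairs, global_pairs)
```

In this toy setup a local layer allows 58 query-key pairs and a global layer 136. The gap widens with sequence length, because a local layer's cost grows linearly with the number of tokens while a global layer's grows quadratically.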

Benchmark results

Gemma 4 12B Unified shows competitive numbers for its size:

Benchmark	Gemma 4 12B	Gemma 3 27B (without thinking)
MMLU Pro	77.2%	~65%
AIME 2026	77.5%	—
LiveCodeBench v6	72.0%	—
GPQA Diamond	78.8%	—
MMMU Pro	69.1%	~55%

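The improvement is substantial where a comparison is available: on MMLU Pro and MMMU Pro it beats Gemma 3 27B by roughly 12 and 14 percentage points, despite having less than half the total parameters. The table lists no Gemma 3 27B scores for the other three benchmarks.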

Does it really run on consumer hardware?

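The million-dollar question. In FP16, 11.95B parameters take about 24 GB for the weights alone, which is out of reach for consumer GPUs like an RTX 3060 (12 GB) or RX 570 (8 GB). Quantization shrinks the weights to about 12 GB at 8-bit and about 6 GB at 4-bit. These figures exclude the KV cache and activations, so 8-bit leaves essentially no headroom on a 12 GB card, while 4-bit fits on both 8 GB and 12 GB cards with room to spare.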
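
The arithmetic behind these figures is easy to reproduce. The sketch below counts weight storage only, in decimal gigabytes, and treats each format as a flat number of bits per weight. That is a simplification: real quantization formats add some metadata, and the KV cache and activations come on top. The function names are hypothetical.

```python
PARAMS = 11.95e9  # total parameters, as stated in the article


def weight_gb(params, bits_per_weight):
    """Weight storage in decimal GB; ignores KV cache, activations, and metadata."""
    return params * bits_per_weight / 8 / 1e9


def headroom_gb(vram_gb, bits_per_weight, params=PARAMS):
    """VRAM left after loading the weights; negative means they do not fit."""
    return vram_gb - weight_gb(params, bits_per_weight)


estimates = {name: weight_gb(PARAMS, bits)
             for name, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]}

for name, size in estimates.items():
    print(f"{name}: {size:.2f} GB of weights")

# The two cards named in the article.
print(f"RTX 3060 12 GB at 8-bit: {headroom_gb(12, 8):.2f} GB left")
print(f"RX 570 8 GB at 4-bit: {headroom_gb(8, 4):.2f} GB left")
```

Whatever headroom remains has to cover the KV cache, which grows with context length, so long prompts can still exhaust memory even when the weights fit.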

Google’s official announcement states the model targets “consumer GPUs and workstations” and that the smaller models are designed for “efficient local execution on laptops and mobile devices.” In practice, the real experience will depend on the quantization format (GGUF, GPTQ, AWQ) and backend (llama.cpp, transformers, MLX).

There are unverified reports of the model running vision tasks on an 8 GB AMD RX 570. That is technically plausible with 4-bit quantization, but Google has not confirmed it.

Why it matters

Gemma 4 12B Unified represents a shift in Google’s strategy: moving from open models focused on researchers to practical models that any developer can run on their own machine. The combination of encoder-free architecture, Apache 2.0 license, and multimodal capability in a single 12B parameter model makes it a serious option for anyone who wants local AI without relying on the cloud.

The model is available on Hugging Face (google/gemma-4-12B) and has already accumulated over 435,000 downloads.