Gemma 4 Beats Models 20× Its Size on LM Arena, Goes Apache 2.0
Google DeepMind shipped Gemma 4 under Apache 2.0, a deliberate relicensing that positions the family as infrastructure-grade open source for on-device and agentic workloads. The four sizes run from E2B and E4B (on-device multimodal) up through a 26B MoE and a 31B dense flagship that already sits at #3 on the global LM Arena leaderboard, outperforming closed models up to 20× its size.
What the Source Actually Says
Google DeepMind researcher Cassidy Hardin walked through the compound efficiency gains at AI Engineer. First, a 5:1 local-to-global attention ratio: local layers use sliding-window attention (512–1,024 tokens), the final layer is always global, and grouped-query attention runs at 2:1 locally and 8:1 globally. Second, Per-Layer Embeddings (PLE): each layer gets its own 256-dim table stored in flash rather than VRAM, the mechanism that makes E2B's "effective 2B" framing real (actual parameter count: 5.1B) without touching GPU memory.
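To make the attention ratio concrete, here is a minimal sketch of how such a layer schedule could be expressed. Everything in it (the `LayerSpec` container, the `build_schedule` helper, the 48-layer depth, the 1,024-token window) is an illustrative assumption, not a confirmed Gemma 4 internal:

```python
from dataclasses import dataclass

@dataclass
class LayerSpec:
    kind: str            # "local" (sliding window) or "global" (full attention)
    window: int | None   # sliding-window span in tokens; None on global layers
    q_per_kv: int        # grouped-query ratio: query heads per KV head

def build_schedule(n_layers: int = 48, local_per_global: int = 5,
                   window: int = 1024) -> list[LayerSpec]:
    """Interleave 5 local layers per global layer, force the final layer
    global, and set GQA to 2:1 on local and 8:1 on global layers."""
    schedule = []
    for i in range(n_layers):
        is_global = (i % (local_per_global + 1) == local_per_global
                     or i == n_layers - 1)
        if is_global:
            schedule.append(LayerSpec("global", None, q_per_kv=8))
        else:
            schedule.append(LayerSpec("local", window, q_per_kv=2))
    return schedule
```

The payoff shows up in the KV cache: a sliding-window layer's cache is capped at the window size no matter how long the context grows, and the 8:1 GQA ratio on the rare global layers shrinks the one cache that does scale with a 256K context.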
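The PLE mechanism is easiest to see as a storage trick. Below is a minimal sketch, assuming one memory-mapped fp16 table per layer sitting on flash; the file layout, dtype, and `PerLayerEmbeddings` class are hypothetical, and how the gathered rows are fused into the hidden state is not specified in the talk:

```python
import numpy as np

EMB_DIM = 256  # per-layer table width quoted in the talk

class PerLayerEmbeddings:
    """Per-layer embedding tables memory-mapped from flash, so the
    weights never occupy VRAM; the OS pages in only the rows a
    lookup actually touches."""

    def __init__(self, path: str, n_layers: int, vocab_size: int):
        self.tables = [
            np.memmap(f"{path}/ple_layer_{i}.bin", dtype=np.float16,
                      mode="r", shape=(vocab_size, EMB_DIM))
            for i in range(n_layers)
        ]

    def lookup(self, layer: int, token_ids: np.ndarray) -> np.ndarray:
        # Gather [seq_len, 256] rows for this layer from flash storage.
        return np.asarray(self.tables[layer][token_ids])
```

Only the rows a sequence actually indexes ever cross the flash-to-RAM boundary, which is how 5.1B raw parameters can behave like 2B for GPU memory accounting.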
The 26B introduces Gemma's first mixture-of-experts architecture: 128 total experts, 8 active per forward pass, plus an always-on shared expert at 3× the size of a regular expert. Both large models sit in the top six open-source slots on LM Arena; the 31B holds #3 globally with 256K context and native function calling. Deployment spans self-hosted (Hugging Face, Kaggle, Ollama) and cloud-hosted (AI Studio, Vertex AI). The vision encoder ditches Gemma 3's pan-and-scan for variable aspect-ratio and variable-resolution inputs; E2B and E4B add a 35M-parameter Conformer audio encoder targeting translation and ASR.
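A shape-level sketch helps pin down what "plus a shared expert" means in the forward pass. Only the counts below (128 experts, 8 active, one always-on shared expert) come from the talk; the gating math, renormalization over the selected experts, and expert internals are assumptions:

```python
import numpy as np

def moe_forward(x: np.ndarray,           # [n_tokens, d_model] hidden states
                router_w: np.ndarray,    # [d_model, 128] router projection
                experts: list,           # 128 callables: [d_model] -> [d_model]
                shared_expert,           # always-on callable, ~3x expert width
                k: int = 8) -> np.ndarray:
    logits = x @ router_w                          # router scores per token
    top8 = np.argsort(logits, axis=-1)[:, -k:]     # ids of the 8 active experts
    out = np.stack([shared_expert(t) for t in x])  # shared path runs on every token
    for i, token in enumerate(x):
        scores = logits[i, top8[i]]
        gate = np.exp(scores - scores.max())       # stable softmax over the 8
        gate /= gate.sum()
        for expert_id, weight in zip(top8[i], gate):
            out[i] += weight * experts[expert_id](token)
    return out
```

With 8 of 128 experts firing per token, per-token compute tracks the active slice plus the 3×-sized shared path rather than the full parameter count, which is the point of routing at this scale.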
A second source — the official @googlegemma account amplified by Hugging Face — confirms Gemma 4 E2B already powers a fully local browser agent via WebGPU with zero server infrastructure. The agent handles browsing history, page summarization, and tab management entirely client-side.
Strategic Take
Apache 2.0 licensing, arena-competitive benchmarks, and on-device multimodality form a combination the open-source stack hasn't had at this parameter count before. The WebGPU browser agent removes the last infrastructure dependency for privacy-first or edge-deployed builds. Teams evaluating proprietary APIs should recheck their baseline: the 31B self-hosted model now clears quality bars that previously required API access.