DeepSeek's Visual Primitives Paper Claims 10× KV-Cache Compression

DeepSeek's "Thinking with Visual Primitives" paper introduces a new approach to visual reasoning: bounding-box coordinates and point references are emitted as first-class tokens mid-chain-of-thought rather than relying on natural language to describe spatial relationships. Built on DeepSeek V4 Flash (284B MoE / 13B active), the model processes an 80×80 image into approximately 90 KV-cache entries — compared to roughly 870 for Sonnet 4.6 and 900 for Gemini 3 Flash, an order-of-magnitude difference. Vision mode began a limited rollout in the DeepSeek app from April 29.

Why It Matters

A 10× inference-cost reduction on high-throughput vision pipelines, if the benchmark claims hold in production, would directly reset the economics of OCR, creative, and robotics applications at scale.
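
As a rough illustration of why the entry count dominates serving cost, here is a back-of-envelope sketch using the reported figures. The bytes-per-entry value is an assumption for illustration only; real numbers depend on layer count, head dimension, and precision, none of which the summary gives.

```python
# Back-of-envelope: KV-cache memory per image at the reported entry counts.
BYTES_PER_ENTRY = 2 * 1024  # assumed K+V bytes per cached token, all layers

for model, entries in [("DeepSeek (claimed)", 90),
                       ("Sonnet 4.6 (reported)", 870),
                       ("Gemini 3 Flash (reported)", 900)]:
    kib = entries * BYTES_PER_ENTRY / 1024
    print(f"{model:24s} {entries:4d} entries ≈ {kib:6.1f} KiB per image")

# At a fixed cache-memory budget, ~10x fewer entries per image means ~10x
# more images resident per GPU, which is where the cost claim comes from.
```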