DeepSeek's Visual Primitives Paper Claims 10× KV-Cache Compression

DeepSeek's "Thinking with Visual Primitives" paper introduces a new approach to visual reasoning: bounding-box coordinates and point references are emitted as first-class tokens mid-chain-of-thought rather than relying on natural language to describe spatial relationships. Built on DeepSeek V4 Flash (284B MoE / 13B active), the model processes an 80×80 image into approximately 90 KV-cache entries — compared to roughly 870 for Sonnet 4.6 and 900 for Gemini 3 Flash, an order-of-magnitude difference. Vision mode began a limited rollout in the DeepSeek app from April 29.

Why It Matters

A 10× inference-cost reduction on high-throughput vision pipelines, if the benchmark claims hold in production, would directly reset the economics of OCR, creative, and robotics applications at scale.
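
As a rough illustration of why the entry count dominates serving cost, here is a back-of-envelope sketch using the reported figures. The bytes-per-entry value is an assumption for illustration only; real numbers depend on layer count, head dimension, and precision, none of which the summary gives.

```python
# Back-of-envelope: KV-cache memory per image at the reported entry counts.
BYTES_PER_ENTRY = 2 * 1024  # assumed K+V bytes per cached token, all layers

for model, entries in [("DeepSeek (claimed)", 90),
                       ("Sonnet 4.6 (reported)", 870),
                       ("Gemini 3 Flash (reported)", 900)]:
    kib = entries * BYTES_PER_ENTRY / 1024
    print(f"{model:24s} {entries:4d} entries ≈ {kib:6.1f} KiB per image")

# At a fixed cache-memory budget, ~10x fewer entries per image means ~10x
# more images resident per GPU, which is where the cost claim comes from.
```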