DeepSeek V4-Pro: 10× KV Cache Efficiency at Open-Source Scale

DeepSeek V4-Pro launched on HuggingFace this morning and reached #1 trending in 43 minutes, which HuggingFace CEO Clément Delangue described as the fastest climb to that position by any model. The velocity reflects real engineering substance: at one million tokens of context, V4-Pro requires only 27% of the single-token inference FLOPs of DeepSeek V3.2 and just 10% of its KV cache.

What the Source Actually Says

The engineering implication is concrete. On an NVIDIA GB300 with 176 GB of HBM per GPU, DeepSeek V3.2 at 1M context requires 35.60 GB of KV cache per request, fitting roughly 4 concurrent requests. V4-Pro's 10× KV cache reduction brings that figure down to about 3.56 GB, enabling approximately 40 concurrent requests on identical hardware. The throughput multiplier is not a benchmark-table artifact; it directly sets the cost of serving inference at scale.
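As a sanity check on those figures, here is a back-of-envelope sketch (my own, not from the release) that reproduces the concurrency math. It assumes, simplistically, that all per-GPU HBM goes to KV cache; reserving memory for weights and activations pulls the V4-Pro figure down from the ~49-request upper bound toward the ~40 quoted above.

```python
# Back-of-envelope check of the concurrency figures quoted above.
# Simplifying assumption (mine): all per-GPU HBM is available for KV cache,
# ignoring model weights, activations, and allocator overhead.

HBM_PER_GPU_GB = 176.0   # NVIDIA GB300 figure from the article
V32_KV_GB = 35.60        # DeepSeek V3.2 KV cache per 1M-context request
KV_REDUCTION = 10.0      # V4-Pro's claimed reduction factor

v4_kv_gb = V32_KV_GB / KV_REDUCTION  # ~3.56 GB per request

def max_concurrent(kv_gb_per_request: float) -> int:
    """How many requests' KV caches fit in one GPU's HBM."""
    return int(HBM_PER_GPU_GB // kv_gb_per_request)

print(max_concurrent(V32_KV_GB))  # 4  -> matches the V3.2 figure
print(max_concurrent(v4_kv_gb))   # 49 -> upper bound; ~40 once weights
                                  #       and activations claim their share
```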

V4-Pro ships with 1.6 trillion total parameters and 49 billion active parameters under a mixture-of-experts architecture, alongside V4-Flash (284 billion total / 13 billion active) as the fast, economical tier. Both models support a 1M-token context window and are fully open-weight, with open technical reports published simultaneously on HuggingFace. API access went live the same day. Independent benchmarks show V4-Pro exceeding Claude Opus 4.6 on Terminal Bench and closely matching it across other standard evaluations.
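To put the mixture-of-experts numbers in perspective, the sketch below (my own illustration, not from the technical reports) computes each tier's active-parameter fraction and a rough per-token forward-pass compute, using the common ~2 FLOPs-per-active-parameter approximation.

```python
# Sparsity and rough per-token compute for the two V4 tiers.
# Uses the common ~2 FLOPs per active parameter per token approximation
# for a decoder forward pass; an illustration, not a source figure.

MODELS = {
    "V4-Pro":   {"total_b": 1600.0, "active_b": 49.0},  # params in billions
    "V4-Flash": {"total_b": 284.0,  "active_b": 13.0},
}

for name, p in MODELS.items():
    active_frac = p["active_b"] / p["total_b"]
    gflops_per_token = 2 * p["active_b"]  # billions of params -> GFLOPs
    print(f"{name}: {active_frac:.1%} of parameters active per token, "
          f"~{gflops_per_token:.0f} GFLOPs/token forward pass")

# V4-Pro:   3.1% of parameters active per token, ~98 GFLOPs/token forward pass
# V4-Flash: 4.6% of parameters active per token, ~26 GFLOPs/token forward pass
```

By this crude measure, V4-Flash runs at roughly a quarter of V4-Pro's per-token compute, consistent with its positioning as the economical tier.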

Huawei confirmed its Ascend 950 supernode will fully support V4, extending the model's reach to hardware outside the NVIDIA ecosystem — a signal that coordinated deployment across both hardware stacks was prepared in advance.

Strategic Take

The 10× KV cache reduction is the structural story, not the parameter count. For operators running long-context workloads, V4-Pro's concurrency multiplier sharply compresses cost per request, and because the model is fully open-weight, vendor lock-in drops out of the calculation. This is the first open model to compete directly with frontier labs on throughput economics at extended context.