DUET Paper: Dual-LM Inference Cuts 70% of Reasoning Tokens at Near-Parity Accuracy
Researchers from Boston University, MIT, and Biohub published DUET (May 1, 2026), a dual-mode inference architecture that separates reasoning (handled by a large capable model) from response generation (handled by a small lightweight model) via a jointly trained, bandwidth-limited communication channel. Benchmarks on MATH-500, AMC 23/24, and GPQA Diamond show an approximately 70% reduction in output tokens at parity-or-better accuracy versus single-model baselines and prompt-based GRPO. Experiments paired Qwen 4B (reasoning) with Qwen 0.6B (generation) under a 4×H100 compute budget.
Why It Matters
DUET suggests that separating reasoning compute from generation compute — rather than scaling a single model — is a viable path to frontier-level accuracy at a fraction of the inference cost. If it scales beyond the constrained experimental envelope, it could reshape the economics of production agent chains.