DUET Paper: Dual-LM Inference Cuts 70% of Reasoning Tokens at Near-Parity Accuracy
Researchers from Boston University, MIT, and Biohub published DUET (May 1, 2026), a dual-mode inference architecture that separates reasoning (handled by a large capable model) from response generation (handled by a small lightweight model) via a jointly trained, bandwidth-limited communication channel. Benchmarks on MATH-500, AMC 23/24, and GPQA Diamond show an approximately 70% reduction in output tokens at parity-or-better accuracy versus single-model baselines and prompt-based GRPO. Experiments paired Qwen 4B (reasoning) with Qwen 0.6B (generation) under a 4×H100 compute budget.
Why It Matters
DUET suggests that separating reasoning compute from generation compute — rather than scaling a single model — is a viable path to frontier-level accuracy at a fraction of the inference cost. If it scales beyond the constrained experimental envelope, it could reshape the economics of production agent chains.