NanoGPT-Bench: Coding Agents Recover Only 9.3% of Human AI R&D Progress
IntologyAI's NanoGPT-Bench tests Codex, Claude Code, and Autoresearch on the NanoGPT Speedrun, a five-month window of world-record submissions spanning approximately two years of human contributions. Overall, the agents recover 9.3% of human AI research progress, and their effort concentrates on hyperparameter tuning. They largely ignore algorithmic research, the source of most human gains. The evaluation ran fully autonomously, with no internet access and no human intervention.
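The summary does not say how the 9.3% figure is computed. One plausible reading, given here only as an assumption, is the share of the human-achieved improvement that an agent reproduces from the same starting point. The function name, parameter names, and numbers in the sketch below are hypothetical and are not taken from the benchmark.

```python
def recovered_fraction(baseline: float, human_best: float, agent_best: float) -> float:
    """Fraction of the human improvement that the agent recovers.

    Assumes a lower-is-better metric (for example, training time to a
    fixed target). This definition is an assumption for illustration,
    not the benchmark's documented formula.
    """
    human_gain = baseline - human_best
    if human_gain <= 0:
        raise ValueError("human_best must improve on baseline")
    agent_gain = baseline - agent_best
    return agent_gain / human_gain


# Hypothetical values, chosen only to make the arithmetic easy to follow.
fraction = recovered_fraction(baseline=100.0, human_best=20.0, agent_best=92.0)
print(f"{fraction:.1%}")
```

Under this reading, an agent that only tunes hyperparameters moves `agent_best` slightly below `baseline`, while the large drop to `human_best` comes from changes the agent never attempts.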
Why It Matters
The result establishes a concrete capability gap: current frontier coding agents can replicate and optimize existing approaches, but they are not yet generating the algorithmic innovations that drive research progress. The gap is not an artifact of benchmark gaming: the full NanoGPT Speedrun is a real, competitive research trajectory.