MILKYWAY Shows Agent Scaffolding Can Outperform Fine-Tuning

A new paper from City University of Hong Kong, Tsinghua University, and USTC proposes MILKYWAY, a temporal prediction system that keeps the base LLM (GPT-5.4) entirely frozen and externalizes all learning into an editable text harness: a structured skill file that a second "harness editor" agent rewrites as new evidence arrives, before the event being predicted resolves.

What the Source Actually Says

The harness consists of three editable text components: F (factors relevant to the prediction), E (evidence gathered exclusively from primary sources), and T (uncertainty instructions). A harness editor agent rewrites these files as new data arrives (a Fed labor report or a central bank statement, for example), while the base model's weights remain untouched throughout. The system enforces a primary-sources-only rule: Federal Reserve press releases are admissible, news commentary is not.
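The F/E/T structure and the primary-sources-only rule can be pictured with a small Python sketch. Everything here is an illustrative assumption rather than the paper's implementation: the class and function names (Harness, Evidence, edit_harness), the domain allowlist, and the rendered prompt layout are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical allowlist. The source only says that primary sources such as
# Federal Reserve press releases are admitted and news commentary is not.
PRIMARY_SOURCES = {"federalreserve.gov"}


@dataclass
class Evidence:
    source_domain: str
    text: str


@dataclass
class Harness:
    factors: list[str] = field(default_factory=list)        # F
    evidence: list[Evidence] = field(default_factory=list)  # E
    uncertainty: list[str] = field(default_factory=list)    # T

    def render(self, question: str) -> str:
        """Serialize the harness into the text a frozen model would read."""
        lines = [f"Question: {question}", "Factors:"]
        lines += [f"- {f}" for f in self.factors]
        lines.append("Evidence:")
        lines += [f"- [{e.source_domain}] {e.text}" for e in self.evidence]
        lines.append("Uncertainty instructions:")
        lines += [f"- {t}" for t in self.uncertainty]
        return "\n".join(lines)


def edit_harness(harness: Harness, new_items: list[Evidence]) -> Harness:
    """Stand-in for the harness editor agent.

    Returns a new harness and admits only evidence whose domain is on the
    primary-source allowlist. No model weights are involved anywhere.
    """
    accepted = [e for e in new_items if e.source_domain in PRIMARY_SOURCES]
    return Harness(
        factors=list(harness.factors),
        evidence=harness.evidence + accepted,
        uncertainty=list(harness.uncertainty),
    )
```

In this sketch the only thing that changes between updates is the text returned by render(); the model that reads it is treated as a fixed function.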

The reported benchmark gains are large. MILKYWAY scores 61% on the Future-X and Future-World prediction benchmarks, against 44% for GPT-5.4 with live web search, a 17-percentage-point advantage. Five days before the predicted event, the reported scores are 70% versus 57%, a 13-point gap. The harness also enforces holding multiple hypotheses simultaneously rather than collapsing prematurely to a single confident answer, a technique the paper's authors describe as addressing the documented LLM failure mode of "premature uncertainty collapse."
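To make "holding multiple hypotheses" concrete, here is a numeric analogy in Python. It is not the paper's mechanism, which the article describes only as text instructions in the harness. The function name update_hypotheses, the Bayes-style reweighting, and the 0.05 floor are all assumptions chosen for illustration.

```python
def update_hypotheses(priors, likelihoods, floor=0.05):
    """Reweight hypotheses on new evidence without letting any collapse to ~0.

    priors and likelihoods are dicts keyed by hypothesis name. Each posterior
    is clamped to at least `floor` and the result is renormalized, so a final
    value can land slightly below `floor` but never near zero.
    """
    if set(priors) != set(likelihoods):
        raise ValueError("priors and likelihoods must cover the same hypotheses")
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    if total <= 0:
        raise ValueError("evidence rules out every hypothesis")
    posterior = {h: v / total for h, v in unnormalized.items()}
    floored = {h: max(p, floor) for h, p in posterior.items()}
    norm = sum(floored.values())
    return {h: p / norm for h, p in floored.items()}
```

With two equally weighted hypotheses and one piece of evidence that strongly favors the first, the second still keeps a few percent of the probability mass instead of vanishing, which is the behavior the uncertainty instructions are meant to encourage.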

The deepest limitation is also the most instructive: delete the harness and the model reverts to its base performance on that task. No learning enters the neural network; the knowledge lives in a text file alongside the LLM. A further failure mode, harness poisoning, occurs when an incorrect lesson encoded early in the temporal window propagates forward through every subsequent update, taking months to detect and correct.
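The propagation behind harness poisoning can be shown with a toy model in Python. The assumption, labeled as such, is that each harness version simply extends the previous one; the function names roll_forward and versions_containing are hypothetical, and the real editor presumably rewrites text rather than appending to a list.

```python
def roll_forward(initial_lessons, updates):
    """Return the list of lessons after each update.

    Assumption for illustration: every version is the previous version plus
    one new lesson, so nothing is removed unless someone edits it out.
    """
    versions = [list(initial_lessons)]
    for new_lesson in updates:
        versions.append(versions[-1] + [new_lesson])
    return versions


def versions_containing(versions, lesson):
    """Indices of the harness versions that still carry a given lesson."""
    return [i for i, lessons in enumerate(versions) if lesson in lessons]
```

Under this model a wrong lesson present at version 0 appears in every later version, which is why an early error is expensive to detect: each update looks locally reasonable while inheriting the mistake.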

Strategic Take

MILKYWAY is empirical evidence that externalized, editable text harnesses can deliver benchmark gains of the size usually attributed to model upgrades or fine-tuning. The skill-file pattern is converging across MILKYWAY, Legora's agent skills system, and Claude Code's Skills convention, which suggests that editable instruction files are solidifying as a cross-tool abstraction. For teams building agentic systems, this is an argument for investing in harness quality now rather than waiting on the next base-model release.

#llm-research

AI Intelligence Newsletter