Agent Harness Engineering: Same Model, 6x Performance Variance

The orchestration wrapper around your LLM now drives more performance variation than the model itself. Two complementary papers, from Tsinghua University (March 2026) and the University of Melbourne (April 30, 2026), formalise what practitioners have been noticing: the same frontier model, wrapped differently, produces a six-fold performance spread. On the same day, an arXiv benchmark confirmed that small open-weight models can match GPT-5 on routine tasks, and LangChain shipped model-agnostic harness profiles: tooling validation arriving within hours of the research.

What the Source Actually Says

Prompt Engineering's breakdown of the Tsinghua and Stanford papers provides the clearest numbers. Tsinghua's Pan et al. migrated OS-Symphony's control logic from native code to a natural-language harness: same strategy, same model, different representation. Performance jumped from 30.4% to 47.2%, LLM calls collapsed from 1,200 to 34, and runtime fell from 361 to 41 minutes. Stanford's Khattab (DSPy) extended this with an auto-optimisation loop in which a meta-LLM rewrites the harness from raw failure traces (10 million tokens per iteration, 400x more feedback than prior methods). A harness optimised on Claude Opus 4.6 transferred to five other models and improved them all. The ablation result overturns common intuition: verifiers hurt (−8.4 on OSWorld) and multi-candidate search hurt (−5.6). More structure degrades capable models.
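The shape of that auto-optimisation loop can be sketched in a few lines. Everything here is a hypothetical stand-in, not the DSPy API: `run_harness`, `meta_rewrite`, and the model callables are invented names, and the scoring is illustrative.

```python
def run_harness(harness_prompt, tasks, model):
    """Run each task through the harness; return (score, failure traces).

    `model(prompt, task)` is a stand-in callable returning (ok, trace).
    """
    failures, passed = [], 0
    for task in tasks:
        ok, trace = model(harness_prompt, task)
        if ok:
            passed += 1
        else:
            failures.append(trace)
    return passed / len(tasks), failures


def meta_rewrite(harness_prompt, failure_traces, meta_model):
    """Ask a meta-LLM to rewrite the harness from raw failure traces."""
    return meta_model(harness_prompt, failure_traces)


def optimise(harness_prompt, tasks, model, meta_model, iterations=3):
    """Iterate: score the harness, feed failures to the meta-LLM, repeat."""
    current = harness_prompt
    best_prompt, best_score = current, -1.0
    for _ in range(iterations):
        score, failures = run_harness(current, tasks, model)
        if score > best_score:
            best_prompt, best_score = current, score
        if not failures:          # nothing left to learn from
            break
        current = meta_rewrite(current, failures, meta_model)
    return best_prompt, best_score
```

The point of the sketch is the feedback channel: the meta-LLM sees raw traces, not a scalar reward, which is where the 400x feedback figure comes from.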

The University of Melbourne paper supplies the procedural-task evidence. Running the same Claude Sonnet 4.5 under two conditions (LangGraph orchestration versus a serialised flowchart in the system prompt), in-context learning (ICL) prompting won on every metric across travel booking, Zoom support, and insurance claims. LangGraph failed 24% of travel-booking runs on handoff errors; the Zoom task logged 18 LangGraph failures against a single ICL failure. The mechanism: per-node templates fragment the model's global reasoning arc, while ICL preserves it. Caveat: the results cover simulated conversational procedures; heavy real-world tool use was not tested.
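The ICL condition amounts to flattening the whole procedure into one system prompt so the model sees the complete arc at once, rather than one node at a time. A minimal sketch, with an invented flowchart and helper name (`serialise_flowchart`), not the paper's materials:

```python
# Toy flowchart: node -> (instruction, next node). Illustrative content only.
FLOWCHART = {
    "start": ("Greet the customer and ask for their booking reference", "verify"),
    "verify": ("Look up the booking; if not found, ask again", "resolve"),
    "resolve": ("Apply the change or escalate to a human agent", "end"),
}


def serialise_flowchart(chart, entry="start"):
    """Flatten the flowchart into numbered steps for the system prompt,
    so the model sees the whole procedure in a single context."""
    lines, node, step = [], entry, 1
    while node in chart:
        instruction, nxt = chart[node]
        lines.append(f"{step}. [{node}] {instruction}")
        node, step = nxt, step + 1
    return "Follow this procedure:\n" + "\n".join(lines)


SYSTEM_PROMPT = serialise_flowchart(FLOWCHART)
```

The contrast with graph orchestration is that no framework code sits between steps: every transition is the model's own reasoning over the serialised procedure.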

AgentFloor's 16,542 scored runs across 16 open-weight models against GPT-5 supply the routing corollary: the strongest open-weight model matches GPT-5 in aggregate, and the frontier advantage persists only on long-horizon planning under persistent constraints. LangChain's deepagents-cli profiles API, shipped the same day, operationalises this with per-model prompt and middleware bundles for Kimi, Qwen, and GLM, making open-weight models viable agent-loop drivers at a fraction of closed-frontier cost.
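That corollary implies a crude dispatcher: routine tasks go to an open-weight model, long-horizon constrained planning to the frontier model. The task fields and threshold below are illustrative assumptions, not AgentFloor's actual criteria:

```python
def pick_model(task):
    """Route a task dict to "frontier" or "open-weight".

    Hypothetical heuristic: only tasks that combine deep planning with
    persistent constraints justify frontier pricing.
    """
    long_horizon = task.get("planning_steps", 0) > 10
    constrained = task.get("persistent_constraints", False)
    return "frontier" if (long_horizon and constrained) else "open-weight"
```

In practice the routing signal would come from offline scoring of your own task mix, but the structure (a cheap default with a narrow frontier escape hatch) follows directly from the benchmark's finding.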

Strategic Take

The operative question has shifted from "which model?" to "which harness components should I remove?" The Anthropic-coined subtraction principle (every harness component encodes an assumption that expires as models improve) is now backed by four independent source types published on the same day. Audit context load, unused tools, verification loops, and code-versus-language control logic before reaching for a model swap.
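That audit can be run as a literal subtraction experiment: re-score the agent with each component removed and keep only the components whose removal hurts. A minimal sketch, where `evaluate` is a hypothetical scoring hook you would supply (e.g. benchmark pass rate for a given component set):

```python
def subtraction_audit(components, evaluate, tolerance=0.0):
    """Return the components worth keeping under the given evaluator.

    A component survives only if ablating it drops the score by more
    than `tolerance` below the full-harness baseline.
    """
    baseline = evaluate(components)
    keep = []
    for component in components:
        ablated = [c for c in components if c != component]
        if evaluate(ablated) < baseline - tolerance:
            keep.append(component)   # removal hurt: it earns its place
    return keep
```

The Tsinghua ablations are exactly this experiment run by hand: the verifier and multi-candidate search both scored higher when removed, so under this procedure they would be dropped.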