FinAI Study: Same Model, Different Harness, and a More Than 3× Accuracy Swing
A 13-institution, NVIDIA-funded benchmark published May 13 ran four frontier LLMs across five agent frameworks on financial tasks including trading, hedging, market insights, and auditing. The headline finding: Claude Sonnet 4.6 hit 66.15% auditing accuracy under Claude Code or OpenClaw but collapsed to 20% under ReAct with the identical model backbone, a roughly 3.3× swing (about 46 percentage points) from framework choice alone. No configuration maintained its performance when the live evaluation regime shifted from bearish to bullish market conditions.
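For readers who want to check the arithmetic behind the swing, here is a minimal sketch that uses only the three auditing figures quoted above. The dictionary and variable names are illustrative and do not come from the study, and the calculation assumes the quoted percentages are exact.

```python
# Auditing accuracy (%) for Claude Sonnet 4.6, as quoted in the summary above.
# The dictionary layout and names are illustrative, not taken from the study.
accuracy_by_harness = {
    "Claude Code": 66.15,
    "OpenClaw": 66.15,
    "ReAct": 20.0,
}

best = max(accuracy_by_harness.values())
worst = min(accuracy_by_harness.values())

swing_ratio = best / worst   # multiplicative swing between best and worst harness
gap_points = best - worst    # absolute gap in percentage points

print(f"swing: {swing_ratio:.2f}x, gap: {gap_points:.2f} points")
# prints "swing: 3.31x, gap: 46.15 points"
```

Reporting both the ratio and the percentage-point gap is a deliberate choice: a ratio alone can look dramatic when the lower figure is small, while the gap shows how much accuracy is actually lost.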
Why It Matters
Framework selection is now a quantifiable business risk for agentic deployments. When harness choice alone can move accuracy by more than 3×, model selection becomes a secondary variable: enterprises that optimize only on model benchmarks are optimizing the wrong dimension.