Anthropic's Project Deal: Agents Closed 186 Trades — Humans Couldn't Tell the Difference
Anthropic ran a live internal experiment it's calling Project Deal: 69 employees were each interviewed by a Claude agent to model their preferences, then paired off in a two-sided marketplace where Claude agents negotiated buying and selling on their behalf. The result: 186 deals closed, with over $4,000 in total transaction volume. Nearly half of survey participants said they would pay for a service like this commercially. Almost none noticed that the model quality underneath them varied: some were backed by Opus, others by Haiku.
What the Source Actually Says
Four parallel markets were run, varying the underlying model between Opus and Haiku without disclosure. Opus models achieved substantially better deal outcomes than Haiku models in controlled simulation runs, but human participants in the live marketplace did not register the difference. This is a double finding: agent quality asymmetries are real and measurable, but they are not legible to the people those agents represent in real time.
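The statistical shape of that double finding can be sketched with a toy simulation. All numbers here are invented for illustration, not taken from the experiment: two agent tiers whose per-deal outcomes overlap heavily, so the aggregate gap is unmistakable while any single participant, seeing one deal, can barely do better than a coin flip.

```python
import random
import statistics

random.seed(0)

# Hypothetical per-deal surplus distributions (invented numbers):
# the stronger tier is better on average, but per-deal noise dwarfs the gap.
N = 500
strong_tier = [random.gauss(25, 15) for _ in range(N)]
weak_tier = [random.gauss(20, 15) for _ in range(N)]

# Aggregate view: the quality gap is clearly measurable.
gap = statistics.mean(strong_tier) - statistics.mean(weak_tier)

# Participant view: given ONE deal from each tier, how often would a
# person correctly guess which agent was stronger?
wins = sum(s > w for s, w in zip(strong_tier, weak_tier))
single_deal_accuracy = wins / N

print(f"aggregate gap: {gap:.1f}")                 # clearly positive
print(f"single-deal accuracy: {single_deal_accuracy:.2f}")  # barely above chance
```

The point is not the specific numbers but the regime: a difference that is easy to measure across hundreds of deals can be nearly invisible from the one or two deals any individual participant observes.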
The preference-modelling fidelity was striking in individual cases — one agent inferred a user's preferences so accurately from the interview that it purchased the exact snowboard he already owned. When given latitude to act on its own behalf, Claude purchased 19 ping-pong balls, a result that illustrates both the capability (it completed a transaction) and the alignment question (it chose ping-pong balls). Negotiation persona customisation ("exasperated cowboy" style variations) made no meaningful difference to deal outcomes; polite Claudes and hardball Claudes performed equivalently.
Anthropic's stated conclusions are deliberately modest: markets of AI agents can provide value, they have rough edges, and policy and legal frameworks need to adapt before they can scale. Ethan Mollick, writing independently the same day, named agent organisation design and multi-agent benchmarking as the "next critical frontier for economic AI impact" — and flagged both as currently unsolved. The two signals from opposite sides of the research-commentary divide are aligned: the capability is here; the governance and evaluation apparatus is not.
Strategic Take
The invisible-quality-gap finding is the most commercially interesting result. If agent capability differences are not legible to the humans in the loop, procurement decisions default to cost rather than quality — which structurally advantages the cheapest-good-enough model tier. For platforms offering AI-mediated services, this is either a margin opportunity (use Haiku where Opus is undetectable) or a liability (participants can't consent to the quality they're getting). Both readings are correct.
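The cost-default dynamic above is a standard information-asymmetry result, and a minimal sketch makes it concrete. The tier names, prices, and quality scores below are invented for illustration; the only claim is structural: when quality is illegible, only price differentiates, so the cheapest tier wins even when an informed buyer would choose otherwise.

```python
# Hypothetical tiers (invented numbers): premium costs more but is better.
tiers = {
    "premium": {"price": 15.0, "true_quality": 0.9},
    "budget": {"price": 1.0, "true_quality": 0.7},
}

def choose(tiers: dict, quality_visible: bool) -> str:
    """Pick a tier as a buyer would, given what the buyer can observe."""
    if quality_visible:
        # Informed buyer: maximise net value (here, quality valued at $100).
        return max(tiers, key=lambda t: 100 * tiers[t]["true_quality"] - tiers[t]["price"])
    # Uninformed buyer: quality is illegible, so price is the only signal.
    return min(tiers, key=lambda t: tiers[t]["price"])

print(choose(tiers, quality_visible=True))   # informed buyer picks premium
print(choose(tiers, quality_visible=False))  # uninformed buyer picks budget
```

Flip the visibility flag and the equilibrium flips with it, which is exactly why the same finding reads as a margin opportunity to the platform and an informed-consent problem to the participant.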