Cursor Composer 2.5: 79.8% on SWE-Bench Multilingual at Under $1/Task

Cursor has shipped Composer 2.5, its first internally trained coding model. It scores 79.8% on SWE-Bench Multilingual, matching Claude Opus 4.7 and GPT-5.5, at under $1 per task versus up to $11 for comparable models. The gain comes not from switching to a new base model but from training the open-source Kimi K2.5 on 25× more synthetic tasks with mid-task feedback loops. A separate from-scratch model is in training on SpaceXAI's Colossus cluster.

What the Source Actually Says

AlphaSignal's May 19 newsletter provides the technical mechanics. Composer 2.5 starts from the same open-source Kimi K2.5 base Cursor has used before, trains on 25× the volume of synthetic tasks, and introduces a mid-task feedback mechanism that lets the model learn from intermediate mistakes rather than only from final outcomes. The mechanism is described as a student who gets notes on exactly where they went wrong, not just a final grade. The measurable results: the SWE-Bench Multilingual score rises from 73.7% to 79.8% (a 6.1-point gain), reaching benchmark parity with Claude Opus 4.7 and GPT-5.5; performance is meaningfully stronger on long multi-file tasks and complex multi-step instructions; and cost stays under $1 per task where rival models charge up to $11. Double usage credits are available to all Cursor users for one week post-launch. The model is available only in the Cursor IDE, CLI, and web, with no public API.
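To make the "notes versus final grade" contrast concrete, here is a toy Python sketch of the difference between an outcome-only training signal and a per-step one. It is an illustration only: the source does not describe Cursor's actual training setup, and the `Step` type, the `ok` flag, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str  # what the agent did at this step (hypothetical label)
    ok: bool     # hypothetical per-step check, e.g. "did the edit still build?"


def outcome_only_signal(steps: list[Step], final_pass: bool) -> list[float]:
    """Every step inherits the same final grade, so mistakes are not localized."""
    grade = 1.0 if final_pass else 0.0
    return [grade for _ in steps]


def mid_task_signal(steps: list[Step]) -> list[float]:
    """Each step gets its own feedback, pointing at where things went wrong."""
    return [1.0 if s.ok else -1.0 for s in steps]


# Made-up trajectory: the second step introduces the mistake.
trajectory = [
    Step("read failing test", ok=True),
    Step("edit wrong file", ok=False),
    Step("run test suite", ok=True),
]

final_grade_only = outcome_only_signal(trajectory, final_pass=False)
per_step_notes = mid_task_signal(trajectory)
print(final_grade_only)  # [0.0, 0.0, 0.0]
print(per_step_notes)    # [1.0, -1.0, 1.0]
```

In the outcome-only case all three steps look equally bad; in the per-step case only the faulty edit is penalized, which is the intuition behind the student analogy.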

Hugging Face CEO Clément Delangue underscored the launch's significance, framing Composer 2.5 as evidence that "ultimately all serious companies in AI will want to train models themselves, based on open-source instead of outsourcing via APIs." The launch is Cursor's first model built under that philosophy. Separately, Cursor is training an entirely new from-scratch model on a one-million-H100-equivalent SpaceXAI Colossus cluster, a longer-horizon project distinct from this fine-tuning work.

Strategic Take

The pattern is reusable: take an open-source base, train it vertically on a 25× larger domain-specific task corpus, and cut per-task cost by up to roughly 91% ($1 versus $11) while matching frontier accuracy on the relevant benchmark. SWE-Bench Multilingual parity at $1 versus $11 is not a curiosity; it is a procurement argument. Teams evaluating coding agents should run Composer 2.5 against their own task distribution before defaulting to frontier API pricing.
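The cost arithmetic, and the kind of summary a team might compute on its own tasks, can be sketched in a few lines of Python. The $1 and $11 figures are the upper bounds quoted above; the `TaskResult` type, the `summarize` helper, and the sample data are hypothetical and stand in for whatever a team's own evaluation run would record.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    resolved: bool   # did the agent's change pass the task's checks?
    cost_usd: float  # measured spend on this task


def cost_reduction(cheap: float, expensive: float) -> float:
    """Fractional saving when moving from `expensive` to `cheap` per task."""
    return (expensive - cheap) / expensive


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Resolve rate and cost figures for one agent on one task set."""
    if not results:
        raise ValueError("need at least one task result")
    total_cost = sum(r.cost_usd for r in results)
    resolved = sum(r.resolved for r in results)
    return {
        "resolve_rate": resolved / len(results),
        "cost_per_task": total_cost / len(results),
        "cost_per_resolved": total_cost / resolved if resolved else float("inf"),
    }


# Article figures: under $1 versus up to $11 per task.
saving = cost_reduction(1.00, 11.00)
print(f"cost reduction: {saving:.1%}")  # cost reduction: 90.9%

# Made-up results for illustration only, not measurements.
sample = [
    TaskResult(True, 0.80),
    TaskResult(False, 1.10),
    TaskResult(True, 0.95),
    TaskResult(True, 0.75),
]
summary = summarize(sample)
print(summary)
```

Cost per resolved task is the more useful number for procurement, since a cheap agent that fails often on a team's own workload can cost more per fix than its per-task price suggests.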