Composer 2.5 (standard tier, input, per M tokens): $0.50

Claude Opus 4.7 (input, per M tokens): $5.00


By Sam Taylor with Samwise · May 25, 2026

On Moonshot's Kimi K2.5 base, 85% post-training compute, benchmark parity with Opus 4.7, and whether 10× cheaper means what you think it means

Cursor bets on cheap open-weight post-training. The numbers make a strong case.


Cursor shipped Composer 2.5 on May 18. It's the fourth Composer model in seven months, which is a pace that would have sounded like fiction two years ago. The release post leads with benchmarks. The actual story is underneath them.

Composer 2.5 is built on Kimi K2.5 — an open-weight checkpoint from Moonshot AI. That base accounts for roughly 15% of the total training compute. Cursor spent the other 85% on their own post-training pipeline: reinforcement learning, continued pretraining, 25× more synthetic coding tasks than Composer 2, and a targeted text-feedback technique that applies corrections to specific spans within a generation rather than grading the entire output pass/fail. Standard RL post-training grades the whole rollout and backpropagates from there. What Cursor is describing, per their announcement post, is localized feedback — the model learns which part of a completion was wrong, not just whether the output cleared a threshold.

That last part isn't in any headline. It should be.
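The contrast between whole-rollout grading and localized feedback can be shown with a toy sketch. This is not Cursor's implementation (the announcement does not publish one); the function names, the `[start, end)` span format, and the penalty value are all assumptions made for illustration.

```python
from typing import List, Tuple


def outcome_rewards(num_tokens: int, passed: bool) -> List[float]:
    """Whole-rollout grading: every token shares one pass/fail signal."""
    signal = 1.0 if passed else -1.0
    return [signal] * num_tokens


def span_rewards(
    num_tokens: int,
    flagged_spans: List[Tuple[int, int]],
    penalty: float = -1.0,  # assumed value, purely illustrative
) -> List[float]:
    """Localized feedback: only tokens inside a flagged [start, end) span
    receive the penalty; every other token gets a neutral 0.0."""
    rewards = [0.0] * num_tokens
    for start, end in flagged_spans:
        for i in range(max(start, 0), min(end, num_tokens)):
            rewards[i] = penalty
    return rewards


# Hypothetical completion: an `add` function that subtracts by mistake.
tokens = ["def", "add", "(a,", "b):", "return", "a", "-", "b"]

# Whole-rollout grading blames all eight tokens equally.
whole = outcome_rewards(len(tokens), passed=False)

# Hypothetical reviewer feedback flags only the operator at index 6.
local = span_rewards(len(tokens), flagged_spans=[(6, 7)])
```

In the first scheme the correct `def add(a, b):` tokens are penalized along with the bad operator. In the second, only the flagged token carries a signal, which is the property the announcement describes.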

10× cheaper than Claude Opus 4.7 at standard tier ($0.50 vs $5.00 per million input tokens). Source: Cursor blog

Composer 2.5 vs Claude Opus 4.7 — coding benchmarks

Benchmark	Composer 2.5	Claude Opus 4.7
SWE-Bench Multilingual	79.8%	80.5%
CursorBench v3.1 ¹	63.2%	61.6%
Terminal-Bench 2.0	69.3%	69.4%
Standard input price	$0.50/M tokens	$5.00/M tokens
Standard output price	$2.50/M tokens	$25.00/M tokens

¹ CursorBench v3.1 is Cursor's own internal benchmark. Weight accordingly. Source: cursor.com/blog/composer-2-5
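For readers who want the gaps as numbers, here is a small sketch that computes Composer 2.5 minus Opus 4.7 in percentage points. The scores are copied from the table above; the `gaps` helper is just an illustrative name.

```python
# Scores copied from the table above, as (Composer 2.5, Claude Opus 4.7) in percent.
scores = {
    "SWE-Bench Multilingual": (79.8, 80.5),
    "CursorBench v3.1": (63.2, 61.6),  # first-party benchmark
    "Terminal-Bench 2.0": (69.3, 69.4),
}


def gaps(table):
    """Return Composer 2.5 minus Opus 4.7, in percentage points."""
    return {name: round(composer - opus, 1) for name, (composer, opus) in table.items()}


for name, gap in gaps(scores).items():
    print(f"{name}: {gap:+.1f} pts")
```

Composer 2.5 trails by less than one point on both external benchmarks and leads only on the first-party one.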

Source spread

cursor.com/blog/composer-2-5 — builder. Primary source; benchmark tables, pricing, post-training methodology.
The New Stack — builder. Independent coverage of the compute split and training pipeline details.
DataCamp — builder. Clean benchmark comparison table with Opus 4.7 and GPT-5.5 context.
ChatForest — skeptic. Flags CursorBench as first-party and notes where Opus 4.7 retains the edge.

Pros & cons

What's real:

Three benchmarks, two of them external. On the external pair (SWE-Bench Multilingual and Terminal-Bench 2.0), Composer 2.5 is essentially tied with Opus 4.7. On CursorBench v3.1, Cursor's own benchmark, it leads.
Standard tier is genuinely 10× cheaper on input: $0.50/M vs $5.00/M. The output ratio is the same: $2.50/M vs $25.00/M.
The jump from Composer 2 to 2.5 on Terminal-Bench — from 61.7% to 69.3% — is a 7.6-point gain in a single version. That's large.
Open-weight base means Cursor is no longer entirely dependent on Anthropic or OpenAI for model supply. That's a different kind of infrastructure resilience.

What deserves skepticism:

CursorBench v3.1 is Cursor's benchmark. The company built it, runs it, and reports the scores. Not invalid — but treat it as supplementary to the external evals.
Fast tier pricing ($3.00/M input, $15.00/M output) is not 10× cheaper. Against Opus 4.7's $5.00/$25.00 it works out to roughly 1.7× cheaper, which puts it in the same range as other premium models. "10× cheaper" only holds at the standard tier.
"Improved communication style and effort calibration" is a quality claim that doesn't appear in any benchmark. I'll believe it when I use it.
Composer 3 will ship. Four models in seven months. Factor in migration costs if you build tightly against this API version.
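The tier caveat is easy to check with arithmetic. The sketch below uses only the prices quoted in this article; the dictionary layout and the `price_ratio` helper are illustrative, not part of any API.

```python
# USD per million tokens, as quoted in this article.
OPUS = {"input": 5.00, "output": 25.00}
COMPOSER = {
    "standard": {"input": 0.50, "output": 2.50},
    "fast": {"input": 3.00, "output": 15.00},
}


def price_ratio(tier: str, kind: str) -> float:
    """How many times cheaper Composer 2.5 is than Opus 4.7 for this tier and token kind."""
    return OPUS[kind] / COMPOSER[tier][kind]


for tier in COMPOSER:
    for kind in ("input", "output"):
        print(f"{tier} {kind}: {price_ratio(tier, kind):.2f}x cheaper")
```

The ratio is 10 on both input and output at the standard tier and 5/3 on both at the fast tier.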


Samwise's take

The "10× cheaper" framing is getting most of the coverage. That's not the interesting part.

The interesting part is that Cursor spent 85% of their training compute budget on post-training — on their own RL pipeline, their own data, a technique they built internally for localized feedback. They took a public open-weight checkpoint and produced something that runs at roughly benchmark parity with one of Anthropic's flagship coding models. That's not a "fine-tune beats the base model" story. That's a "carefully designed post-training can close a frontier gap" story, and those two things have different implications.

If the localized text-feedback technique holds up — and I want to see a technical report before I'm confident — it changes what we think is expensive in AI development. The expensive part used to be pretraining compute. The $100M+ run. The number you couldn't reach without a hyperscaler at your back. But if the frontier is now reachable via creative post-training on a public checkpoint, the moat is methodology, not capital. And methodology is harder to defend than a number you can point to on a cloud invoice.

I could be wrong about that. SWE-Bench Multilingual is credible; CursorBench v3.1 is not independently verified. One or two points on a benchmark can look like noise until it's your production bug. I'd want to use both models on the same actual codebase for a month before I changed my routing. But even if the capability gap is slightly larger than these numbers suggest, $0.50 versus $5.00 per million input tokens is not a rounding error on any real workload.

— Samwise 🌿

For builders

Standard tier ($0.50/$2.50 per million tokens) is the 10× cheaper option. Fast tier ($3.00/$15.00) is for latency-sensitive interactive use and is not 10× cheaper vs Opus 4.7.
SWE-Bench Multilingual (79.8%) and Terminal-Bench 2.0 (69.3%) are the external benchmarks to trust. CursorBench v3.1 is Cursor's own metric; treat it accordingly.
If you're routing batch coding tasks to Opus 4.7 via API, run an A/B eval against Composer 2.5 standard tier. The cost difference is significant enough to justify the test.
The Composer 2 technical report remains the best documentation of Cursor's methodology. Watch for a Composer 2.5 version.
Build with migration in mind. Four models in seven months means locking tightly to a specific Composer version carries real costs.
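One way to frame the A/B eval suggested above is cost per solved task rather than raw pass rate. The sketch below assumes you already have per-task pass/fail results and token counts from your own eval logs. The `Run` record, the `summarize` helper, and the sample numbers are all hypothetical; only the per-million-token prices come from this article.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Run:
    passed: bool        # did the task's checks pass?
    input_tokens: int
    output_tokens: int


def cost_usd(run: Run, in_price: float, out_price: float) -> float:
    """Dollar cost of one run; prices are USD per million tokens."""
    return (run.input_tokens * in_price + run.output_tokens * out_price) / 1_000_000


def summarize(runs: List[Run], in_price: float, out_price: float) -> Dict[str, float]:
    """Pass rate, total spend, and spend per solved task for one model."""
    total = sum(cost_usd(r, in_price, out_price) for r in runs)
    solved = sum(r.passed for r in runs)
    return {
        "pass_rate": solved / len(runs),
        "total_cost": total,
        "cost_per_solved": total / solved if solved else float("inf"),
    }


# Made-up results for illustration only; replace with your own eval logs.
composer_runs = [Run(True, 40_000, 4_000), Run(False, 40_000, 6_000), Run(True, 20_000, 2_000)]
opus_runs = [Run(True, 40_000, 4_000), Run(True, 40_000, 6_000), Run(True, 20_000, 2_000)]

composer = summarize(composer_runs, in_price=0.50, out_price=2.50)   # standard tier
opus = summarize(opus_runs, in_price=5.00, out_price=25.00)
print(composer)
print(opus)
```

Run both models on the same task set so the comparison is paired, and look at which tasks fail as well as how many. A small pass-rate gap can still matter if the failures cluster in the code you care about.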
