Vol. 1 · Edition 023 · Free · No paywall

Everyone Needs a Samwise

AI news · Synthesized · Opinionated · 🌿

Composer 2.5 (standard tier, input / M tokens): $0.50
Claude Opus 4.7 (input / M tokens): $5.00

Tools & Infra
By Sam Taylor with Samwise

On Moonshot's Kimi K2.5 base, 85% post-training compute, benchmark parity with Opus 4.7, and whether 10× cheaper means what you think it means

Cursor bets on cheap open-weight post-training. The numbers make a strong case.

Source lean on this story (← Anti-AI · Pro-AI →): Anti-AI 00 · Skeptic 01 · Neutral 00 · Pro (practical) 02 · Pro (hyped) 00

Cursor shipped Composer 2.5 on May 18. It's the fourth Composer model in seven months, which is a pace that would have sounded like fiction two years ago. The release post leads with benchmarks. The actual story is underneath them.

Composer 2.5 is built on Kimi K2.5, an open-weight checkpoint from Moonshot AI. That base accounts for roughly 15% of the total training compute. Cursor spent the other 85% on its own post-training pipeline: reinforcement learning, continued pretraining, 25× more synthetic coding tasks than Composer 2, and a targeted text-feedback technique that applies corrections to specific spans within a generation rather than grading the entire output pass/fail. Standard RL post-training scores the whole rollout and updates the model from that single signal. What Cursor describes in its announcement post is localized feedback: the model learns which part of a completion was wrong, not just whether the output cleared a threshold.

That last part isn't in any headline. It should be.
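
To make the contrast concrete, here is a toy sketch in Python. It is not Cursor's implementation, which the announcement does not spell out. The function names, the reward values, and the span format are all assumptions made up for illustration. It shows only the general idea: a single pass/fail grade spreads the same signal across every token, while span-level feedback concentrates the penalty on the tokens flagged as wrong.

```python
from typing import List, Tuple


def outcome_level_signal(num_tokens: int, passed: bool) -> List[float]:
    """Whole-output grading: every token receives the same scalar."""
    reward = 1.0 if passed else -1.0
    return [reward] * num_tokens


def span_level_signal(
    num_tokens: int,
    bad_spans: List[Tuple[int, int]],
    penalty: float = -1.0,   # assumed value, for illustration only
    default: float = 0.0,    # assumed value, for illustration only
) -> List[float]:
    """Localized feedback: only tokens inside flagged [start, end) spans are penalized."""
    signal = [default] * num_tokens
    for start, end in bad_spans:
        # Clamp the span to the valid token range before applying the penalty.
        for i in range(max(start, 0), min(end, num_tokens)):
            signal[i] = penalty
    return signal


# Hypothetical completion with a single wrong token: "-" should be "+".
tokens = ["def", "add", "(a,", "b):", "return", "a", "-", "b"]

whole = outcome_level_signal(len(tokens), passed=False)
local = span_level_signal(len(tokens), bad_spans=[(6, 7)])

for tok, w, l in zip(tokens, whole, local):
    print(f"{tok:>8}  whole={w:+.1f}  span={l:+.1f}")
```

In the whole-output case the correct tokens are penalized as heavily as the wrong one. In the span case only the "-" carries a penalty, which is the property the announcement is pointing at.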

10×
Cheaper than Claude Opus 4.7 at standard tier ($0.50 vs $5.00 per million input tokens)

→ Source: Cursor blog

Composer 2.5 vs Claude Opus 4.7 — coding benchmarks
Benchmark                  Composer 2.5     Claude Opus 4.7
SWE-Bench Multilingual     79.8%            80.5%
CursorBench v3.1 ¹         63.2%            61.6%
Terminal-Bench 2.0         69.3%            69.4%
Standard input price       $0.50/M tokens   $5.00/M tokens
Standard output price      $2.50/M tokens   $25.00/M tokens
¹ CursorBench v3.1 is Cursor's own internal benchmark. Weight accordingly. Source: cursor.com/blog/composer-2-5
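
For a sense of what the standard-tier prices mean on a bill, here is a small worked example in Python. The prices come from the table above. The workload (200M input tokens, 40M output tokens) is a hypothetical figure, and the dictionary keys are just labels, not API model identifiers.

```python
# Prices in USD per million tokens, taken from the table above.
PRICES = {
    "composer-2.5-standard": {"input": 0.50, "output": 2.50},
    "claude-opus-4.7": {"input": 5.00, "output": 25.00},
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD for a job with the given token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


# Hypothetical monthly batch workload, chosen only to illustrate the arithmetic.
INPUT_TOKENS = 200_000_000
OUTPUT_TOKENS = 40_000_000

composer = job_cost("composer-2.5-standard", INPUT_TOKENS, OUTPUT_TOKENS)
opus = job_cost("claude-opus-4.7", INPUT_TOKENS, OUTPUT_TOKENS)

print(f"Composer 2.5 standard: ${composer:,.2f}")
print(f"Claude Opus 4.7:       ${opus:,.2f}")
print(f"Ratio:                 {opus / composer:.1f}x")
```

Because both the input and output prices differ by the same factor, the ratio stays at 10× for any mix of input and output tokens at the standard tier.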

Source spread

  • cursor.com/blog/composer-2-5 (builder). Primary source; benchmark tables, pricing, post-training methodology.
  • The New Stack (builder). Independent coverage of the compute split and training pipeline details.
  • DataCamp (builder). Clean benchmark comparison table with Opus 4.7 and GPT-5.5 context.
  • ChatForest (skeptic). Flags CursorBench as first-party and notes where Opus 4.7 retains the edge.

Pros & cons

What's real:

  • Three benchmarks, two of them external. On those two (SWE-Bench Multilingual and Terminal-Bench 2.0), Composer 2.5 is essentially tied with Opus 4.7. On Cursor's own CursorBench v3.1, it leads.
  • Standard tier is genuinely 10× cheaper on input: $0.50/M vs $5.00/M. The output ratio is the same: $2.50/M vs $25.00/M.
  • The jump from Composer 2 to 2.5 on Terminal-Bench — from 61.7% to 69.3% — is a 7.6-point gain in a single version. That's large.
  • Open-weight base means Cursor is no longer entirely dependent on Anthropic or OpenAI for model supply. That's a different kind of infrastructure resilience.

What deserves skepticism:

  • CursorBench v3.1 is Cursor's benchmark. The company built it, runs it, and reports the scores. Not invalid — but treat it as supplementary to the external evals.
  • Fast tier pricing ($3.00/M input, $15.00/M output) is not 10× cheaper. Against Opus 4.7's $5.00/$25.00 it works out to roughly 1.7× cheaper, which puts it in the range of other premium models. "10× cheaper" only holds at the standard tier.
  • "Improved communication style and effort calibration" is a quality claim that doesn't appear in any benchmark. I'll believe it when I use it.
  • Composer 3 will ship. Four models in seven months. Factor in migration costs if you build tightly against this API version.
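
One common way to limit the migration cost flagged in the last bullet is to keep model identifiers out of call sites and resolve them in one place. The sketch below is a generic pattern, not anything Cursor documents. The role names, the environment variable scheme, and the model identifier strings (including "composer-3") are all hypothetical placeholders.

```python
import os
from typing import Mapping

# Hypothetical defaults: roles map to model identifiers in exactly one place.
# The identifier strings are placeholders, not confirmed API model names.
DEFAULT_MODELS = {
    "batch-coding": "composer-2.5",
    "hard-refactor": "claude-opus-4.7",
}


def resolve_model(role: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the model identifier for a role, allowing an override via env.

    The override variable name is derived from the role, for example
    "batch-coding" becomes MODEL_BATCH_CODING (an assumed convention).
    """
    key = "MODEL_" + role.upper().replace("-", "_")
    return env.get(key, DEFAULT_MODELS[role])


# Call sites ask for a role, never a specific version.
current = resolve_model("batch-coding", env={})

# When a successor ships, the switch is a config change rather than a code change.
upgraded = resolve_model("batch-coding", env={"MODEL_BATCH_CODING": "composer-3"})

print(current, "->", upgraded)
```

The indirection does not remove the need to re-run evals on a new version. It only keeps the version string from spreading through the codebase.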

For builders

  • Standard tier ($0.50/$2.50 per million tokens) is the 10× cheaper option. Fast tier ($3.00/$15.00) is for latency-sensitive interactive use and is not 10× cheaper vs Opus 4.7.
  • SWE-Bench Multilingual (79.8%) and Terminal-Bench 2.0 (69.3%) are the external benchmarks to trust. CursorBench v3.1 is Cursor's own metric; treat it accordingly.
  • If you're routing batch coding tasks to Opus 4.7 via API, run an A/B eval against Composer 2.5 standard tier. The cost difference is significant enough to justify the test.
  • The Composer 2 technical report remains the best documentation of Cursor's methodology. Watch for a Composer 2.5 version.
  • Build with migration in mind. Four models in seven months means locking tightly to a specific Composer version carries real costs.
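
The A/B eval suggested above can be sketched as a small harness that reports pass rate and cost per solved task for each model. Everything beyond the listed prices is an assumption: the `Arm` and `RunResult` types, the `run` callable (which in practice would call the model and run your own tests), and the stub outcomes and token counts used in the demo, which are invented and are not measurements.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RunResult:
    passed: bool
    input_tokens: int
    output_tokens: int


@dataclass
class Arm:
    name: str
    input_price_per_m: float   # USD per million input tokens
    output_price_per_m: float  # USD per million output tokens
    run: Callable[[str], RunResult]  # hypothetical: call the model, run your tests


def evaluate(arm: Arm, tasks: List[str]) -> Dict[str, float]:
    """Run every task through one arm and summarize quality and cost."""
    results = [arm.run(task) for task in tasks]
    solved = sum(r.passed for r in results)
    cost = sum(
        r.input_tokens * arm.input_price_per_m
        + r.output_tokens * arm.output_price_per_m
        for r in results
    ) / 1_000_000
    return {
        "pass_rate": solved / len(tasks),
        "total_cost_usd": cost,
        "cost_per_solved_usd": cost / solved if solved else float("inf"),
    }


# Stub runners with made-up outcomes, standing in for real model calls.
def stub_a(task: str) -> RunResult:
    return RunResult(passed=(task != "t4"), input_tokens=10_000, output_tokens=2_000)


def stub_b(task: str) -> RunResult:
    return RunResult(passed=True, input_tokens=10_000, output_tokens=2_000)


tasks = ["t1", "t2", "t3", "t4"]
arm_a = Arm("composer-2.5-standard", 0.50, 2.50, stub_a)
arm_b = Arm("claude-opus-4.7", 5.00, 25.00, stub_b)

report_a = evaluate(arm_a, tasks)
report_b = evaluate(arm_b, tasks)

for arm, report in ((arm_a, report_a), (arm_b, report_b)):
    print(arm.name, report)
```

Cost per solved task is the number to watch. A cheaper model that fails more often can still win on this metric, but only your own task set will show by how much.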
