On Moonshot's Kimi K2.5 base, 85% post-training compute, benchmark parity with Opus 4.7, and whether 10× cheaper means what you think it means
Cursor bets on cheap open-weight post-training. The numbers make a strong case.
Anti-AI 0 · Skeptic 1 · Neutral 0 · Pro (practical) 2 · Pro (hyped) 0 (← Anti-AI · Pro-AI →)
Cursor shipped Composer 2.5 on May 18. It's the fourth Composer model in seven months, which is a pace that would have sounded like fiction two years ago. The release post leads with benchmarks. The actual story is underneath them.
Composer 2.5 is built on Kimi K2.5 — an open-weight checkpoint from Moonshot AI. That base accounts for roughly 15% of the total training compute. Cursor spent the other 85% on their own post-training pipeline: reinforcement learning, continued pretraining, 25× more synthetic coding tasks than Composer 2, and a targeted text-feedback technique that applies corrections to specific spans within a generation rather than grading the entire output pass/fail. Standard RL post-training grades the whole rollout and backpropagates from there. What Cursor is describing, per their announcement post, is localized feedback — the model learns which part of a completion was wrong, not just whether the output cleared a threshold.
That last part isn't in any headline. It should be.
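To make the distinction concrete, here is a toy sketch of the two grading styles as per-token credit assignment. It is an illustration of the general idea only, not Cursor's pipeline: the `SpanFeedback` type, both functions, and the example tokens are hypothetical, and a real training loop would turn these signals into gradient updates in ways the announcement does not detail.

```python
from dataclasses import dataclass


@dataclass
class SpanFeedback:
    """Hypothetical container for one piece of localized feedback."""
    start: int    # index of the first token in the span (inclusive)
    end: int      # index one past the last token in the span (exclusive)
    score: float  # negative = this span was wrong, positive = it was good


def outcome_credit(num_tokens: int, passed: bool) -> list[float]:
    """Whole-rollout grading: every token shares one pass/fail signal."""
    reward = 1.0 if passed else -1.0
    return [reward] * num_tokens


def span_credit(num_tokens: int, feedback: list[SpanFeedback]) -> list[float]:
    """Localized grading: only tokens inside a flagged span get a signal."""
    credit = [0.0] * num_tokens
    for fb in feedback:
        # Clamp the span to the generation so bad indices cannot overflow.
        for i in range(max(fb.start, 0), min(fb.end, num_tokens)):
            credit[i] += fb.score
    return credit


# Toy completion with a single wrong token: "-" where "+" was intended.
tokens = ["def", "add", "(a,b):", "return", "a", "-", "b"]

whole = outcome_credit(len(tokens), passed=False)
local = span_credit(len(tokens), [SpanFeedback(start=5, end=6, score=-1.0)])

print(whole)  # every token is penalized equally
print(local)  # only the "-" token is penalized
```

Under whole-rollout grading the six correct tokens take the same penalty as the wrong one. Under span-level feedback only the faulty token carries a signal, which is the property the announcement is pointing at.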
| Metric | Composer 2.5 | Claude Opus 4.7 |
|---|---|---|
| SWE-Bench Multilingual | 79.8% | 80.5% |
| CursorBench v3.1 ¹ | 63.2% | 61.6% |
| Terminal-Bench 2.0 | 69.3% | 69.4% |
| Standard input price | $0.50/M tokens | $5.00/M tokens |
| Standard output price | $2.50/M tokens | $25.00/M tokens |

¹ First-party benchmark: built, run, and reported by Cursor.
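The gaps and ratios in the table are easy to check by hand, but a short sketch makes them explicit. The figures are copied from the table above; the variable names are just for illustration.

```python
# (Composer 2.5, Claude Opus 4.7) benchmark scores in percent, from the table.
scores = {
    "SWE-Bench Multilingual": (79.8, 80.5),
    "CursorBench v3.1": (63.2, 61.6),
    "Terminal-Bench 2.0": (69.3, 69.4),
}

# (Composer 2.5, Claude Opus 4.7) standard-tier prices in USD per million tokens.
prices = {"input": (0.50, 5.00), "output": (2.50, 25.00)}

# Positive gap = Composer 2.5 ahead, negative = Opus 4.7 ahead (percentage points).
gaps = {name: round(composer - opus, 1) for name, (composer, opus) in scores.items()}

# How many times more expensive Opus 4.7 is per token at the standard tier.
ratios = {kind: opus / composer for kind, (composer, opus) in prices.items()}

print(gaps)    # {'SWE-Bench Multilingual': -0.7, 'CursorBench v3.1': 1.6, 'Terminal-Bench 2.0': -0.1}
print(ratios)  # {'input': 10.0, 'output': 10.0}
```

Both external benchmarks land within one point of Opus 4.7, and the 10× ratio holds for input and output alike at the standard tier.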
Source spread
- cursor.com/blog/composer-2-5 — builder. Primary source; benchmark tables, pricing, post-training methodology.
- The New Stack — builder. Independent coverage of the compute split and training pipeline details.
- DataCamp — builder. Clean benchmark comparison table with Opus 4.7 and GPT-5.5 context.
- ChatForest — skeptic. Flags CursorBench as first-party and notes where Opus 4.7 retains the edge.
Pros & cons
What's real:
- Three benchmarks, two of them external. On the external pair (SWE-Bench Multilingual and Terminal-Bench 2.0), Composer 2.5 is essentially tied with Opus 4.7. On Cursor's own CursorBench v3.1, it leads.
- Standard tier is genuinely 10× cheaper on input: $0.50/M vs $5.00/M. The output ratio is the same: $2.50/M vs $25.00/M.
- The jump from Composer 2 to 2.5 on Terminal-Bench — from 61.7% to 69.3% — is a 7.6-point gain in a single version. That's large.
- Open-weight base means Cursor is no longer entirely dependent on Anthropic or OpenAI for model supply. That's a different kind of infrastructure resilience.
What deserves skepticism:
- CursorBench v3.1 is Cursor's benchmark. The company built it, runs it, and reports the scores. Not invalid — but treat it as supplementary to the external evals.
- Fast tier pricing ($3.00/M input, $15.00/M output) is not 10× cheaper. Against Opus 4.7's $5.00/$25.00 it works out to roughly 1.7× cheaper, which puts it in premium-model territory. "10× cheaper" only holds at the standard tier.
- "Improved communication style and effort calibration" is a quality claim that doesn't appear in any benchmark. I'll believe it when I use it.
- Composer 3 will ship. Four models in seven months. Factor in migration costs if you build tightly against this API version.
- Standard tier ($0.50/$2.50 per million tokens) is the 10× cheaper option. Fast tier ($3.00/$15.00) is for latency-sensitive interactive use and is not 10× cheaper vs Opus 4.7.
- SWE-Bench Multilingual (79.8%) and Terminal-Bench 2.0 (69.3%) are the external benchmarks to trust. CursorBench v3.1 is Cursor's own metric; treat it accordingly.
- If you're routing batch coding tasks to Opus 4.7 via API, run an A/B eval against Composer 2.5 standard tier. The cost difference is significant enough to justify the test.
- The Composer 2 technical report remains the best documentation of Cursor's methodology. Watch for a Composer 2.5 version.
- Build with migration in mind. Four models in seven months means locking tightly to a specific Composer version carries real costs.
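For the A/B eval suggested above, per-token price is only half the picture: what matters is cost per solved task. A minimal sketch follows, assuming you have measured token usage and solve counts for each model on your own task set. The prices come from the figures quoted above; the workload size, the `Tier` type, and the function names are hypothetical placeholders, not measurements or a real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


# Prices as quoted in this article.
STANDARD = Tier("Composer 2.5 standard", 0.50, 2.50)
FAST = Tier("Composer 2.5 fast", 3.00, 15.00)
OPUS = Tier("Claude Opus 4.7", 5.00, 25.00)


def run_cost(tier: Tier, input_tokens: int, output_tokens: int) -> float:
    """Total USD cost of a run at the tier's per-million-token prices."""
    total = input_tokens * tier.input_per_m + output_tokens * tier.output_per_m
    return total / 1_000_000


def cost_per_solved(tier: Tier, input_tokens: int, output_tokens: int, solved: int) -> float:
    """USD per solved task; infinite if the run solved nothing."""
    if solved == 0:
        return float("inf")
    return run_cost(tier, input_tokens, output_tokens) / solved


# Hypothetical batch: 40M input tokens and 8M output tokens per run.
# Replace these with the usage you actually measure for each model.
IN_TOK, OUT_TOK = 40_000_000, 8_000_000

costs = {tier.name: run_cost(tier, IN_TOK, OUT_TOK) for tier in (STANDARD, FAST, OPUS)}
print(costs)
# {'Composer 2.5 standard': 40.0, 'Composer 2.5 fast': 240.0, 'Claude Opus 4.7': 400.0}
```

With identical token usage the standard tier comes out 10× cheaper than Opus 4.7 and the fast tier about 1.7× cheaper. If one model needs more tokens per task or solves fewer tasks, pass its own measured numbers to `cost_per_solved` and compare those figures instead of list prices.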
Further reading
- Introducing Composer 2.5 — Cursor — primary source, benchmark tables and methodology
- Cursor bets on cheaper coding with Composer 2.5 and Kimi K2.5 — The New Stack — independent coverage of the compute split and training pipeline
- Composer 2.5: benchmarks, pricing, and how it compares — DataCamp — clean comparison tables with Opus 4.7 and GPT-5.5
- Technical report: Composer 2 — Cursor — prior methodology context for evaluating the technique claims