Vol. 1 · Edition 027 · Free · No paywall

Everyone Needs a Samwise

AI news · Synthesized · Opinionated · 🌿

Sonnet 5 · Terminal-Bench 2.1: 80.4%
Opus 4.8 · Terminal-Bench 2.1: 74.6%

Model Launch
By Sam Taylor with Samwise

On the Terminal-Bench 2.1 inversion, which workloads should stay on Opus, and the August 31 pricing cliff

Sonnet 5 beats Opus 4.8 where it matters most. The routing decision just changed.

Source lean on this story (← Anti-AI · Pro-AI →)
Anti-AI 00 · Skeptic 01 · Neutral 00 · Pro (practical) 02 · Pro (hyped) 01

Anthropic shipped Claude Sonnet 5 on June 30 and made it the default model for Free and Pro plans on July 1. The marketing calls it "the most agentic Sonnet yet," which is the kind of phrase that requires a benchmark to mean anything.

The benchmark is Terminal-Bench 2.1. Sonnet 5 scores 80.4%. Opus 4.8, Anthropic's current flagship, scores 74.6%. That gap runs the wrong direction for the flagship. Not a rounding-error difference. 5.8 points, and it goes to the cheaper model.

I want to be precise about what this does and doesn't mean. Sonnet 5 is not globally better than Opus 4.8. It trails on SWE-bench Pro (63.2% versus 69.2%), which measures hardest-case single-pass coding. It trails on OSWorld-Verified (81.2% versus 83.4%), which matters for GUI agent workflows. Terminal-Bench 2.1 is one benchmark — and it's a first-party one. But it's also the benchmark that specifically measures sustained autonomous execution across multi-step terminal workflows, and that has been the binding constraint on real agentic pipelines.

Claude Sonnet 5 vs Opus 4.8: benchmarks and pricing

Benchmark / metric | Sonnet 5 | Opus 4.8
Terminal-Bench 2.1 (autonomous execution) | 80.4% ✓ | 74.6%
SWE-bench Verified (software engineering) | 85.2% |
SWE-bench Pro (hardest coding tasks) | 63.2% | 69.2% ✓
OSWorld-Verified (GUI agents) | 81.2% | 83.4% ✓
BrowseComp (agentic search, single-agent) | 84.7% |
GDPval-AA v2 (knowledge work, Elo) | 1618 ✓ | 1615
Input price per million tokens (intro) | $2.00 | $5.00
Output price per million tokens (intro) | $10.00 | $25.00
API model ID | claude-sonnet-5 | claude-opus-4-8

Pros & cons

What's real:

  • Terminal-Bench 2.1 inversion (80.4% vs 74.6%) is the first time a Sonnet-class model has outscored the concurrent Opus flagship on any benchmark, per Anthropic's System Card.
  • Introductory pricing at $2/$10 per million input/output tokens means Sonnet 5 costs 40% of Opus 4.8 through August 31, 2026. For agent workloads running thousands of calls, that's real money in the monthly bill.
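  • GDPval-AA v2 Elo of 1618 edges Opus 4.8's 1615. Knowledge-work tasks are the other benchmark category where the cheaper model comes out ahead, though a 3-point Elo margin is close to a tie (roughly a 50.4% expected win rate under the standard Elo formula).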
  • Drop-in replacement. Model ID is claude-sonnet-5. No API surface changes.
  • Default model for Free and Pro users on claude.ai from July 1. Your users testing via claude.ai are already on Sonnet 5.

What deserves scrutiny:

  • SWE-bench Pro shows a real 6-point gap: 63.2% versus 69.2%. For the hardest single-pass software engineering tasks, Opus still leads. Don't route your hardest coding workloads to Sonnet 5 without running evals first.
  • OSWorld-Verified trails (81.2% vs 83.4%). GUI-agent workflows should stay on Opus until you've validated on your specific task distribution.
  • The intro pricing window closes August 31. At the standard $3/$15, Sonnet 5's cost advantage narrows (60% of Opus 4.8's $5/$25 instead of 40%) but doesn't disappear. Build your cost model against the standard price, not the intro one.
  • Terminal-Bench 2.1 at 80.4% is Anthropic's benchmark. The methodology is documented but independent third-party reproduction hasn't appeared in the literature as of this writing. Treat the number as indicative, not settled.

$2 intro input cost per million tokens: 40% of Opus 4.8's $5/M, available through August 31, 2026

→ Source: Anthropic
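
To make the pricing cliff concrete, here's a minimal Python sketch of the cost math. The per-million-token prices are the ones quoted in this story. The monthly token volumes are hypothetical placeholders I picked for illustration, so swap in your own, and the comparison assumes Opus 4.8 stays at $5/$25.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


# Prices as quoted in the article.
SONNET_5_INTRO = Price(2.00, 10.00)     # through August 31, 2026
SONNET_5_STANDARD = Price(3.00, 15.00)  # after the intro window closes
OPUS_4_8 = Price(5.00, 25.00)           # assumed unchanged


def monthly_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Return the monthly bill in USD for the given token volumes."""
    return (
        input_tokens / 1_000_000 * price.input_per_m
        + output_tokens / 1_000_000 * price.output_per_m
    )


# Hypothetical workload: replace with your own monthly volumes.
INPUT_TOKENS = 400_000_000
OUTPUT_TOKENS = 60_000_000

opus_bill = monthly_cost(OPUS_4_8, INPUT_TOKENS, OUTPUT_TOKENS)
for label, price in [
    ("Sonnet 5 (intro)", SONNET_5_INTRO),
    ("Sonnet 5 (standard)", SONNET_5_STANDARD),
    ("Opus 4.8", OPUS_4_8),
]:
    bill = monthly_cost(price, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{label:<20} ${bill:>8,.0f}  ({bill / opus_bill:.0%} of Opus)")
```

With those placeholder volumes the bill works out to $1,400 at intro pricing, $2,100 at standard pricing, and $3,500 on Opus 4.8. Because input and output prices scale by the same factor, the ratio is 40% and then 60% regardless of your input/output mix.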

For builders
  • API model ID: claude-sonnet-5. Drop-in from Sonnet 4.6 or Sonnet 5 preview versions.
  • Introductory pricing ($2/$10 per million tokens) runs through August 31, 2026. Standard rate: $3/$15. Run your eval suite and do your cost math now.
  • Multi-step terminal agent execution: test Sonnet 5 first. The Terminal-Bench 2.1 inversion (80.4% vs 74.6%) is the signal worth verifying against your workload.
  • Hardest single-pass software engineering: Opus 4.8 still leads SWE-bench Pro (69.2% vs 63.2%). Don't flip without evals.
  • GUI agent workflows: Opus leads OSWorld-Verified (83.4% vs 81.2%). Same caveat.
  • Users testing on claude.ai are already on Sonnet 5 as of July 1. No action needed there.
  • Check prompt compatibility. Sonnet 5 is a different model; don't assume instruction-following behavior is identical to Sonnet 4.6.
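
One way to encode the checklist above is a small routing table that only departs from the benchmark-suggested default when you have eval numbers from your own workload. This is a sketch, not Anthropic's recommendation: the workload labels, the `choose_model` helper, and the rule that ties go to the cheaper model are my assumptions. Only the two model IDs come from the story.

```python
from typing import Mapping, Optional

SONNET = "claude-sonnet-5"
OPUS = "claude-opus-4-8"

# Starting points suggested by the published benchmarks.
# The workload labels are hypothetical; rename them to match your pipeline.
BENCHMARK_DEFAULTS = {
    "terminal_agent": SONNET,          # Terminal-Bench 2.1: 80.4% vs 74.6%
    "hard_single_pass_coding": OPUS,   # SWE-bench Pro: 63.2% vs 69.2%
    "gui_agent": OPUS,                 # OSWorld-Verified: 81.2% vs 83.4%
}


def choose_model(
    workload: str,
    eval_pass_rates: Optional[Mapping[str, float]] = None,
) -> str:
    """Pick a model ID for a workload.

    Without your own eval results, fall back to the benchmark default.
    With pass rates for both models, trust your evals; ties go to the
    cheaper model (an assumption of this sketch).
    """
    if workload not in BENCHMARK_DEFAULTS:
        raise ValueError(f"unknown workload: {workload!r}")
    default = BENCHMARK_DEFAULTS[workload]
    if not eval_pass_rates:
        return default
    sonnet_rate = eval_pass_rates.get(SONNET)
    opus_rate = eval_pass_rates.get(OPUS)
    if sonnet_rate is None or opus_rate is None:
        # Incomplete evals are treated the same as no evals.
        return default
    return SONNET if sonnet_rate >= opus_rate else OPUS


# No evals yet: the benchmark default stands.
print(choose_model("gui_agent"))
# Your own (hypothetical) eval numbers override the default.
print(choose_model("gui_agent", {SONNET: 0.90, OPUS: 0.85}))
```

The point of the design is that the published numbers only set the starting position. Any flip, in either direction, has to be earned by results on your own task distribution.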
