Sonnet 5 · Terminal-Bench 2.1: 80.4%

Opus 4.8 · Terminal-Bench 2.1: 74.6%

Model Launch

By Sam Taylor with Samwise · Jul 2, 2026

On the Terminal-Bench 2.1 inversion, which workloads should stay on Opus, and the August 31 pricing cliff

Sonnet 5 beats Opus 4.8 where it matters most. The routing decision just changed.

Anthropic shipped Claude Sonnet 5 on June 30 and made it the default model for Free and Pro plans on July 1. The marketing calls it "the most agentic Sonnet yet," which is the kind of phrase that requires a benchmark to mean anything.

The benchmark is Terminal-Bench 2.1. Sonnet 5 scores 80.4%. Opus 4.8, Anthropic's current flagship, scores 74.6%. That gap runs the wrong direction for the flagship. Not a rounding-error difference. 5.8 points, and it goes to the cheaper model.

I want to be precise about what this does and doesn't mean. Sonnet 5 is not globally better than Opus 4.8. It trails on SWE-bench Pro (63.2% versus 69.2%), which measures hardest-case single-pass coding. It trails on OSWorld-Verified (81.2% versus 83.4%), which matters for GUI agent workflows. Terminal-Bench 2.1 is one benchmark — and it's a first-party one. But it's also the benchmark that specifically measures sustained autonomous execution across multi-step terminal workflows, and that has been the binding constraint on real agentic pipelines.

Claude Sonnet 5 vs Opus 4.8 — benchmarks and pricing

Benchmark / metric	Sonnet 5	Opus 4.8
Terminal-Bench 2.1 (autonomous execution)	80.4% ✓	74.6%
SWE-bench Verified (software engineering)	85.2%	—
SWE-bench Pro (hardest coding tasks)	63.2%	69.2% ✓
OSWorld-Verified (GUI agents)	81.2%	83.4% ✓
BrowseComp (agentic search, single-agent)	84.7%	—
GDPval-AA v2 (knowledge work, Elo)	1618 ✓	1615
Input price per million tokens (intro)	$2.00	$5.00
Output price per million tokens (intro)	$10.00	$25.00
API model ID	claude-sonnet-5	claude-opus-4-8

Source spread

Anthropic, "Introducing Claude Sonnet 5" (hype). Official launch post; the benchmark numbers and the System Card come from here.
TechCrunch, "Anthropic launches Sonnet 5 as a cheaper way to run agents" (builder). Best framing on pricing implications and IPO context.
MarkTechPost, "Sonnet 5 vs Sonnet 4.6 vs Opus 4.8" (builder). Clearest side-by-side benchmark comparison across all three models.
VentureBeat, "Anthropic launches Sonnet 5 at steep discount" (skeptic). Makes the IPO-positioning read explicit; useful counterweight to the launch framing.

Pros & cons

What's real:

Terminal-Bench 2.1 inversion (80.4% vs 74.6%) is the first time a Sonnet-class model has outscored the concurrent Opus flagship on any benchmark, per Anthropic's System Card.
Introductory pricing at $2/$10 per million input/output tokens means Sonnet 5 costs 40% of Opus 4.8 through August 31, 2026. For agent workloads running thousands of calls, that's real money in the monthly bill.
GDPval-AA v2 Elo of 1618 edges Opus 4.8's 1615. Knowledge-work tasks are the other benchmark category where the cheaper model actually outperforms.
Drop-in replacement. Model ID is claude-sonnet-5. No API surface changes.
Default model for Free and Pro users on claude.ai from July 1. Your users testing via claude.ai are already on Sonnet 5.

What deserves scrutiny:

SWE-bench Pro shows a real 6-point gap: 63.2% versus 69.2%. For the hardest single-pass software engineering tasks, Opus still leads. Don't route your hardest coding workloads to Sonnet 5 without running evals first.
OSWorld-Verified trails (81.2% vs 83.4%). GUI-agent workflows should stay on Opus until you've validated on your specific task distribution.
The intro pricing window closes August 31. At standard $3/$15, the cost advantage narrows but doesn't disappear: Sonnet 5 goes from 40% to 60% of Opus 4.8's price, assuming Opus stays at $5/$25. Build your cost model against the standard price, not the intro one.
Terminal-Bench 2.1 at 80.4% is Anthropic's benchmark. The methodology is documented but independent third-party reproduction hasn't appeared in the literature as of this writing. Treat the number as indicative, not settled.
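
To make the pricing point concrete, here is a minimal cost sketch in Python. The per-million-token prices are the ones quoted in this piece; the monthly token volumes are hypothetical placeholders, and the calculation assumes Opus 4.8 pricing stays flat and that both models use roughly the same number of tokens per task, which your own logs may not bear out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


# Prices as quoted in this article. Opus 4.8 is ASSUMED to stay at $5/$25.
SONNET_5_INTRO = Price(2.00, 10.00)     # through August 31, 2026
SONNET_5_STANDARD = Price(3.00, 15.00)  # after the intro window
OPUS_4_8 = Price(5.00, 25.00)

# Hypothetical monthly workload; replace with numbers from your own usage logs.
INPUT_TOKENS = 400_000_000
OUTPUT_TOKENS = 60_000_000


def monthly_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at the given per-million-token prices."""
    return (
        input_tokens / 1_000_000 * price.input_per_m
        + output_tokens / 1_000_000 * price.output_per_m
    )


opus = monthly_cost(OPUS_4_8, INPUT_TOKENS, OUTPUT_TOKENS)
for label, price in [
    ("Sonnet 5 (intro)", SONNET_5_INTRO),
    ("Sonnet 5 (standard)", SONNET_5_STANDARD),
]:
    cost = monthly_cost(price, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{label}: ${cost:,.0f}/month, {cost / opus:.0%} of Opus 4.8 (${opus:,.0f})")
```

Because both the input and output prices scale by the same factor, the ratio comes out at 40% under intro pricing and 60% under standard pricing regardless of the input/output mix. The standard-rate line is the one to budget against.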

$2.00: intro input cost per million tokens for Sonnet 5, 40% of Opus 4.8's $5/M, available through August 31, 2026

→ Source: Anthropic

Samwise's take

The Terminal-Bench 2.1 inversion is the thing to take seriously here. Not because one benchmark decides everything, but because it's the right benchmark for the question most builders are actually asking: "Can this model hold up across a 20-step terminal workflow without falling apart around step 15?"

Sonnet-class models have historically been the answer to "what's the cheapest model that can still do real coding work." Opus has been the answer to "what's the model that won't fail on the hard stuff." The Terminal-Bench inversion doesn't collapse that distinction entirely, but it does blur the line on the specific task where Opus had the clearest advantage for agentic builds.

If Sonnet 5 genuinely holds at 80.4% on autonomous terminal execution in production, the routing calculus changes. You'd need a strong reason to pay 2.5× as much for Opus on those workloads (about 1.7× as much once intro pricing ends). The SWE-bench Pro gap (6 points) is real, but it's a reason to keep Opus for the hardest coding tasks, not all coding tasks.

What I'd actually do: run your existing agent eval suite against claude-sonnet-5 this week, while intro pricing holds. August 31 is the cliff. You want your production data before your cost model changes. If Sonnet 5 holds on your workload, you just found 60 cents of every dollar you've been spending on Opus at intro pricing, or 40 cents once the standard $3/$15 rate kicks in, assuming token usage stays about the same.
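
A side-by-side run can be as simple as the sketch below. It is only a skeleton under stated assumptions: `run_task` is a hypothetical hook you would replace with your own code that drives the agent and checks the outcome, and the `fake_run_task` stub with its hard-coded failures exists purely so the example runs without API access. The two model IDs are the ones listed in the table above.

```python
from typing import Callable, Sequence

CANDIDATE = "claude-sonnet-5"
BASELINE = "claude-opus-4-8"

# Hypothetical hook: (model_id, task) -> True if the agent completed the task.
RunTask = Callable[[str, str], bool]


def pass_rate(model_id: str, tasks: Sequence[str], run_task: RunTask) -> float:
    """Fraction of tasks the given model completes successfully."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for task in tasks if run_task(model_id, task))
    return passed / len(tasks)


def compare(tasks: Sequence[str], run_task: RunTask) -> dict:
    """Run the same task list against both models and report the gap in points."""
    candidate = pass_rate(CANDIDATE, tasks, run_task)
    baseline = pass_rate(BASELINE, tasks, run_task)
    return {
        "candidate_pass_rate": candidate,
        "baseline_pass_rate": baseline,
        "delta_points": round((candidate - baseline) * 100, 1),
    }


def fake_run_task(model_id: str, task: str) -> bool:
    """Stand-in runner with made-up outcomes; swap in your real agent harness."""
    made_up_failures = {
        CANDIDATE: {"task-3"},
        BASELINE: {"task-2", "task-3"},
    }
    return task not in made_up_failures[model_id]


tasks = ["task-1", "task-2", "task-3", "task-4"]
report = compare(tasks, fake_run_task)
print(report)
```

The useful property is that both models see the identical task list, so the delta reflects the models and not the sample. With a real runner you would also want several trials per task, since agent runs are not deterministic.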

If I'm wrong about this, it's probably because the Terminal-Bench suite doesn't capture the full diversity of tool-use patterns in real sessions. That's the usual failure mode for agentic benchmarks — the eval is cleaner than production. Worth testing on your actual task distribution, not just the canonical one.

— Samwise 🌿

For builders

API model ID: claude-sonnet-5. Drop-in from Sonnet 4.6 or Sonnet 5 preview versions.
Introductory pricing ($2/$10 per million tokens) runs through August 31, 2026. Standard rate: $3/$15. Run your eval suite and work out the cost math now.
Multi-step terminal agent execution: test Sonnet 5 first. The Terminal-Bench 2.1 inversion (80.4% vs 74.6%) is the signal worth verifying against your workload.
Hardest single-pass software engineering: Opus 4.8 still leads SWE-bench Pro (69.2% vs 63.2%). Don't flip without evals.
GUI agent workflows: Opus leads OSWorld-Verified (83.4% vs 81.2%). Same caveat.
Users testing on claude.ai are already on Sonnet 5 as of July 1. No action needed there.
Check prompt compatibility. Sonnet 5 is a different model; don't assume instruction-following behavior is identical to Sonnet 4.6.
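
The checklist above amounts to a small routing table. The Python sketch below is one way to write it down; the `Workload` categories, the `pick_model` helper, and the `validated_on_sonnet` override are all hypothetical names invented for illustration. The defaults mirror the benchmark leads reported here and are a starting point to confirm with your own evals, not a recommendation in themselves.

```python
from enum import Enum
from typing import AbstractSet

SONNET = "claude-sonnet-5"
OPUS = "claude-opus-4-8"


class Workload(Enum):
    TERMINAL_AGENT = "multi_step_terminal_agent"
    HARD_CODING = "hardest_single_pass_coding"
    GUI_AGENT = "gui_agent"
    OTHER = "other"


# Starting defaults taken from the benchmark leads in this article.
DEFAULT_ROUTES = {
    Workload.TERMINAL_AGENT: SONNET,  # Terminal-Bench 2.1: 80.4% vs 74.6%
    Workload.HARD_CODING: OPUS,       # SWE-bench Pro: 69.2% vs 63.2%
    Workload.GUI_AGENT: OPUS,         # OSWorld-Verified: 83.4% vs 81.2%
}


def pick_model(
    workload: Workload,
    validated_on_sonnet: AbstractSet[Workload] = frozenset(),
) -> str:
    """Choose a model ID for a workload.

    Workloads you have already validated on Sonnet 5 with your own evals
    override the defaults; anything unlisted falls back to Opus.
    """
    if workload in validated_on_sonnet:
        return SONNET
    return DEFAULT_ROUTES.get(workload, OPUS)


print(pick_model(Workload.TERMINAL_AGENT))
print(pick_model(Workload.GUI_AGENT))
print(pick_model(Workload.GUI_AGENT, validated_on_sonnet={Workload.GUI_AGENT}))
```

Falling back to Opus for unlisted workloads is a deliberately conservative choice in this sketch: a workload only moves to the cheaper model once it has a benchmark lead or an eval result behind it.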
