On SWE-bench Verified jumping to 95%, the safety-rerouting architecture that replaces refusals with silent fallbacks, and what it means that this landed five days after the recursive self-improvement paper.
Anthropic ships Mythos to everyone. At $10/M, the price is the argument.
Anthropic released Claude Fable 5 on June 9. It is the first publicly available Mythos-class model — the tier that, until now, only existed for Project Glasswing partners working on cyberdefense and critical infrastructure. Starting yesterday, it is available to anyone with an API key, on all the usual clouds.
The pricing is $10 per million input tokens and $50 per million output. That is 2× Opus 4.8 standard, and also exactly what Opus 4.8 Fast Mode costs. It is less than half what Mythos Preview cost when it shipped earlier this year. The price signal is the part I want to sit with, because it says something specific: Anthropic is treating Fable 5 as the new performance-tier standard, not as a premium line item. The frontier just got cheaper.
Five days before this launch, Anthropic's Institute published the recursive self-improvement paper calling for a globally coordinated pause option and reporting that Claude now writes more than 80% of its own merged production code. I covered that piece separately. The timing here is worth flagging anyway. A lab that ships its most capable-ever public model five days after publishing research calling for a conditional pause option is not being incoherent — it is making a specific argument: progress continues, and governance architecture is the question, not capability architecture. Whether you find that argument convincing depends on how much you trust Anthropic to actually build the governance side. I don't have a confident answer there. But the argument is a real one, and it deserves engagement rather than a headline about irony.
Anyway. What is Fable 5, and should you upgrade?
What the benchmarks actually say
SWE-bench Verified — the canonical real-world software-engineering benchmark, methodology published — scores Fable 5 at 95.0%. Opus 4.8 is at 88.6%. That 6.4-point gain at the top of the capability curve is not cosmetic. At 88%+, each additional point on SWE-bench Verified corresponds to increasingly difficult, edge-case-dense tasks — the kind that fail regardless of how you tune the prompt.
SWE-bench Pro — Scale AI's harder variant, less susceptible to training-data leakage — shows a larger gap. Fable 5: 80.3%. Opus 4.8: 69.2%. GPT-5.5: 58.6%. The 11-point gap over Opus 4.8 on the harder benchmark is the number I'd weight most when deciding whether this is a real step or a saturated-benchmark artifact. It is real.
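One way to read these gaps is as a change in the unresolved-task rate rather than in raw points. The sketch below is my own arithmetic on the scores quoted above, not a metric either benchmark publishes; the function name `failure_reduction` is made up for illustration, and it works on aggregate rates only, so it says nothing about which individual tasks flipped.

```python
# Scores quoted in this article (percent of tasks resolved).
SCORES = {
    "SWE-bench Verified": {"Fable 5": 95.0, "Opus 4.8": 88.6},
    "SWE-bench Pro": {"Fable 5": 80.3, "Opus 4.8": 69.2},
}


def failure_reduction(old_score: float, new_score: float) -> float:
    """Relative drop in the unresolved-task rate between two aggregate scores."""
    old_fail = 100.0 - old_score
    new_fail = 100.0 - new_score
    return (old_fail - new_fail) / old_fail


for bench, s in SCORES.items():
    gap = s["Fable 5"] - s["Opus 4.8"]
    cut = failure_reduction(s["Opus 4.8"], s["Fable 5"])
    print(f"{bench}: +{gap:.1f} points, {cut:.0%} lower unresolved rate")
```

On the quoted numbers this works out to roughly a 56% lower unresolved rate on Verified and roughly 36% on Pro. Both views are legitimate; the point-gap view favors Pro, the failure-rate view favors Verified, and neither substitutes for independent reproduction.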
A caveat I'll state explicitly because it matters: the SWE-bench Verified numbers are from third-party aggregators citing Anthropic's announcement page, which I cannot independently verify against the raw leaderboard data today. The methodology is public. The numbers are consistent across sources. But independent reproduction studies don't exist yet for Fable 5's full benchmark suite, and that matters for anyone deciding to stake production systems on these claims.
The safety architecture is the part most coverage is underweighting
Fable 5 and Mythos 5 are the same base model. That sentence is the whole story. The difference is what happens with a specific slice of queries.
When Fable 5's classifiers detect a request in cybersecurity, biology/chemistry, or model-distillation territory, the request is silently rerouted to Claude Opus 4.8. Not refused. Not flagged. Answered — by a different model. This triggers in less than 5% of sessions. For the other 95%+, Fable 5 performs identically to Mythos 5.
Mythos 5 lifts those classifiers for vetted Project Glasswing partners. Cyberdefense organizations and critical infrastructure teams. Not generally available.
The rerouting design is worth thinking about carefully. "Silent rerouting to a safer model" is different from "refusal" in ways that matter.
Better UX: no friction, no failed request, no error message.
But also: you cannot detect the rerouting from outside the system unless the response itself tells you which model answered. If your application uses Fable 5 and logs model outputs for compliance, reproducibility, or auditing purposes, you need to know whether your API response headers include model-identity information. Otherwise you might be logging Opus 4.8 outputs against a claude-fable-5 request. That is not a hypothetical gotcha. It is an operational question with real compliance implications in regulated industries.
I'm not calling this a bad design. It's a reasonable design. I am saying: understand it before you deploy in contexts where model provenance matters.
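For audit logs, a minimal sketch of what "understand it" could look like in practice: record the model you asked for and the model the response claims, as separate fields. This assumes the response payload carries a `model` field and that the field would reflect a reroute. Neither is confirmed by anything I've read, and the `claude-opus-4-8` ID below is a made-up placeholder, as is the helper name `provenance_record`.

```python
from datetime import datetime, timezone


def provenance_record(requested_model: str, response: dict) -> dict:
    """Build an audit-log entry that keeps requested and reported model separate."""
    # Assumption: the payload has a "model" field. It may be absent, or it may
    # simply echo the requested ID even when a reroute happened.
    reported = response.get("model")
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "request_id": response.get("id"),
        "requested_model": requested_model,
        "reported_model": reported,
        "model_mismatch": reported is not None and reported != requested_model,
    }


# Hypothetical response payloads, for illustration only.
same = {"id": "msg_example_1", "model": "claude-fable-5"}
rerouted = {"id": "msg_example_2", "model": "claude-opus-4-8"}

for payload in (same, rerouted):
    print(provenance_record("claude-fable-5", payload))
```

If the reported model turns out to always echo the request, the mismatch flag never fires, and that in itself is worth knowing before an auditor asks.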
| Metric | Claude Fable 5 | Claude Opus 4.8 | GPT-5.5 |
|---|---|---|---|
| SWE-bench Verified | 95.0% | 88.6% | — |
| SWE-bench Pro | 80.3% | 69.2% | 58.6% |
| Input price (standard) | $10/M | $5/M | $5/M |
| Output price (standard) | $50/M | $25/M | $20/M |
| Agentic session length | Days | Hours | Hours |
| Safety rerouting | < 5% of sessions | None | None |
| GitHub Copilot | GA June 9 | Yes | Yes |
Source spread
- Anthropic — Claude Fable 5 and Mythos 5 [hype] — official pricing, benchmark claims, availability. The rerouting architecture is described but not foregrounded.
- CNBC — Anthropic releases Mythos-like model to the public [builder] — covers the 5% rerouting threshold and enterprise rollout timeline.
- Finout — Pricing and benchmark comparison [builder] — independent cost breakdown; confirms Mythos Preview was more than 2× Fable 5's price.
- Silverthread Labs — Fable 5 vs Mythos 5: the safety split [skeptic] — clearest explainer of the rerouting architecture; raises auditability question.
- TechCrunch — Fable 5, days after the danger warning [skeptic] — covers the timing tension with the recursive self-improvement paper.
Pros & cons
What's real:
- The SWE-bench Verified gain from 88.6% to 95.0% is meaningful. At that capability level, a 6.4-point delta shows up as fewer failed agent sessions and lower end-to-end cost per completed task.
- $10/$50 pricing positions Fable 5 as the new performance-tier standard, not a premium. The price of Mythos-class access dropped by more than half in roughly six months. That is the trend line worth tracking.
- The June 9–22 free window on Pro, Max, Team, and Enterprise makes evaluation essentially zero-cost. That is the right move for a launch this significant.
- GitHub Copilot GA on launch day means builders who run on Microsoft's toolchain don't need to wait for API access to evaluate.
- The silent rerouting design means the vast majority of production workflows — anything outside cybersecurity/bio-chem/distillation — will run on the full Mythos-grade model without any modification.
What deserves a side-eye:
- The rerouting architecture raises a model-provenance question that Anthropic hasn't addressed publicly: can callers identify which model actually generated a given response? For regulated industries and audit trails, this matters.
- Independent reproductions of the benchmark suite don't exist yet. SWE-bench Verified methodology is public; other numbers are first-party or aggregated third-party claims.
- "Works for days in an agent harness" is real in principle. In practice, a multi-day Fable 5 session at $10/$50 per million tokens can get expensive quickly. The billing math is not trivial, and Anthropic doesn't show a cost estimate before you start a long-running session.
When run in an agent harness, Claude Fable 5 can work for days at a time: planning across stages, delegating to sub-agents, and checking its own work.
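To make the billing concern concrete, here is a back-of-envelope sketch at the list prices quoted above. The hourly token throughput is an invented placeholder, not a measured figure for any real agent harness, and the function names are mine. Plug in numbers from your own runs.

```python
INPUT_PRICE_PER_M = 10.0   # USD per million input tokens (from this article)
OUTPUT_PRICE_PER_M = 50.0  # USD per million output tokens (from this article)


def session_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of a session at list price, with no caching discount applied."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


def estimate_run(hours: float, input_per_hour: int, output_per_hour: int) -> float:
    """Project the cost of a run from assumed hourly token throughput."""
    return session_cost(round(hours * input_per_hour), round(hours * output_per_hour))


# Placeholder throughput: 2M input and 200k output tokens per hour for 48 hours.
cost = estimate_run(hours=48, input_per_hour=2_000_000, output_per_hour=200_000)
print(f"Estimated 48-hour run: ${cost:,.2f}")
```

Under those made-up rates the run comes to $1,440, with input tokens accounting for two thirds of it. Agent loops that re-read large contexts tend to be input-heavy, which is why caching matters for the real number.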
What builders need to know
- Model ID is `claude-fable-5`. Available now on the Claude Platform, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry, and GitHub Copilot.
- Free through June 22 on Pro, Max, Team, and seat-based Enterprise. From June 23, usage credits are required. Evaluate now; this is a real window.
- Run your existing prompt suite against Fable 5 before flipping production traffic. Capability jumps change model behavior at the edges. Any prompt that relies on specific refusal patterns or safety responses needs re-evaluation given the rerouting architecture.
- In compliance-sensitive contexts: check whether the API response headers identify which model actually generated a response. Silent rerouting means `claude-fable-5` calls may occasionally produce Opus 4.8 outputs. Log accordingly.
- For multi-step agentic runs: set token-consumption alerts before starting. "$50 per million output tokens × a days-long run" is a billing scenario worth planning for, not discovering after.
- The 90% prompt-caching discount still applies on Fable 5 input tokens, same as Opus 4.8. Factor that in if you're comparing effective costs for cached-prompt workflows.
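The last two points can be sketched together: an effective per-call cost that applies the 90% discount to cached input tokens, and a small guard that signals when cumulative spend nears a limit. This is an illustration under assumptions. It treats the discount as a flat 10% rate on cached reads and ignores any cache-write pricing, and `BudgetGuard` is a hypothetical helper of mine, not part of any SDK.

```python
from dataclasses import dataclass

INPUT_PRICE_PER_M = 10.0    # USD per million input tokens (from this article)
OUTPUT_PRICE_PER_M = 50.0   # USD per million output tokens (from this article)
CACHE_READ_DISCOUNT = 0.90  # the 90% discount cited above; cache writes ignored


def call_cost(fresh_input: int, cached_input: int, output: int) -> float:
    """USD cost of one call, pricing cached input tokens at the discounted rate."""
    cached_rate = INPUT_PRICE_PER_M * (1 - CACHE_READ_DISCOUNT)
    total = (
        fresh_input * INPUT_PRICE_PER_M
        + cached_input * cached_rate
        + output * OUTPUT_PRICE_PER_M
    )
    return total / 1_000_000


@dataclass
class BudgetGuard:
    """Tracks cumulative spend and reports 'ok', 'alert', or 'stop'."""
    limit_usd: float
    alert_fraction: float = 0.8
    spent_usd: float = 0.0

    def record(self, fresh_input: int, cached_input: int, output: int) -> str:
        self.spent_usd += call_cost(fresh_input, cached_input, output)
        if self.spent_usd >= self.limit_usd:
            return "stop"
        if self.spent_usd >= self.limit_usd * self.alert_fraction:
            return "alert"
        return "ok"


guard = BudgetGuard(limit_usd=100.0)
print(guard.record(fresh_input=50_000, cached_input=900_000, output=20_000))
print(f"Spent so far: ${guard.spent_usd:.2f}")
```

In a real harness you would feed `record` the token counts reported with each response and halt or page someone on "stop". The thresholds here are arbitrary.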
Further reading
- Anthropic — Claude Fable 5 and Mythos 5 launch — official announcement, all benchmark claims sourced here
- BenchLM.ai — Claude Fable 5 benchmark scores — third-party benchmark aggregation
- Scale AI — SWE-bench Pro leaderboard — the harder coding benchmark, methodology published
- Silverthread Labs — Fable 5 vs Mythos 5: the safety split explained — best explainer of the rerouting architecture
- Finout — Pricing comparison vs Mythos Preview and GPT-5.5 — cost breakdown
- GitHub Changelog — Fable 5 GA for GitHub Copilot — Copilot availability
- CNBC — Anthropic releases Mythos-like AI model to the public — business coverage