In February, four labs had a serious reasoning model. In May, eleven do — and three are open-weight. If you're shopping for a model in production, the decision just got harder and cheaper at the same time.
Reasoning models got commoditized in 90 days. The buyer's market just got brutal.
I want to write something useful about model selection in 2026. The discourse has been pretty bad — every model launch generates a flurry of "best model ever" posts, and every reaction post argues a different model is actually best, and meanwhile the people who have to actually pick a model for production are getting whiplash. The real story isn't which model is best. It's that "best" stopped being the relevant question this quarter.
Here's the honest version of what just happened and what to actually do about it.
What's actually happening (the numbers)
The reasoning-model market consolidated and then exploded in two distinct phases over the last six months.
February 2026: four serious reasoning-mode models on the market — Claude Opus 4.5 (thinking mode), GPT-5 (reasoning), Gemini 2.5 Pro Thinking, and DeepSeek R1. Distinct quality tiers. Distinct pricing. Clear "premium" status for reasoning models versus their non-thinking siblings.
May 2026: eleven serious reasoning-mode models, including three open-weight (DeepSeek R2, Qwen 3 Reasoning, Llama 4 Reasoning), and the rest from Anthropic, OpenAI, Google, xAI, Mistral, Cohere, and AI21. The premium-tier framing is dissolving — reasoning is no longer a flagship feature, it's a checkbox.
Price-per-million-output-tokens dropped 6x from February to May for "good enough" reasoning capability. What cost $15/M in February costs $2.50/M now. What cost $75/M (premium tier) costs $15/M now. The high-end didn't get cheaper as dramatically; the middle dropped through the floor.
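As a quick check on that arithmetic, here is a minimal sketch that uses only the four list prices quoted above. The tier labels are mine, not an industry standard.

```python
# Output-token list prices in dollars per million tokens, as quoted above.
# The tier names are illustrative labels, not provider terminology.
february = {"good_enough": 15.00, "premium": 75.00}
may = {"good_enough": 2.50, "premium": 15.00}


def price_drop(before: float, after: float) -> float:
    """Return how many times cheaper `after` is than `before`."""
    return before / after


drops = {tier: price_drop(february[tier], may[tier]) for tier in february}

for tier, factor in drops.items():
    print(f"{tier}: {factor:.1f}x cheaper")
```

The middle tier comes out at 6x and the premium tier at 5x, which is the "high end moved less" point in numbers.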
Quality differences shrank, then clustered. On Humanity's Last Exam, the top four reasoning models in February spanned a 14-point range. The top seven in May span 4 points. Most of the gaps inside that cluster are comparable to the run-to-run variance of any single model: they may show up on a leaderboard, but they're practically invisible for most workloads.
The market structure shifted from "tiered by quality" to "tiered by shape." The shape questions — latency, cost, context window, structured-output reliability, tool-use behavior, multimodality — became the actual decision criteria in a single quarter.
Why the commoditization happened so fast
Three reasons, and they're each load-bearing.
Reason one: the technique got published. OpenAI's o1 launch in late 2024 made reasoning visible. By mid-2025 every major lab had a research path on it. By early 2026 the core technique — long chain-of-thought during inference, with reinforcement learning on chain quality — was well-documented enough that any lab with a competent post-training team could ship a viable version. The moat lasted about 18 months. That's short.
Reason two: open-weight caught up. DeepSeek R2 in March was the breakpoint. It was within 6 points of GPT-5 reasoning on most benchmarks, fully open-weight, and inference cost on commodity hardware was less than 1/10th the closed-source pricing. Once one open-weight reasoning model existed at near-frontier quality, the entire pricing strategy of the frontier labs had to recalibrate. Qwen 3 and Llama 4 reasoning followed within weeks.
Reason three: efficient inference techniques generalized. Speculative decoding, MoE routing for reasoning chains, and CoT-pruning techniques (cutting reasoning chains short when the model is confident) cut effective inference cost by another 2-3x across most providers between January and April. The cost reductions weren't from a single breakthrough; they were from compounding small improvements that the whole industry adopted in parallel.
The result is a market that, in May 2026, looks more like the late-stage cloud-compute market than the early-stage frontier-model market. Differentiation is on shape, not raw capability.
Who lost the most
The clearest losers in this transition are companies whose entire pricing strategy was based on reasoning being premium.
OpenAI's revenue mix took a hit because their reasoning-tier pricing was the highest-margin part of their stack and they had to drop it 60% to stay competitive on the API. Their consumer business is fine — ChatGPT is a brand, not a model — but their enterprise API margin compressed.
A second-order loss: AI startups that had built their valuation pitch around "we have privileged access to the best reasoning model." That pitch evaporates when six providers ship roughly equivalent reasoning capability. The differentiation has to be product, distribution, or data — same as any other software company. The "we're an AI startup" excuse for unclear positioning is dying faster than expected.
The third-order loss: middleware companies whose value proposition was "we route you to the best model." That value proposition only works if there's a clear best model. In a commoditized market, model routing is a margin-compression service, not a value-add. Several of these companies are pivoting hard.
The winners are anyone who built around the assumption that intelligence would become cheap and the differentiation would be in product/integration/distribution. That cohort is now correct, six months earlier than they thought.
What it means if you're picking a model for production
If you have a real workload that depends on a reasoning model in 2026, the decision criteria changed materially in the last quarter. Here's the buyer's checklist that actually matters now.
Criterion one: latency profile, not just speed
Reasoning models have a bimodal latency distribution that the non-thinking models don't. Average response time is one thing; p99 response time is often 4-8x average because the model decides to think harder on hard inputs. For a customer-facing application, p99 matters more than mean.
The May 2026 numbers:
- Claude Opus 4.7 thinking: mean ~2.4s, p99 ~14s
- GPT-5 reasoning: mean ~3.1s, p99 ~22s
- Gemini 3.1 thinking: mean ~2.0s, p99 ~9s
- DeepSeek R2 (self-hosted): mean ~3.8s, p99 ~28s
- Llama 4 Reasoning: mean ~3.2s, p99 ~18s
If your workload has a hard latency budget, Gemini 3.1 thinking is currently the best frontier model on p99. If you have flexibility on latency in exchange for cost, the open-weight options are dramatically cheaper to run yourself. The "best" model for your workload depends entirely on which side of that tradeoff you're on.
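The figures above are aggregate numbers, and your own traffic will have its own tail. A minimal sketch of checking p99 against a latency budget, assuming you have already logged per-request latencies in seconds. The function names and the nearest-rank percentile method are my choices here, not something any provider prescribes.

```python
import math
from statistics import mean


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of the data at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples: list[float], budget_s: float) -> dict:
    """Summarize mean, p99, the tail ratio, and whether p99 fits the budget."""
    avg = mean(samples)
    p99 = percentile(samples, 99)
    return {
        "mean_s": avg,
        "p99_s": p99,
        "tail_ratio": p99 / avg,  # how much worse the tail is than the average
        "within_budget": p99 <= budget_s,
    }


# Synthetic example: mostly fast responses plus two slow "thinking hard" ones.
samples = [1.0] * 98 + [10.0, 20.0]
summary = latency_summary(samples, budget_s=5.0)
print(summary)
```

A mean that looks comfortable can still hide a p99 that blows the budget, which is why the check is on the tail and not the average.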
Criterion two: structured-output reliability
This is where the spread is still real. Reasoning models vary substantially on how well they obey structured-output constraints (JSON mode, function calling, schema adherence). On a 1,000-call eval, schema adherence:
- Claude Opus 4.7: 99.4%
- GPT-5: 98.1%
- Gemini 3.1: 97.6%
- DeepSeek R2: 91.2%
- Llama 4 Reasoning: 88.4%
If your workload is "model produces structured output that feeds into deterministic code downstream," the difference between 99.4% and 88.4% is the difference between "production ready" and "needs a retry layer." Pay for the reliability if your downstream stack depends on it.
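For what a "retry layer" means in practice, here is a rough sketch. `call_model` is a hypothetical callable standing in for whatever client you use, and the two-field schema is invented for illustration. The residual-failure estimate assumes each attempt fails independently, which real models may not honor (a prompt that breaks the schema once can break it again).

```python
import json
from typing import Callable, Optional

# Hypothetical schema for illustration: field name -> required Python type.
REQUIRED_FIELDS = {"category": str, "confidence": float}


def parse_structured(raw: str) -> Optional[dict]:
    """Return the parsed object if it is valid JSON with the required fields, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None
    return data


def call_with_retries(call_model: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call the model until its output passes validation or attempts run out."""
    for _ in range(max_attempts):
        parsed = parse_structured(call_model(prompt))
        if parsed is not None:
            return parsed
    raise ValueError(f"no schema-valid output after {max_attempts} attempts")


def residual_failure_rate(adherence: float, max_attempts: int) -> float:
    """Chance that every attempt fails, ASSUMING attempts are independent."""
    return (1 - adherence) ** max_attempts
```

Under that independence assumption, three attempts at 88.4% adherence leave roughly 0.16% of tasks failing, which is better than a single shot at 99.4%. The catch is that every retry adds latency and tokens, so the cheaper model is paying for its reliability somewhere else.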
Criterion three: cost-per-real-task, not cost-per-token
Tokens-per-task vary wildly across reasoning models because reasoning length varies. A "cheap" model that uses 4x more reasoning tokens isn't cheap. Run your own evals on representative tasks. The right metric is "dollars per completed task on my workload," not "dollars per million tokens at list price." For most workloads in May 2026:
- Claude Opus 4.7 thinking: $0.08-0.14 per representative task
- GPT-5 reasoning: $0.06-0.11
- Gemini 3.1 thinking: $0.04-0.08
- DeepSeek R2 (hosted): $0.02-0.04
- DeepSeek R2 (self-hosted, amortized): $0.005-0.012
The spread between the most expensive frontier model and self-hosted open-weight is more than an order of magnitude (roughly 12-16x comparing like ends of the ranges), and that's the actual story.
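If you want to compute this for your own workload, a minimal sketch follows. It assumes you log token counts and a pass/fail flag per run, and that your provider bills reasoning tokens as output tokens (check your own invoice). The prices in the example are hypothetical, not anyone's list price.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    input_tokens: int
    output_tokens: int  # assumed to include reasoning tokens billed as output
    succeeded: bool


def cost_per_completed_task(
    runs: list[TaskRun],
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Total spend divided by successful tasks; failed runs still cost money."""
    total_dollars = sum(
        r.input_tokens * input_price_per_m + r.output_tokens * output_price_per_m
        for r in runs
    ) / 1_000_000
    completed = sum(1 for r in runs if r.succeeded)
    if completed == 0:
        raise ValueError("no completed tasks, cost per completed task is undefined")
    return total_dollars / completed


# Hypothetical comparison: a model at a quarter of the output price
# that spends four times the tokens per task.
pricey_terse = cost_per_completed_task([TaskRun(0, 4_000, True)], 0.0, 10.00)
cheap_verbose = cost_per_completed_task([TaskRun(0, 16_000, True)], 0.0, 2.50)
print(pricey_terse, cheap_verbose)
```

Both come out to $0.04 per task, which is the whole point: the list price tells you nothing until you multiply it by how much the model actually writes and divide by how often it succeeds.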
Criterion four: context window utilization, not size
Every major reasoning model now claims 1M+ context. The interesting question is "how well does it actually use 800k tokens of context for a real task?" The answer varies a lot. Recent needle-in-haystack and reasoning-over-context evals (May 2026):
- Claude Opus 4.7 (1M context): 94% retrieval accuracy at 800k
- Gemini 3.1 (2M context): 89% at 800k, 71% at 1.6M
- GPT-5 (1M context): 86% at 800k
- DeepSeek R2 (256k context): 92% at 200k
- Llama 4 Reasoning (1M context): 78% at 800k
If your workload depends on long-context reasoning (RAG over large corpora, long-form code review, multi-document synthesis), the spread here is much wider than the headline-benchmark spread. Don't pick on context-window size; pick on effective-context performance.
What to actually do
Three moves, applicable to anyone shipping a model-dependent feature in 2026.
Move one: run a real eval on your real workload, not a benchmark
The public benchmarks were always proxies. They're now bad proxies because the top of the distribution clusters too tightly. Spend two days building a 50-100 task eval set that reflects your actual workload — your actual prompts, your actual inputs, your actual definition of success. Run all 5-7 candidate models on that eval. The right answer is almost always different from the leaderboard answer.
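The harness for this does not need to be fancy. Here is a sketch of the shape, where each model is just a hypothetical callable that takes a prompt and returns text, and each task carries its own success check. None of these names come from a real library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalTask:
    prompt: str
    passes: Callable[[str], bool]  # your own definition of success for this task


def run_eval(
    models: dict[str, Callable[[str], str]],
    tasks: list[EvalTask],
) -> dict[str, float]:
    """Return the pass rate per model over the same task set."""
    if not tasks:
        raise ValueError("eval set is empty")
    scores = {}
    for name, call_model in models.items():
        passed = 0
        for task in tasks:
            try:
                if task.passes(call_model(task.prompt)):
                    passed += 1
            except Exception:
                # Timeouts and provider errors count as failures: they fail in production too.
                pass
        scores[name] = passed / len(tasks)
    return scores


# Stub models so the sketch runs without network access.
tasks = [
    EvalTask("What is 2 + 2?", lambda out: "4" in out),
    EvalTask("Capital of France?", lambda out: "Paris" in out),
]
models = {
    "model_a": lambda prompt: "4",
    "model_b": lambda prompt: "Paris, and also 4",
}
scores = run_eval(models, tasks)
print(scores)
```

In a real run, the callables would wrap provider clients, and you would log tokens and latency per call alongside the pass flag so the same pass gives you cost and p99 as well.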
Move two: design for substitution from day one
Pin a model in your code and you're committing to its pricing and capability trajectory. Build behind a thin abstraction layer that lets you swap providers in an afternoon. The cost of doing this now is low; the cost of not doing it is real once your workload scales. The Vercel AI SDK, OpenRouter, and a few similar tools make this nearly free. Use one.
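If you would rather roll the abstraction yourself, the idea fits in a few lines. This is a sketch of the pattern, not of any SDK's API: `ReasoningModel`, `StubProvider`, and the provider names are all made up, and a real adapter would wrap the vendor's client inside `complete`.

```python
from typing import Protocol


class ReasoningModel(Protocol):
    """The only surface the rest of the codebase is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class StubProvider:
    """Stand-in adapter; a real one would call a vendor SDK inside complete()."""

    def __init__(self, label: str) -> None:
        self.label = label

    def complete(self, prompt: str) -> str:
        return f"[{self.label}] answer to: {prompt}"


# Hypothetical registry; in practice the key would come from config, not code.
PROVIDERS: dict[str, ReasoningModel] = {
    "provider_a": StubProvider("provider_a"),
    "provider_b": StubProvider("provider_b"),
}


def get_model(name: str) -> ReasoningModel:
    """Look up a provider by its configured name."""
    try:
        return PROVIDERS[name]
    except KeyError:
        raise ValueError(f"unknown provider {name!r}; known: {sorted(PROVIDERS)}") from None


def summarize_ticket(ticket_text: str, model: ReasoningModel) -> str:
    """Application code sees the interface, never a vendor name."""
    return model.complete(f"Summarize this support ticket:\n{ticket_text}")


print(summarize_ticket("Login fails on mobile", get_model("provider_a")))
```

Swapping providers then means changing one config value and rerunning your eval, with no edits to application code.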
Move three: prepare for further price drops
The trajectory through end of 2026 is probably another 3-5x reduction in cost-per-reasoning-task on the frontier, and another 5-10x on open-weight. If you build a unit economics model assuming today's prices, you're under-projecting margin gains. If you build one assuming today's quality, you're under-projecting capability gains. Both are coming.
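To see what that does to a unit economics model, here is a back-of-the-envelope sketch. The 3-5x range is the frontier projection above; the $0.10 cost and $0.50 revenue per task are hypothetical placeholders to swap for your own numbers.

```python
def projected_cost_range(
    cost_today: float,
    min_reduction: float,
    max_reduction: float,
) -> tuple[float, float]:
    """Return (best case, worst case) cost per task after the projected price drop."""
    if not 1 <= min_reduction <= max_reduction:
        raise ValueError("expected 1 <= min_reduction <= max_reduction")
    return cost_today / max_reduction, cost_today / min_reduction


def gross_margin(revenue_per_task: float, cost_per_task: float) -> float:
    """Fraction of revenue left after paying for inference."""
    return (revenue_per_task - cost_per_task) / revenue_per_task


COST_TODAY = 0.10        # hypothetical dollars per task today
REVENUE_PER_TASK = 0.50  # hypothetical dollars earned per task

best_cost, worst_cost = projected_cost_range(COST_TODAY, 3, 5)
margin_today = gross_margin(REVENUE_PER_TASK, COST_TODAY)
margin_worst = gross_margin(REVENUE_PER_TASK, worst_cost)
margin_best = gross_margin(REVENUE_PER_TASK, best_cost)
print(f"today {margin_today:.0%}, projected {margin_worst:.0%} to {margin_best:.0%}")
```

With those placeholder numbers, margin moves from 80% to somewhere around 93-96%. The exact figures matter less than the habit of carrying a price-drop parameter in the model so it isn't frozen at today's rates.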
What I think the next 12 months actually look like
Three predictions, each with reasonable confidence:
- Reasoning becomes the default mode for serious work. "Reasoning mode" stops being a flag; it just becomes how the good models answer. Non-reasoning becomes the cheap-fast tier.
- Open-weight catches frontier on most tasks. By late 2026, the gap between best open-weight and best frontier closes to under 3 points on most evals for most workloads. Companies with self-hosting capability dominate the cost-per-task metric.
- Differentiation moves to specialization and product. General-purpose reasoning becomes a commodity. The next premium tier is specialized models: coding-specific, legal-specific, healthcare-specific, with fine-tuning and domain data as the moat. Whether this generates moats that hold remains an open question.
If I had to pick the single number that captures the situation: reasoning capability dropped from $15/M to $2.50/M in 90 days. That's not a market correction. That's commoditization. Plan accordingly.
What I'm not saying
I'm not saying "all models are the same now." On the high end, real differences still exist, especially in structured-output reliability, multimodal capability, and long-context performance. Pay attention to the differences that matter for your shape; ignore the ones that don't.
I'm also not saying "you should switch to the cheapest model." For most workloads in 2026 the cost difference matters less than the engineering cost of integration, the operational risk of switching, and the durability of the provider. Cheap-and-decent often beats expensive-and-best — but stable-and-decent usually beats cheap-and-decent in production. Pick the model that fits your shape, then renegotiate annually as prices keep dropping.
The reasoning-model market in May 2026 is what the database market was in 2015 — a lot of options, most of them good enough, with the decision driven by shape rather than capability. That's a much more boring market to read about, and it's a much better one to build on.
Further reading
- Artificial Analysis — May 2026 reasoning model leaderboard — current latency, cost, and capability data
- DeepSeek R2 technical report — the open-weight breakpoint
- Stratechery — when intelligence is cheap — market-structure implications
- Latent Space — model evaluation in production — the "build your own eval" thesis
- Vercel AI SDK provider docs — substitutable model integration