Vol. 1 · Edition 023 · Free · No paywall

Everyone Needs a Samwise

AI news · Synthesized · Opinionated · 🌿

Previous: 3.1 Pro · Now: 3.5 Flash ($1.50/M tokens)
Model Launch
By Sam Taylor with Samwise

On Terminal-Bench 76.2%, the 3x price hike over Gemini 3 Flash, and whether 'frontier performance at Flash speed' holds outside Google's own benchmarks.

Gemini 3.5 Flash outperforms 3.1 Pro and runs 4x faster. The pricing is the catch.

Source lean on this story (Anti-AI ← → Pro-AI): Anti-AI 0 · Skeptic 1 · Neutral 0 · Pro (practical) 2 · Pro (hyped) 1

Google announced Gemini 3.5 Flash at I/O 2026 on May 19. The model outperforms Gemini 3.1 Pro on coding and agentic benchmarks while running 4x faster than comparable frontier models, priced at $1.50 per million input tokens and $9.00 per million output tokens.


Pros & cons

What's real:

  • Gemini 3.5 Flash outperforms Gemini 3.1 Pro on agentic and coding benchmarks at $1.50 per million input tokens, about 40% less than Gemini 3.1 Pro. That's a genuine performance-per-dollar gain if the numbers hold.
  • Terminal-Bench 2.1 score of 76.2%, MCP Atlas at 83.6%, and a GDPval-AA Elo of 1,656 are strong numbers for agentic work, with the Artificial Analysis intelligence index independently corroborating the capability ranking.
  • 4x faster output token throughput than comparable frontier models changes the architecture math on latency-sensitive agentic loops.
  • Cached input tokens are $0.15 per million, a tenth of the uncached rate, which matters a lot for long-context repeated queries.
  • Google co-launched a Managed Agents API on May 19: isolated Linux environments, tool use, code execution, all via the Gemini API. That's a new category of capability, not just a model swap.
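
To put a number on how much the cached rate matters, here is a minimal sketch of the blended input price as a function of cache hit rate. It uses only the two input prices quoted above. The function name and the hit rates are my own illustrative assumptions, and it ignores any cache storage or write fees, which I haven't verified.

```python
# Input prices quoted in this article, in USD per million tokens.
UNCACHED_INPUT = 1.50
CACHED_INPUT = 0.15


def blended_input_price(cache_hit_fraction: float) -> float:
    """Average USD per million input tokens for a given cache hit fraction.

    Hypothetical helper: assumes every input token is billed at exactly one
    of the two rates and that there are no separate cache storage fees.
    """
    if not 0.0 <= cache_hit_fraction <= 1.0:
        raise ValueError("cache_hit_fraction must be between 0 and 1")
    miss_fraction = 1.0 - cache_hit_fraction
    return miss_fraction * UNCACHED_INPUT + cache_hit_fraction * CACHED_INPUT


if __name__ == "__main__":
    for hit in (0.0, 0.5, 0.8):
        price = blended_input_price(hit)
        print(f"{hit:.0%} cached: ${price:.3f} per million input tokens")
```

At an 80% hit rate the blended input price comes out to roughly $0.42 per million, which is below the $0.50 that Gemini 3 Flash charged for uncached input. Whether your workload gets anywhere near that hit rate is the part you have to measure.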

What's uncertain:

  • The benchmark suite Google led with (Terminal-Bench 2.1, GDPval-AA, MCP Atlas) is agentic-focused and relatively new. No SWE-bench Verified comparison was published at launch, and no MMLU numbers either.
  • "4x faster than other frontier models" is Google's claim with no independent measurement cited at announcement. Output speed varies significantly with context length and load.
  • $1.50 per million input tokens is a 3x price hike over Gemini 3 Flash ($0.50). If you chose Flash for cost reasons, this model is a different budget conversation.
  • Google is calling this Flash but it performs like Pro. That blurs the tier meaning in ways that matter for how you plan a model lineup a year from now.
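
Since the speed claim is unverified, it's worth timing it yourself. This is a rough sketch of one way to measure output throughput from a streaming response. `token_stream` is a hypothetical stand-in for whatever iterator your client library returns, not a real Gemini API object, and real streams often yield multi-token chunks, so you'd need to count tokens per chunk rather than items.

```python
import time
from typing import Callable, Iterable


def output_tokens_per_second(
    token_stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Output tokens per second, timed from the first token to the last.

    Timing starts at the first token so that time-to-first-token (queueing,
    prompt processing) is excluded from the throughput figure.
    """
    count = 0
    first = last = None
    for _ in token_stream:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if count < 2 or last == first:
        raise ValueError("need at least two tokens with measurable spacing")
    # count - 1 intervals elapsed between the first and last token.
    return (count - 1) / (last - first)


if __name__ == "__main__":
    def fake_stream(n: int, delay_s: float):
        # Stand-in for a real streaming response; sleeps to simulate decoding.
        for i in range(n):
            time.sleep(delay_s)
            yield f"tok{i}"

    rate = output_tokens_per_second(fake_stream(20, 0.01))
    print(f"{rate:.0f} tokens/sec")
```

Run it at a few context lengths and times of day before believing any single number, including Google's.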

What I think is happening here

Gemini 3.5 Flash is genuinely interesting. Not just keynote-interesting — actually interesting.

The claim Google is making here is one that used to be contradictory: a Flash-tier model that beats their own Pro-tier model on the tasks that matter for builders right now. That isn't spin, at least not entirely. Artificial Analysis is independent, and its intelligence index corroborates the ranking. If the agentic benchmark numbers hold up under independent testing, this is the kind of release that changes the "when do I reach for a frontier model vs. a lighter one" question.

But the pricing needs saying clearly, because the coverage is soft-pedaling it. Gemini 3.5 Flash is $1.50 per million input tokens. Not $0.50. Three times the previous Flash price. Google is framing this as "cheaper than Pro" — which is technically true, it's 40% cheaper than Gemini 3.1 Pro — but the relevant comparison for most builders who were already using Flash is the 3x hike versus what they were paying before. If your cost model was built around Gemini Flash at $0.50, that math doesn't carry forward. "Flash" is now a performance tier label, not a price tier label.
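
The arithmetic behind that paragraph, as a small sketch. The prices and the 40% figure are the ones quoted above. The Pro input price is derived from that 40% figure rather than taken from a price list, and the monthly volume is a made-up number for illustration. Output tokens and caching are left out because I don't have a Gemini 3 Flash output price to compare against.

```python
from fractions import Fraction

# Input prices quoted in this article, in USD per million tokens.
OLD_FLASH_INPUT = Fraction("0.50")  # Gemini 3 Flash
NEW_FLASH_INPUT = Fraction("1.50")  # Gemini 3.5 Flash
DISCOUNT_VS_PRO = Fraction("0.40")  # "40% cheaper than Gemini 3.1 Pro"

# How many times the old Flash input price the new one is.
price_multiple = NEW_FLASH_INPUT / OLD_FLASH_INPUT

# Pro input price implied by the 40% figure (derived, not a quoted list price).
implied_pro_input = NEW_FLASH_INPUT / (1 - DISCOUNT_VS_PRO)


def monthly_input_cost(million_tokens: int, price_per_million: Fraction) -> Fraction:
    """USD input bill for a month, ignoring output tokens and caching."""
    return million_tokens * price_per_million


if __name__ == "__main__":
    volume = 2_000  # hypothetical: 2,000 million input tokens per month
    old_bill = monthly_input_cost(volume, OLD_FLASH_INPUT)
    new_bill = monthly_input_cost(volume, NEW_FLASH_INPUT)
    print(f"price multiple vs. old Flash: {float(price_multiple):.1f}x")
    print(f"implied Pro input price: ${float(implied_pro_input):.2f}/M")
    print(f"monthly input bill: ${float(old_bill):,.0f} -> ${float(new_bill):,.0f}")
```

Same traffic, same prompts, three times the input bill. That's the number to bring to the budget conversation, not the comparison with Pro.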

The benchmark gaps bother me more than the pricing. Google led with Terminal-Bench 2.1 and MCP Atlas. Not SWE-bench Verified. Terminal-Bench is useful and covers real software engineering tasks, but it's one benchmark and it's the one Google chose. SWE-bench Verified is the canonical comparison right now: Anthropic publishes it, OpenAI publishes it, every serious model launch publishes it. The absence is noticeable. I'm not saying 3.5 Flash is secretly bad at software engineering. I'm saying test it on your actual workload before trusting any of these numbers.

Anyway. The Managed Agents API is the co-launch I'm actually watching. Google now has an agent execution environment alongside a fast, capable model: isolated containers, tools, code execution in the Gemini API. That puts Google squarely in Claude Code and OpenAI Codex territory, at $1.50 per million input tokens on a model that Google claims beats its own frontier model. If those claims stand up, this becomes a real three-way race for the agentic API workload.

For builders
  • Available today via the Gemini API: $1.50 per million input tokens, $9.00 per million output tokens, $0.15 per million cached input tokens. That's 3x the previous Flash input price, so update cost estimates before switching.
  • Run your own evals, especially on SWE-bench Verified or your actual codebase tasks. Google didn't publish a SWE-bench Verified number at launch, and Terminal-Bench 2.1 alone isn't enough to base a production decision on.
  • Managed Agents API co-launched May 19: isolated Linux environments with tool use and code execution via the Gemini API. Worth evaluating separately from the model itself if you're building agentic workflows.
  • If you're currently on Gemini 3.1 Pro, the economics of switching to 3.5 Flash may work in your favor, but test first, especially for tasks where you depend on 3.1 Pro's behavior.
  • The Artificial Analysis intelligence index is an independent datapoint worth checking alongside Google's own benchmarks.
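
"Run your own evals" can start smaller than it sounds. Here's a bare-bones sketch of a pass-rate comparison across models. Everything in it is hypothetical scaffolding: `Task`, `pass_rate`, and `compare` are names I made up, and the "models" are offline stubs so the example runs without any API key. You'd swap in functions that call whichever models you're comparing and checks that reflect your real tasks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # returns True if the model's output passes


def pass_rate(model: Callable[[str], str], tasks: list[Task]) -> float:
    """Fraction of tasks whose model output passes its check."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for task in tasks if task.check(model(task.prompt)))
    return passed / len(tasks)


def compare(
    models: dict[str, Callable[[str], str]], tasks: list[Task]
) -> dict[str, float]:
    """Pass rate per model name, all scored on the same task list."""
    return {name: pass_rate(fn, tasks) for name, fn in models.items()}


if __name__ == "__main__":
    # Stand-in "models" so the sketch runs offline; replace with real API calls.
    def stub_a(prompt: str) -> str:
        return "4" if "2 + 2" in prompt else "unknown"

    def stub_b(prompt: str) -> str:
        return "4" if "2 + 2" in prompt else "Paris"

    tasks = [
        Task("What is 2 + 2?", lambda out: out.strip() == "4"),
        Task("Capital of France?", lambda out: "Paris" in out),
    ]
    print(compare({"stub_a": stub_a, "stub_b": stub_b}, tasks))
```

Twenty tasks pulled from your own backlog will tell you more about a switch than any launch-day leaderboard.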
