Tokens · GA · Vertex AI

Gemini 3.1 Pro · enterprise-grade long-context

Model Launch

By Sam Taylor with Samwise · May 5, 2026

On the 2M context number, when long context actually matters, and the document caching feature that's the real announcement.

Gemini 3.1 Pro is GA with a 2-million-token context. Most builders should ignore it.

Google made Gemini 3.1 Pro generally available on Vertex AI, bringing 2-million-token context to production workloads with new features including document-level caching and native video understanding. The press picked up the 2M context number. The 2M context number is the wrong thing to optimize for.

Let me explain.

Why 2M context is not the headline you think it is

A 2-million-token context window means you can stuff approximately 1.5 million words into a single prompt, using the usual rule of thumb of about 0.75 English words per token. That's roughly four copies of "Anna Karenina." Or a small slice of the Linux kernel source tree, which runs to tens of millions of lines. Or, for a lot of people, every email they've sent in the last decade.

Hold that in your head, then ask: when would I actually want to do that?

The honest answer for 95% of builders is: never. Models with very long contexts have well-documented degradation patterns: accuracy drops for material in the middle of the context, and the model attends more to the start and end of the prompt than to what sits between them. The practical advice for years has been to use retrieval (RAG) to pull the relevant pieces into a focused context rather than dumping everything in and hoping.

That advice doesn't change just because Google extended the context to 2M. The model can technically read 2M tokens. It is not reliably good at using all of them.

For most use cases, a 200k-token context window plus good retrieval beats a 2M-token context window without retrieval. The cases where 2M is genuinely better are narrow: certain whole-codebase analyses, certain long-document multi-pass reasoning, certain video understanding workflows. If you're not in one of those narrow categories, the 2M number is a marketing artifact, not a capability that helps you.
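To make "focused context" concrete, here is a minimal sketch of the retrieval idea: score chunks against the query and keep only the best ones that fit a token budget. Everything in it is an assumption for illustration. The function names are hypothetical, the scoring is plain keyword overlap (a real system would more likely use embeddings), and the token estimate just reuses the roughly 0.75 words-per-token ratio implied by 2M tokens being about 1.5M words.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase word split; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def estimate_tokens(text: str) -> int:
    """Rough token estimate, assuming about 0.75 words per token."""
    return round(len(text.split()) / 0.75)


def score(query: str, chunk: str) -> int:
    """Count occurrences of query terms in the chunk (keyword overlap)."""
    query_terms = set(tokenize(query))
    counts = Counter(tokenize(chunk))
    return sum(counts[term] for term in query_terms)


def build_context(query: str, chunks: list[str], token_budget: int) -> list[str]:
    """Keep the highest-scoring relevant chunks that fit in the budget."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    selected: list[str] = []
    used = 0
    for chunk in ranked:
        cost = estimate_tokens(chunk)
        # Skip chunks with no overlap, and chunks that would blow the budget.
        if score(query, chunk) == 0 or used + cost > token_budget:
            continue
        selected.append(chunk)
        used += cost
    return selected


if __name__ == "__main__":
    chunks = [
        "The termination clause allows either party to exit with 30 days notice.",
        "Payment is due within 45 days of invoice.",
        "The office kitchen is cleaned on Fridays.",
    ]
    print(build_context("termination notice", chunks, token_budget=20))
```

With those three chunks and a 20-token budget, only the termination clause is selected. That is the point of the pattern: the model sees the one passage that matters instead of the whole corpus.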

What the actual announcement is

The actual announcement, buried under the 2M-context headline, is document-level caching.

Document-level caching means you can pre-compute the model's representation of a large document once and reuse it across many subsequent queries. If you're running an application where users repeatedly query against the same large corpus — a legal team querying a contract, a research team querying a paper set, an analyst querying a financial filing — caching changes the economics dramatically. You pay the long-context cost once instead of every query.

That's a real builder feature. It's the kind of operational improvement that quietly unlocks new product categories. Caching turns "long context is impressive but too expensive" into "long context is impressive AND affordable per query." That's the unlock.

Cached contexts are priced meaningfully below uncached ones. By Google's numbers (verify against current pricing), caching can cut per-query token costs by 50-80% on cached prefixes. That's the part of the announcement to care about.
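Here is a back-of-the-envelope sketch of what that discount does to a repeated-query workload. The price per million tokens, the document size, and the query count are all made-up inputs, not Google's rates, and the model is deliberately simple: the first query pays full price for the document, later queries get the discount on the document prefix only, and cache storage fees and output tokens are ignored.

```python
def total_cost(
    doc_tokens: int,
    query_tokens: int,
    num_queries: int,
    price_per_million: float,
    cached_discount: float = 0.0,
) -> float:
    """Input-token cost of asking num_queries questions about one document.

    cached_discount is the fraction saved on the document prefix for every
    query after the first (0.0 means no caching). Hypothetical model only.
    """
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    per_token = price_per_million / 1_000_000
    # First query: nothing is cached yet, so everything is full price.
    first = (doc_tokens + query_tokens) * per_token
    # Later queries: discounted document prefix plus the full-price question.
    later_each = doc_tokens * per_token * (1 - cached_discount) + query_tokens * per_token
    return first + later_each * (num_queries - 1)


if __name__ == "__main__":
    # Assumed inputs: 1M-token document, 500-token questions, 100 queries,
    # and a placeholder price of $2.00 per million input tokens.
    for discount in (0.0, 0.5, 0.8):
        cost = total_cost(1_000_000, 500, 100, 2.0, cached_discount=discount)
        print(f"discount {discount:.0%}: ${cost:.2f}")
```

Under those assumptions the hundred queries cost about $200 uncached, about $101 at a 50% discount, and about $42 at 80%. The document dominates the bill, so the discount on the prefix is nearly the whole story once you run more than a handful of queries.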

Native video understanding

The other genuine feature: native video understanding. Gemini 3.1 Pro processes video as a first-class input rather than extracting frames and processing them as images. This sounds like a marginal improvement. In practice, it changes what video-comprehension applications are feasible.

If you've built anything that processes video (accessibility tools, content moderation, surveillance analysis, sports analytics), the extract-frames-then-process pipeline has often been the bottleneck. Native video input removes it. Whether your application benefits depends on whether you were CPU-bound on frame extraction or limited by the model's per-frame understanding. For many applications, frame extraction was the constraint, and those applications now look very different.

When you should use Gemini 3.1 Pro

If you're picking between Gemini 3.1 Pro and Claude Opus 4.7 right now, the practical tiebreakers I've watched teams settle on:

You're already deep in Google Cloud: Vertex AI integration is meaningfully better than the equivalent for Anthropic on Bedrock. Stay on Google.
You need video understanding: Gemini 3.1 Pro is the frontier choice. Anthropic's video story is weaker.
You need 1M+ context for a specific workload: Gemini wins. Anthropic's 1M context exists but the practical degradation curve favors Gemini at the long end.
You need the best general coding: Claude Opus 4.7 wins. The Cursor team chose Claude as default for a reason.
You need strong instruction following on multi-step agentic workflows: It's closer than a year ago but I'd still lean Claude.
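If you want those tiebreakers in a form a team can argue over, they fit in a small routing function. This is a sketch of my list, not a benchmark. The `Workload` fields and `pick_model` are hypothetical names, and the precedence (hard capability needs first, then workload strengths, then platform) is an assumption, since the list above does not rank the tiebreakers against each other.

```python
from dataclasses import dataclass

GEMINI = "Gemini 3.1 Pro"
CLAUDE = "Claude Opus 4.7"


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of what a team needs from a model."""
    on_google_cloud: bool = False
    needs_video: bool = False
    context_tokens: int = 0
    coding_heavy: bool = False
    agentic_multistep: bool = False


def pick_model(w: Workload) -> str:
    """Apply the tiebreakers in an assumed order of precedence."""
    # Hard capability needs first: video and 1M+ context point to Gemini.
    if w.needs_video or w.context_tokens >= 1_000_000:
        return GEMINI
    # Then workload strengths: coding and multi-step agents lean Claude.
    if w.coding_heavy or w.agentic_multistep:
        return CLAUDE
    # Finally the platform tiebreaker.
    if w.on_google_cloud:
        return GEMINI
    return "no clear tiebreaker; evaluate both on your own workload"


if __name__ == "__main__":
    print(pick_model(Workload(needs_video=True, coding_heavy=True)))
    print(pick_model(Workload(coding_heavy=True, on_google_cloud=True)))
```

Reorder the checks if your constraints differ. A team that cannot leave Google Cloud would move the platform check to the top.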

This is not a one-model-beats-all-models market in 2026. It hasn't been for a year. Pick by workload.
