On Zandieh et al.'s ICLR 2026 paper, the Red Hat AI / vLLM May 2026 accuracy study, and which configuration actually holds in production
Google's KV cache paper got one number wrong. The corrected version is still worth running.
Source stance spread (← Anti-AI · Pro-AI →): Anti-AI 0 · Skeptic 1 · Neutral 2 · Pro (practical) 2 · Pro (hyped) 0
The KV cache is boring right up until it isn't.
You're serving a 70B model at 32k context. Traffic spikes. The H100 allocation hits the ceiling. Someone on your team digs up a Google Research preprint from April 2025 (TurboQuant, arXiv:2504.19874, accepted at ICLR 2026) claiming 5× compression with near-zero accuracy loss at roughly 3 bits per coordinate. The authors are Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni across Google Research, NYU, and Google DeepMind. Open-source implementations are landing in llama.cpp, and vLLM PR #38479 is merged.
So you dig in. And the first thing you find is that the 3-bit claim is where the paper and production diverge.
What TurboQuant actually does
The algorithm combines two ideas. PolarQuant applies a random rotation to key-value vectors before scalar quantization. Quantized Johnson-Lindenstrauss — QJL — adds a 1-bit residual correction on top. Together: KV cache compressed to 3 bits per element versus the standard 16-bit FP16. The math is genuinely sound. The paper proves the method sits within 2.7× of the information-theoretic distortion minimum for any given bit budget. Near-optimal by a rigorous definition.
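To see why rotating before quantizing helps, here is a minimal sketch in plain Python. It is not the paper's implementation. Two things are swapped in and are assumptions of this example: a randomized Hadamard transform (random sign flips followed by a Walsh-Hadamard transform) stands in for the random rotation, and a uniform min-max quantizer stands in for the paper's codebooks. All function names are hypothetical.

```python
import math
import random

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    y = list(x)
    n = len(y)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]

def rotate(x, signs):
    """Random sign flips, then Hadamard: an orthogonal map that spreads outliers."""
    return fwht([v * s for v, s in zip(x, signs)])

def unrotate(y, signs):
    """Inverse of rotate(): the orthonormal Hadamard matrix is its own inverse."""
    return [v * s for v, s in zip(fwht(y), signs)]

def quantize(x, bits):
    """Uniform min-max scalar quantizer (assumed here, not the paper's codebook)."""
    levels = (1 << bits) - 1
    lo, hi = min(x), max(x)
    step = (hi - lo) / levels if hi > lo else 1.0
    return [round((v - lo) / step) for v in x], lo, step

def dequantize(codes, lo, step):
    return [lo + c * step for c in codes]

def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

def roundtrip_errors(x, bits, seed=0):
    """Return (MSE without rotation, MSE with rotation) for one vector."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in x]
    direct = dequantize(*quantize(x, bits))
    rotated = unrotate(dequantize(*quantize(rotate(x, signs), bits)), signs)
    return mse(x, direct), mse(x, rotated)

if __name__ == "__main__":
    rng = random.Random(1)
    vec = [rng.gauss(0.0, 1.0) for _ in range(128)]
    vec[7] = 50.0  # hypothetical outlier channel that stretches the quantizer range
    plain, rot = roundtrip_errors(vec, bits=4)
    print(f"4-bit MSE without rotation: {plain:.4f}")
    print(f"4-bit MSE with rotation:    {rot:.4f}")
```

Without rotation, one outlier sets the quantizer's range and every other coordinate lands on a coarse grid. The rotation smears that outlier across all coordinates, so the same bit budget covers a much narrower range. Because the rotation is orthogonal, the error measured after rotating back equals the error in the rotated space.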
The operational advantage is real too. Data-oblivious: no calibration dataset, no fine-tuning, no model-specific training. Load any GGUF-format transformer and the compression applies immediately.
The problem is the QJL residual step at low bit widths. At 3 bits, the residual correction introduces error that compounds across long attention sequences. Red Hat AI published a comprehensive evaluation on May 11 — Eldar Kurtić, Michael Goin, and Alexandre Marques — testing Llama-3.3-70B, Qwen3-30B-A3B, and MiniMax-M2.7 across five benchmarks: MRCR for long-context retrieval, AIME25 and GPQA for reasoning, MATH500, LiveCodeBench-v6.
The finding: 3-bit with QJL drops 15–25 points on reasoning at 128K+ context.
That's not "slightly worse." That's materially worse on the benchmarks where LLM-serving customers are increasingly running real workloads. Math tutoring. Code generation. Legal document analysis. These are reasoning tasks at long context. This is where the 3-bit claim breaks.
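For intuition about the QJL residual step discussed above, here is a minimal sketch of a 1-bit sign-sketch inner-product estimator in the Johnson-Lindenstrauss style. It is an illustration under stated assumptions, not the paper's or vLLM's code: the dense Gaussian projection, the stand-in coarse quantizer, and all function names are hypothetical. It also does not show that this is the mechanism behind the Red Hat result. It only shows that the correction is an estimate with noise.

```python
import math
import random

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def gaussian_matrix(m, d, rng):
    """m random projection directions in d dimensions (assumed dense Gaussian)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

def sign_sketch(S, k):
    """Keep one sign bit per projection of k, plus the norm of k."""
    bits = [1.0 if dot(row, k) >= 0.0 else -1.0 for row in S]
    return bits, math.sqrt(dot(k, k))

def estimate_inner_product(S, q, bits, k_norm):
    """Estimate <q, k> from the sign bits of k and the unquantized query q.

    For Gaussian rows s, E[sign(s.k) * (s.q)] = sqrt(2/pi) * <q, k> / |k|,
    so rescaling by sqrt(pi/2) * |k| gives an unbiased estimate.
    """
    acc = sum(b * dot(row, q) for row, b in zip(S, bits))
    return math.sqrt(math.pi / 2.0) * k_norm * acc / len(S)

if __name__ == "__main__":
    rng = random.Random(0)
    d, m = 16, 2048  # m is far larger than a real bit budget; chosen to make the demo stable
    k = [rng.gauss(0.0, 1.0) for _ in range(d)]
    q = [rng.gauss(0.0, 1.0) for _ in range(d)]

    k_coarse = [round(v * 2) / 2 for v in k]  # stand-in for the low-bit first stage
    residual = [a - b for a, b in zip(k, k_coarse)]

    S = gaussian_matrix(m, d, rng)
    bits, r_norm = sign_sketch(S, residual)
    corrected = dot(q, k_coarse) + estimate_inner_product(S, q, bits, r_norm)

    print(f"exact <q, k>:          {dot(q, k):.4f}")
    print(f"coarse only:           {dot(q, k_coarse):.4f}")
    print(f"coarse + 1-bit sketch: {corrected:.4f}")
```

The estimator is unbiased, but its spread shrinks only as one over the square root of the number of sign bits. The demo uses 2048 projections to keep the output steady. A real budget of about one extra bit per coordinate leaves far fewer, so each attention score carries more noise.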
What the production configuration actually is
The Red Hat recommendation: use 4-bit without QJL. Concretely, that is the _nc variant (norm correction, QJL disabled), with the first two and last two attention layers kept at full FP16 precision. This delivers 3–4× memory reduction, not 5×, with accuracy near full precision on both reasoning and long-context tasks.
| Configuration | Memory reduction | Reasoning accuracy at 128K+ | Recommendation |
|---|---|---|---|
| 3-bit with QJL | ~5× | 15–25 pt drop vs FP16 | Avoid for reasoning workloads |
| 4-bit without QJL (_nc) | ~3–4× | Near full precision | Production default |
| FP8 (baseline) | ~2× | Near full precision | If 2× compression is enough |
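The memory-reduction column can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a Llama-70B-style KV shape (80 layers, 8 KV heads, head dimension 128) and counts only the quantized values, ignoring per-vector norms, scales, and any residual bits, so real ratios will come out somewhat lower. The helper names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KVShape:
    num_layers: int
    num_kv_heads: int
    head_dim: int

def layer_plan(num_layers, bits, fp16_edge_layers=0):
    """Bit width per layer, keeping the first and last N layers at 16 bits."""
    plan = [bits] * num_layers
    for i in range(fp16_edge_layers):
        plan[i] = 16
        plan[num_layers - 1 - i] = 16
    return plan

def kv_bytes_per_token(shape, bits_per_layer):
    """Bytes of K and V stored per token for a given per-layer bit plan."""
    assert len(bits_per_layer) == shape.num_layers
    elems_per_layer = 2 * shape.num_kv_heads * shape.head_dim  # K and V
    return sum(elems_per_layer * bits / 8 for bits in bits_per_layer)

if __name__ == "__main__":
    shape = KVShape(num_layers=80, num_kv_heads=8, head_dim=128)  # assumed 70B-class shape
    context = 32 * 1024

    fp16 = kv_bytes_per_token(shape, layer_plan(shape.num_layers, 16))
    plans = {
        "3-bit, all layers": layer_plan(shape.num_layers, 3),
        "4-bit, 2+2 edge layers FP16": layer_plan(shape.num_layers, 4, fp16_edge_layers=2),
        "FP8, all layers": layer_plan(shape.num_layers, 8),
    }

    print(f"FP16: {fp16 * context / 2**30:.2f} GiB per 32K-token sequence")
    for name, plan in plans.items():
        size = kv_bytes_per_token(shape, plan)
        print(f"{name}: {size * context / 2**30:.2f} GiB, {fp16 / size:.2f}x smaller")
    # Ratios under these assumptions: 5.33x, 3.48x, 2.00x
```

Under these assumptions, keeping four of 80 layers at FP16 pulls a nominal 4× down to about 3.5×, which lines up with the 3–4× range in the table.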
The useful math: for a cluster serving 70B+ models at 32K+ context, the annual savings estimate at 4-bit runs roughly $267,840 per serving cluster. At ten clusters, that's about $2.7M/year from one configuration change and no retraining. The 5× paper number would be better. The 3–4× production number is still real money.
Source spread
- TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — arXiv [academic] The primary paper, ICLR 2026. Claims 5× compression near-zero loss at 3.5 bits. Information-theoretic proof is solid; production generalization is where the paper oversells.
- A First Comprehensive Study of TurboQuant — vLLM blog [builder] Red Hat AI, May 11, 2026. The most rigorous independent evaluation available. Three models, five benchmarks. Source of the 15–25 pt finding.
- TurboQuant: ~3-bit KV Cache with Near 0 Accuracy Loss? — kaitchup [skeptic] Early community evaluation flagging the QJL problem weeks before the Red Hat study. Accurate early read.
- OnlyTerp/turboquant — GitHub [builder] Open-source implementation with production configuration notes. First OSS release; the repo readme flags the 4-bit recommendation explicitly.
Pros & cons
What's real:
- 4-bit _nc achieves 3–4× memory reduction at accuracy near full precision. That is a genuine production improvement for high-throughput long-context serving.
- Data-oblivious compression with no calibration or training requirement. This separates TurboQuant from most quantization methods, which need model-specific calibration datasets and can drift with model updates.
- The mathematical foundation is rigorous. The information-theoretic bound is a real result, not marketing.
- Active integrations: llama.cpp work is in progress (discussion #20969), vLLM PR #38479 is merged, and an AMD ROCm variant exists. The ecosystem is building around 4-bit _nc configurations.
What deserves a side-eye:
- Google did not publish a comprehensive accuracy study with failure conditions alongside the ICLR paper. Only first-party benchmarks on shorter-context non-reasoning tasks. Red Hat did the work the paper should have included.
- The headline claim — near-zero accuracy loss at 3-bit — is technically true on the paper's specific test conditions, which happen to be the conditions where the failure mode doesn't surface. This is a common ML paper failure pattern, not unique to this group.
- The QJL residual correction, the clever part of the method, is also the part that breaks at low bit widths on reasoning tasks. That's an irony worth naming.
What builders need to know
- Use 4-bit _nc (norm correction, QJL disabled), not 3-bit. The 3-bit configuration drops 15–25 points on reasoning at 128K+ context, per the Red Hat AI / vLLM May 2026 evaluation.
- Keep the first two and last two attention layers at full precision. These layers carry disproportionate signal; quantizing them degrades the model more than quantizing any other layers.
- Sweet spot: 70B+ models at 32K–128K context length in high-throughput serving. Smaller models and shorter contexts change the calculus — run your own eval.
- If you only need 2× compression, FP8 is simpler and more battle-tested. TurboQuant is the right tool when >2× is the requirement.
- Active integrations to track: llama.cpp discussion #20969, vLLM PR #38479. Both are moving. AMD ROCm build exists via Pascal-SAPUI5/llama.cpp-turboquant.
- Do not trust "near-zero accuracy loss" from paper benchmarks alone. Run the vLLM evaluation suite against your workload. Reasoning tasks and long context are where the failure shows up.
Further reading
- TurboQuant: Online Vector Quantization — arXiv:2504.19874 — primary paper, ICLR 2026
- OpenReview — full ICLR 2026 proceedings version
- A First Comprehensive Study of TurboQuant — vLLM blog, May 11 2026 — the Red Hat AI evaluation; read this before deploying
- TurboQuant: ~3-bit KV Cache with Near 0 Accuracy Loss? — kaitchup — early community evaluation
- OnlyTerp/turboquant — GitHub — open-source implementation
- vLLM PR #38479 — merged vLLM integration
- Google TurboQuant cost analysis — Spheron — savings estimates per cluster