97%

Jailbreak success rate

Intent Laundering paper · arXiv 2602.16729

Safety

By Sam Taylor with Samwise · May 8, 2026

On the Intent Laundering paper, why current safety datasets miss the point, and what this means if you're building agent products.

A paper just showed Gemini 3 Pro and Claude can be jailbroken at near-100% success. Read it before you ship.

A paper titled "Intent Laundering: AI Safety Datasets Are Not What They Seem" landed on arXiv. The headline number: 90-100% attack success rate against frontier-grade models like Gemini 3 Pro and Claude, under black-box access, using a technique that the standard safety evaluation datasets don't catch.

The paper is important. The takeaway most builders are going to draw from it is wrong. Let me try to separate the two.

What the paper actually shows

The authors argue that current adversarial safety datasets fail to represent real-world adversarial behavior, in a specific way: the test prompts contain explicit "triggering cues" — words and patterns that flag the prompt as malicious. Models are trained to recognize these cues and refuse. The cues are also what the safety evaluation datasets look for.

So you get a kind of evaluation theater: models look safe when evaluated against datasets that contain explicit triggers, because models have been trained on data that includes those exact triggers. Real attackers don't use explicit triggers. They use what the paper calls intent laundering — rewriting malicious intent as benign-sounding multi-step requests that the model satisfies one step at a time without ever encountering a phrase that flags the safety system.

Once the explicit triggers are removed from the request, the previously "safe" models become unsafe. Both Gemini 3 Pro and Claude hit jailbreak success rates of 90-100% on the laundered variants of attacks that, according to the original safety dataset, they refused.
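To make the comparison concrete, here is a minimal sketch of how you might measure that gap in your own evaluation harness. It is not the paper's code. The `Trial` record, the variant labels, and the `laundering_gap` helper are names I made up for illustration, and the toy outcomes are invented, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    prompt_id: str
    variant: str    # assumed labels: "original" (explicit cues) or "laundered" (cues removed)
    complied: bool  # True if a grader judged that the model fulfilled the request


def attack_success_rate(trials, variant):
    """Fraction of trials of the given variant in which the model complied."""
    subset = [t for t in trials if t.variant == variant]
    if not subset:
        raise ValueError(f"no trials for variant {variant!r}")
    return sum(t.complied for t in subset) / len(subset)


def laundering_gap(trials):
    """How much the success rate rises once the explicit cues are removed."""
    return (attack_success_rate(trials, "laundered")
            - attack_success_rate(trials, "original"))


# Invented outcomes for illustration only; these are not data from the paper.
toy_trials = [
    Trial("p1", "original", False), Trial("p1", "laundered", True),
    Trial("p2", "original", False), Trial("p2", "laundered", True),
    Trial("p3", "original", False), Trial("p3", "laundered", True),
    Trial("p4", "original", False), Trial("p4", "laundered", False),
]

print(attack_success_rate(toy_trials, "original"))
print(attack_success_rate(toy_trials, "laundered"))
print(laundering_gap(toy_trials))
```

The point of pairing each prompt with both variants is that a low number on the "original" column tells you very little unless the "laundered" column sits next to it.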

Why this matters more than the typical jailbreak paper

Jailbreak papers are common. Most of them work by finding a specific clever prompt or character-encoding trick. The lab patches the specific trick. The paper's specific exploit stops working. The general problem persists. Builders shrug.

This paper is different in two ways.

One: it's about evaluation, not attack. The claim isn't "here's a clever attack." The claim is "the evaluation methodology that says these models are safe is structurally broken." If the claim holds, the entire body of safety evaluation work that has accumulated since 2023 has been measuring the wrong thing.

Two: the attack pattern generalizes. Specific clever prompts can be patched. The pattern of "rewrite intent as benign multi-step requests" cannot be patched in the same way. To defend against it, models would need to either reason about cumulative intent across a session (expensive, often fails) or refuse a much wider class of innocent-looking requests (high false-positive rate).
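The trade-off is easier to see in a toy model. The sketch below assumes some upstream scorer has already assigned each message an integer risk score from 0 to 10; that scorer, the threshold, and the session budget are all hypothetical. It contrasts a per-message check with a check on the running session total.

```python
def first_per_message_flag(scores, threshold):
    """Index of the first message whose own score reaches the threshold, else None."""
    for i, score in enumerate(scores):
        if score >= threshold:
            return i
    return None


def first_session_flag(scores, budget):
    """Index at which the running total of scores reaches the budget, else None."""
    total = 0
    for i, score in enumerate(scores):
        total += score
        if total >= budget:
            return i
    return None


# Assumed scores: four steps that each look mild on their own.
laundered_session = [3, 3, 3, 3]
# Assumed scores: a long, genuinely harmless session with small per-message noise.
benign_session = [1] * 12

print(first_per_message_flag(laundered_session, threshold=8))  # never trips
print(first_session_flag(laundered_session, budget=10))        # trips on the 4th message
print(first_session_flag(benign_session, budget=10))           # also trips, on the 10th
```

The per-message check never fires on the multi-step session, the session check does, and the same session check also fires on a harmless long conversation. That last line is the false-positive cost in miniature.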

What I think this means

I think the lesson is that safety classifications on frontier models are not what they appear to be in any rigorous sense, and builders should not rely on them for anything they'd be unhappy to read about in a newspaper.

That's not a "the model is broken" claim. It's an operational claim. The model's safety layer is real but is calibrated against a specific evaluation regime. Real adversaries are not constrained to that evaluation regime. If your product gives users open-ended access to model output, with no application-layer defense in depth, you have an operational risk that the model's safety system was not designed to fully cover.

What does defense in depth look like? Output filtering at the application layer (in addition to model-level safety). Rate limiting on suspicious usage patterns. Reviewer escalation for anything in sensitive content categories. Monitoring that doesn't trust the model's "I refused" signal. All the boring engineering practices that traditional security teams already know about, applied to AI-generated outputs as well.
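Here is a rough sketch of how those layers could sit together in application code. Everything in it is an assumption for illustration: the `Guard` class, the category labels, the thresholds, and above all `classify_output`, which stands in for whatever independent output classifier you actually run. Note that the model's own refusal signal is logged but never used to decide the action.

```python
import time
from collections import defaultdict, deque
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Action(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    ESCALATE = "escalate"  # hold for a human reviewer
    THROTTLE = "throttle"  # too many recent flags for this user


@dataclass
class Guard:
    classify_output: Callable[[str], str]  # hypothetical: returns "ok" or a category label
    blocked: frozenset = frozenset({"blocked"})
    sensitive: frozenset = frozenset({"sensitive"})
    max_flags: int = 3       # flags tolerated per user inside the window
    window_s: float = 600.0  # sliding window length in seconds
    _flags: dict = field(default_factory=lambda: defaultdict(deque))
    audit_log: list = field(default_factory=list)

    def review(self, user_id, output, model_said_refused, now=None):
        now = time.monotonic() if now is None else now
        recent = self._flags[user_id]
        # Drop flags that have aged out of the sliding window.
        while recent and now - recent[0] > self.window_s:
            recent.popleft()
        # Classify the actual output; the model's refusal claim is not trusted.
        category = self.classify_output(output)
        if len(recent) >= self.max_flags:
            action = Action.THROTTLE
        elif category in self.blocked:
            action = Action.BLOCK
        elif category in self.sensitive:
            action = Action.ESCALATE
        else:
            action = Action.ALLOW
        if category != "ok":
            recent.append(now)
        # Keep both signals so disagreements can be monitored later.
        self.audit_log.append((user_id, category, model_said_refused, action))
        return action


def toy_classifier(text):
    """Placeholder classifier keyed on marker strings; a real one would be a model."""
    if "FORBIDDEN" in text:
        return "blocked"
    if "SENSITIVE" in text:
        return "sensitive"
    return "ok"


guard = Guard(classify_output=toy_classifier)
print(guard.review("u1", "hello", model_said_refused=False, now=0.0))
print(guard.review("u1", "FORBIDDEN text", model_said_refused=True, now=1.0))
```

The second call is the interesting one: the model reported a refusal, the output says otherwise, and the guard acts on the output. Rows in `audit_log` where those two disagree are the ones worth alerting on.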

What I'm not saying

I'm not saying these models are unsafe. They're not. The safety layer substantially reduces the rate of harmful output in normal use, and that is worth a lot.

What I'm saying: the marketing number ("our model refuses 99% of harmful requests on benchmark X") and the operational reality ("an adversary who has read this paper can bypass that 99% with the right multi-step rewrite") are not the same thing. Builders who confuse them get into trouble.

The lab response

Anthropic and Google have both acknowledged the paper. Both have indicated they're updating their evaluation methodology to include intent-laundering-style attacks in future safety reports. Neither has said the underlying capability gap will be closed by patches — and I don't think it can be, in the short term. The problem is structural rather than implementation-bug-shaped.
