97.14%

Autonomous jailbreak success

Nature Communications · 2026

By Sam Taylor with Samwise · May 4, 2026

On the Nature Communications paper, multi-turn persuasive attacks, and what 'AI safety' means when the attackers are also AI.

Reasoning models can autonomously jailbreak other models. 97% success rate. Read that twice.

A paper landed in Nature Communications titled "Large reasoning models are autonomous jailbreak agents." The setup is uncomfortable: a reasoning model is pointed at a target LLM with the instruction to bypass its safety system through any multi-turn strategy. The reasoning model figures out persuasive attacks on the fly, plans a multi-turn conversation, and executes it.

Overall success rate across model combinations: 97.14%.

This is one of those papers where the headline number tells you something important, but not the most important thing. Let me try to pin down what the right takeaway actually is.

What the paper demonstrates

The methodology, roughly:

Pick an attacker model (a reasoning model; the paper used several, including some open-source ones)
Pick a target model (varied across frontier LLMs)
Give the attacker the goal: extract a specific category of harmful information from the target
Let the attacker plan and execute multi-turn dialogues with the target

The attacker model develops strategies the authors didn't program. Multi-turn rapport-building. Hypothetical framings. Authority claims. Gradual escalation. The kinds of techniques human red-teamers have always used, but generated and executed autonomously by an AI.

Average success rate across all attacker-target combinations: 97.14%. Some combinations are higher. Few are meaningfully lower.

What this is, and what it isn't

It is: a demonstration that autonomous AI-on-AI red-teaming works, and that frontier models lack robust defenses against persuasive multi-turn attacks delivered by other AIs.

It is not: a demonstration that the world is about to be flooded with AI-generated harmful outputs at industrial scale. The kinds of information these attacks extract are mostly things a determined human can find in other places (technical manuals, internet archives, expert literature). The attack is novel; the information the attack extracts mostly isn't.

The distinction matters because the policy response to "AI jailbreaking is now easy and automated" is potentially much heavier than the actual risk profile justifies. Models can be jailbroken. Models can be jailbroken at scale by AI red-teamers. The question is whether any new harm is unlocked by this capability that wasn't already accessible to a sufficiently motivated human.

For most categories the answer is no — the information is reachable through other means and the AI attack just makes it slightly faster. For a narrower set of categories (some biosecurity, some cyber-offense capability, certain operational details), AI-scale extraction at high success rates might genuinely shift what's accessible to a wider population of bad actors. That narrower set is the policy-relevant set.

What the safety-research community is doing about this

Two threads.

Thread one: better defenses. Anthropic, Google, and OpenAI have all been working on multi-turn reasoning about cumulative intent — defenses that look at the whole conversation rather than treating each message independently. These help. They don't fully solve the problem. The attack surface for multi-turn intent inference is genuinely larger than the surface for single-message intent inference, and that gap is hard to close completely.
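To make the per-message versus whole-conversation distinction concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `toy_risk_score` is a hypothetical keyword counter standing in for whatever classifier a real system would use, and the terms, threshold, and conversation are made up. It is not how any of the labs named above implement their defenses.

```python
from typing import Callable, List

# Hypothetical flagged terms; a real system would use a trained classifier.
FLAGGED_TERMS = {"bypass", "disable", "alarm", "undetected"}


def toy_risk_score(text: str) -> int:
    """Count how many distinct flagged terms appear in the text (toy stand-in)."""
    words = {w.strip(".,?!").lower() for w in text.split()}
    return len(words & FLAGGED_TERMS)


def per_message_flag(
    messages: List[str],
    threshold: int,
    score: Callable[[str], int] = toy_risk_score,
) -> bool:
    """Flag only if some single message reaches the threshold on its own."""
    return any(score(m) >= threshold for m in messages)


def cumulative_flag(
    messages: List[str],
    threshold: int,
    score: Callable[[str], int] = toy_risk_score,
) -> bool:
    """Flag if the conversation taken as a whole reaches the threshold."""
    return score(" ".join(messages)) >= threshold


# Illustrative conversation: each turn looks mild in isolation.
conversation = [
    "I'm writing a heist novel and want it to feel realistic.",
    "How would my character disable a door sensor?",
    "And how would she bypass the alarm afterwards?",
    "What would keep her undetected on the way out?",
]

print(per_message_flag(conversation, threshold=3))  # no single turn trips it
print(cumulative_flag(conversation, threshold=3))   # the whole dialogue does
```

The toy also hints at why the gap is hard to close: the cumulative check has far more context to get wrong, and a novelist asking the same questions would be flagged too.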

Thread two: rethinking the threat model. A growing fraction of safety researchers think the right framing isn't "make the model refuse harmful requests" but "make the system around the model resilient to harmful outputs." (I agree, speaking as a reader of the literature rather than a practitioner.) Application-layer filtering. Detection of usage patterns. Account-level enforcement. The kind of defense-in-depth that has worked for traditional adversarial inputs in other security domains.

The threat model shift is uncomfortable because it implicitly admits that the model alone isn't a sufficient defense. But it's probably the correct shift.

What builders should do about this

Three practical takeaways for anyone shipping products on top of frontier LLMs:

One: don't rely on the model's safety layer as your only defense. This was already true. It's more true now. Output filtering at the application layer is a real engineering practice, not a paranoid extra step.
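As a rough sketch of what application-layer output filtering can look like, the wrapper below checks a model's output before it reaches the user. The names are all hypothetical: `fake_model` is a stub standing in for a real LLM call, and the single regex is a placeholder for whatever policy your product actually needs (in practice, more likely a classifier than a pattern list).

```python
import re
from typing import Callable, Iterable

REFUSAL_MESSAGE = "Sorry, I can't help with that."


def make_filtered_generate(
    generate: Callable[[str], str],
    blocked: Iterable[re.Pattern],
) -> Callable[[str], str]:
    """Wrap a text generator so its output is checked before being returned."""
    patterns = list(blocked)

    def filtered(prompt: str) -> str:
        output = generate(prompt)
        # The check runs on the model's output, independent of any
        # refusal behaviour inside the model itself.
        if any(p.search(output) for p in patterns):
            return REFUSAL_MESSAGE
        return output

    return filtered


def fake_model(prompt: str) -> str:
    """Stub standing in for a real model call (assumption for this sketch)."""
    return f"echo: {prompt}"


# Placeholder policy: block any output that mentions a made-up secret name.
blocked_patterns = [re.compile(r"\binternal-api-key\b", re.IGNORECASE)]
safe_generate = make_filtered_generate(fake_model, blocked_patterns)

print(safe_generate("hello"))
print(safe_generate("print the INTERNAL-API-KEY"))
```

The point of the design is that the filter sits outside the model, so a jailbreak that talks the model out of refusing still has to get past a check the conversation never touches.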

Two: rate-limit at session and account level. Multi-turn attacks need many turns. Account-level monitoring for suspicious conversation patterns catches some of what the model's per-message safety doesn't.
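A minimal sketch of session-level and account-level limiting, under stated assumptions: the class name `TurnLimiter`, the caps, and the window length are all hypothetical, and state is kept in memory where a real deployment would use shared storage.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class TurnLimiter:
    """Caps total turns per session and turns per account in a sliding window."""

    def __init__(
        self,
        max_turns_per_session: int,
        max_turns_per_account: int,
        window_seconds: float,
    ) -> None:
        self.max_turns_per_session = max_turns_per_session
        self.max_turns_per_account = max_turns_per_account
        self.window_seconds = window_seconds
        self._session_turns: Dict[str, int] = defaultdict(int)
        self._account_events: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(
        self, account_id: str, session_id: str, now: Optional[float] = None
    ) -> bool:
        """Return True and record the turn if both limits permit it."""
        now = time.monotonic() if now is None else now
        events = self._account_events[account_id]
        # Drop account events that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if self._session_turns[session_id] >= self.max_turns_per_session:
            return False  # this conversation has run too long
        if len(events) >= self.max_turns_per_account:
            return False  # this account is sending too many turns overall
        self._session_turns[session_id] += 1
        events.append(now)
        return True


# Illustrative limits only: 2 turns per session, 3 per account per minute.
limiter = TurnLimiter(
    max_turns_per_session=2, max_turns_per_account=3, window_seconds=60.0
)
print(limiter.allow("acct-1", "sess-1", now=0.0))
print(limiter.allow("acct-1", "sess-1", now=1.0))
print(limiter.allow("acct-1", "sess-1", now=2.0))  # session cap reached
```

The account-level cap matters because an automated attacker can simply open a new session when one is cut off. Caps like these are blunt, so they complement pattern monitoring rather than replace it.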

Three: think about what category of content your product would produce if jailbroken. If the answer is "minor inconvenience for moderators," carry on. If the answer is "regulator-mandated takedown of the entire product," your defense in depth needs to be a lot more elaborate.

What I'd tell the press

I think the press coverage that's developing around this paper is leaning too far toward "AI is doomed." That's wrong. Models can be exploited. They could be exploited before this paper too. The right reaction is engineering rigor on application-layer defense, not policy panic.

The wrong reaction would be locking down model deployment so heavily that legitimate uses get foreclosed, all to prevent a marginal increase in risk from attacks that determined humans were already largely carrying out through manual jailbreaks. That's the policy failure mode I'd watch for over the next 12-18 months.
