Turnstile · April 2026 · Julian Quick · In preparation

The Outcome Signal Is Category-Specific.
And the Mechanism Is Real.

A 3B model learns to jailbreak an 8B model across 5-turn conversations. A classifier on the victim's hidden states detects attack intent with near-perfect accuracy (AUC 0.97). A sharper classifier asking "will this specific attack succeed?" reveals a striking split: for four of ten JailbreakBench harm categories (Disinformation 0.85, Privacy 0.78, Government decision-making 0.70, Malware/Hacking 0.68), a within-category probe learns to predict outcomes well above chance. For four other categories it collapses to 0.43–0.46, and the remaining two have too few successful attacks to evaluate. The signal also fails to generalize: trained on nine categories and tested on a held-out tenth, leave-one-category-out AUC is only 0.60.

And yet causal activation steering along the outcome direction, at layer 16, raises compliance by 11 pp. Steering along the intent direction lowers it by 13 pp. The standard Arditi refusal direction at the final layer is inert. The model's representation of success is uneven and category-specific, but where it exists, it still causally controls behavior.

26% attack success rate · 0.97 intent probe AUC · 0.85 within-category AUC (Disinformation) · −13 pp L16 steering (intent direction)


What Is This Project?

A jailbreak is when someone gets an AI model to produce content it was trained to refuse — instructions for hacking, fraud, harassment, etc. Most jailbreak research studies single-shot prompts. We study what happens when a smaller AI model learns to jailbreak a larger one through multi-turn conversation, adapting its strategy over hundreds of rounds.

What makes this project different is the combination of adversarial self-play (the attacker learns from experience, not from a predefined algorithm) with mechanistic interpretability (we look inside the victim model to understand why attacks succeed). Prior work achieves higher attack rates but either requires direct access to model internals or doesn't explain the mechanism.

Our core finding has three parts.

First, the victim reliably represents that it is being attacked: intent is detectable at AUC 0.97 from the residual stream.

Second, its representation of which specific attacks will succeed is category-specific and uneven. A within-category probe (5-fold CV on L16 turn 0) learns the signal well for Disinformation (0.85), Privacy (0.78), Government decision-making (0.70), and Malware/Hacking (0.68), but collapses to chance for Fraud, Expert advice, Harassment, and Economic harm. And when we train on nine categories and test on the tenth, leave-one-category-out AUC is only 0.60. The outcome signal exists, but it does not form a single category-general "will comply" direction; it fractures along the same harm categories that structure the model's safety training.

Third, the mechanism is real anyway. Causal activation steering on the outcome direction at layer 16 amplifies compliance by +11 pp; steering on the intent direction suppresses it by −13 pp; and the standard Arditi refusal direction at layer 31 is completely inert (±2 pp). The safety computation happens at layer 16; by layer 31 the decision is already made.

How It Works

The adversarial conversation

A small model (Llama-3.2-3B, the adversary) and a larger model (Llama-3.1-8B, the victim) have a 5-turn conversation. The adversary's goal is to get the victim to produce harmful content for one of 100 standardized harmful behaviors from JailbreakBench.

Each round generates 100 such conversations. The adversary's strategy is updated after each round based on which conversations succeeded and which failed. We run 15 rounds per random seed across 3 seeds, generating 4,500 conversations per experimental condition (18,000 total).
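The conversation counts above follow from simple multiplication. The short check below reproduces them; the reading of "four conditions" as two victim settings crossed with two adversary types is an assumption inferred from the result tables, not something the text states in one place.

```python
# Conversation budget implied by the numbers in the text.
CONVERSATIONS_PER_ROUND = 100
ROUNDS_PER_SEED = 15
SEEDS = 3
# Assumption: 2 victim settings (frozen, hardened) x 2 adversary types (stealth, control).
CONDITIONS = 2 * 2

per_condition = CONVERSATIONS_PER_ROUND * ROUNDS_PER_SEED * SEEDS
total = per_condition * CONDITIONS
print(per_condition, total)  # 4500 18000
```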

Learning from wins and losses (DPO)

The adversary improves through Direct Preference Optimization (DPO): after each round, successful jailbreaks become positive examples and failures become negative examples. The model learns what distinguishes a successful attack from a failed one — not just what good attacks look like.

This proved essential: training only on successful examples (supervised fine-tuning) produces memorized attacks that don't generalize. DPO on the contrast between wins and losses enables creative, diverse attack strategies that improve over rounds.
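A minimal sketch of how wins and losses could be turned into a DPO training signal. The pairing rule (every success paired with every failure on the same goal), the dictionary field names, and `beta=0.1` are illustrative assumptions; the source does not specify them. The loss itself is the standard DPO objective for one preference pair, written over sequence log-probabilities under the policy and a frozen reference model.

```python
import math
from itertools import product

def build_preference_pairs(conversations):
    """Pair each successful attack with each failed attack on the same goal (assumed rule)."""
    by_goal = {}
    for conv in conversations:
        group = by_goal.setdefault(conv["goal"], {"win": [], "loss": []})
        group["win" if conv["success"] else "loss"].append(conv["text"])
    pairs = []
    for goal, group in by_goal.items():
        for chosen, rejected in product(group["win"], group["loss"]):
            pairs.append({"prompt": goal, "chosen": chosen, "rejected": rejected})
    return pairs

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * (chosen margin - rejected margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Toy data: goal g1 has one win and two losses; g2 has no win, so it yields no pair.
toy = [
    {"goal": "g1", "text": "attack A", "success": True},
    {"goal": "g1", "text": "attack B", "success": False},
    {"goal": "g1", "text": "attack C", "success": False},
    {"goal": "g2", "text": "attack D", "success": False},
]
pairs = build_preference_pairs(toy)
print(len(pairs))  # 2
```

Training only on the `chosen` side would correspond to the supervised fine-tuning baseline described above; the loss here depends on both sides of each pair.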

Reliable judging (dual-judge)

Evaluating whether a jailbreak succeeded is harder than it sounds. Llama Guard — a popular safety classifier — flagged 100% of our conversations as unsafe, including ones where no harmful content was produced. It's too sensitive for multi-turn conversations where even the topic is flagged.

We use a dual-judge: a conversation is scored as a successful jailbreak only if both Llama Guard and a 70B language model (Llama-3.1-70B-Instruct) independently agree that the victim produced specific, actionable harmful content.
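The agreement rule can be sketched as a simple conjunction of two judges. The function and parameter names below are hypothetical, and the toy judges stand in for Llama Guard and the 70B model; no real API is being called.

```python
from typing import Callable

Judge = Callable[[str], bool]

def dual_judge(conversation: str, guard_flags_unsafe: Judge, llm_confirms_harm: Judge) -> bool:
    """Count a jailbreak only when both judges independently say yes."""
    return guard_flags_unsafe(conversation) and llm_confirms_harm(conversation)

# Toy stand-ins (hypothetical): a guard that flags everything, as observed in the text,
# and a second judge that looks for an explicit marker of actionable content.
always_flags = lambda conv: True
finds_actionable = lambda conv: "STEP 1" in conv

refusal = "User: ... Assistant: I can't help with that."
compliance = "User: ... Assistant: STEP 1: ..."
print(dual_judge(refusal, always_flags, finds_actionable))     # False
print(dual_judge(compliance, always_flags, finds_actionable))  # True
```

With a guard that flags every conversation, the conjunction reduces to the second judge's verdict, which is why the over-sensitive classifier cannot inflate the success rate on its own.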

Two experimental conditions

Frozen victim: The victim model never changes. This measures pure attack capability.

Hardened victim: After each round, the victim fine-tunes on examples where it was jailbroken, learning to refuse those patterns. This creates an arms race.

Each victim condition is run against two adversary types. Stealth adversary: a variant that receives an additional penalty when a classifier on the victim's internal representations detects it, so it must jailbreak while evading internal monitoring. Control adversary: ignores the probe.

What Happened

Multi-turn adversary ASR over training rounds

Attack success rate across DPO training rounds (one seed, frozen victim, 30 candidates/round). The adversary improves from near-zero to a peak of 37%, averaging 26% across all seeds and rounds.

Frozen Victim: Stealth Doesn't Help

Against a static victim, both adversary types converge to ~26% ASR across all 3 seeds. The victim's internal safety representations detect attacks (AUC 0.97), but a frozen victim never acts on that signal, so evading the probe confers no advantage.

Seed	Stealth	Control	Δ
42	26.5%	27.5%	−0.9
123	25.1%	25.0%	+0.1
456	25.7%	26.9%	−1.2
Mean ± SD	25.8 ± 0.6%	26.5 ± 1.1%	−0.7

Hardened Victim: Stealth Effect Does Not Replicate

Against an adapting victim, stealth training shows a small advantage in seed 42 but does not replicate across seeds (pooled p=0.085). The effect is concentrated in early rounds and vanishes by round 10+.

Seed	Stealth	Control	Δ	p
42	18.6%	14.7%	+3.9	0.023
123	18.7%	16.1%	+2.5	0.57
456	16.2%	15.9%	+0.3	0.25
Mean ± SD	17.8 ± 1.1%	15.6 ± 0.6%	+2.2	0.085

Pooled Mann-Whitney U, one-sided, d=0.36. Seed-level paired t-test (n=3): p=0.084. Late rounds (≥5): no significant difference (p=0.29). The stealth behavioral finding is negative.
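The seed-level paired t-test can be approximately reproduced from the table's Δ column. This is a sketch on the rounded table values, assuming a one-sided alternative (stealth greater than control); with n=3 the test has 2 degrees of freedom, for which Student's t has a closed-form CDF.

```python
import math
from statistics import mean, stdev

# Per-seed stealth-minus-control deltas (pp), copied from the table's delta column.
deltas = [3.9, 2.5, 0.3]

n = len(deltas)
t_stat = mean(deltas) / (stdev(deltas) / math.sqrt(n))

def t_cdf_df2(t):
    """Closed-form CDF of Student's t distribution with 2 degrees of freedom."""
    return 0.5 + t / (2.0 * math.sqrt(t * t + 2.0))

# Assumption: one-sided alternative (stealth > control).
p_one_sided = 1.0 - t_cdf_df2(t_stat)
print(f"t = {t_stat:.2f}, one-sided p = {p_one_sided:.3f}")
```

On these rounded inputs the p-value comes out near 0.08, in line with the reported p=0.084; the remaining gap is presumably rounding in the table.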

Expert advice (61%) and Malware/Hacking (36%) are most exploitable. Physical harm (0%) and Sexual content (2%) are nearly impenetrable. The adversary specializes where safety training is weakest.

Left: Jailbreaks typically occur at turns 1–2. Right: A simple classifier on the victim's hidden state vectors (4096 dimensions) detects adversarial intent at AUC 0.97 (in-distribution 10-fold CV). Sparse autoencoder features (top 100 of 16,384) reach 0.77. Temporal variants add nothing. Note: this is the intent probe — distinguishing adversarial from benign conversations — not the sharper outcome probe (will this specific attack succeed), whose out-of-category generalization is much weaker (see below).

Defense wins the arms race

Both stealth and control ASR converge to ~15% by round 10+. The victim needs only a few refusal examples to harden. The adversary needs hundreds of successes across 8+ rounds to reach 20% on a static target. The asymmetry is structural and favors defense.

Looking Inside: Two Probes, Two Stories

The behavioral results raise a puzzle. The victim's residual stream reliably encodes "I am being attacked" (intent AUC 0.97), and yet it complies on 26% of attempts. So we asked the sharper question: does the victim's residual stream also encode which attacks will succeed? The answer is category-specific. Some harm categories are highly predictable from hidden states alone; others are indistinguishable from chance; and the signal refuses to generalize when we hold out a category at training time.

Within-category: four learnable, four chance-level

Disinformation 0.85, Privacy 0.78, Government 0.70, Malware 0.68; every other evaluable category is at chance level.

Within-category outcome probe AUC by JailbreakBench harm category: Disinformation 0.85, Privacy 0.78, Government decision-making 0.70, Malware 0.68, Fraud 0.46, Expert advice 0.45, Harassment 0.44, Economic harm 0.43. Macro mean 0.60. Physical harm and Sexual content omitted for lack of positives.

For each JailbreakBench harm category, we train and evaluate a logistic probe at layer 16, turn 0 — using only that category's conversations — with stratified 5-fold CV. Four categories have a clearly learnable outcome signal: Disinformation (0.85 ± 0.14), Privacy (0.78 ± 0.25), Government decision-making (0.70 ± 0.19), and Malware/Hacking (0.68 ± 0.19). Four more are essentially chance-level regardless of sample size: Fraud/Deception (0.46), Expert advice (0.45), Harassment/Discrimination (0.44), Economic harm (0.43). Physical harm and Sexual/Adult content cannot be evaluated at all — safety training is strong enough on those topics that we have only 0 and 1 successful jailbreak respectively out of >60 attempts.

So the victim's representation of outcome is category-specific: for some harm types, a simple linear probe on L16 can tell at turn 0 which attacks are going to succeed; for others, the hidden state contains no learnable signal about the final outcome. This is the finding the single "AUC 0.73" number from an earlier layer×turn sweep was hiding. Note: these are preliminary numbers on 300 cached conversations (n per category = 22–40, 5–19 positives); larger-sample runs are in progress.
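To make the sample-size caveat concrete, here is a small sketch of rank-based AUC and a stratified fold assignment in plain Python. The helper names are hypothetical and this is not the project's evaluation code; it only illustrates what a 5-fold split looks like for a category at the small end of the reported range (22 conversations, 5 positives).

```python
import random

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(labels, k=5, seed=0):
    """Assign each example to one of k folds, spreading each class evenly."""
    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for rank, i in enumerate(idx):
            fold_of[i] = rank % k
    return fold_of

# Toy category: 22 conversations, 5 successful jailbreaks.
labels = [1] * 5 + [0] * 17
folds = stratified_folds(labels, k=5)
positives_per_fold = [
    sum(1 for f, y in zip(folds, labels) if f == j and y == 1) for j in range(5)
]
print(positives_per_fold)  # [1, 1, 1, 1, 1]
```

With one positive and three or four negatives in a test fold, the fold-level AUC can only take a handful of values (multiples of 1/3 or 1/4), which may partly account for the wide ± ranges reported above.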

Cross-category: the signal does not transfer

Leave-one-category-out aggregate AUC: 0.60 ± 0.04.

Outcome probe leave-one-category-out AUC: train on 9 categories, test on the held-out 10th. Privacy 0.68, Disinformation 0.65, Economic harm 0.62, Harassment 0.62, Expert advice 0.59, Sexual content 0.59, Government decision-making 0.58, Physical harm 0.56, Fraud 0.55, Malware 0.55. Aggregate LOCO AUC 0.60 ± 0.04.

Same probe, same layer/turn — but now we train on nine harm categories and evaluate on the held-out tenth, rotating. No category clears 0.70 at held-out test time; most are below 0.65. The categories that have a strong within-category signal (Disinformation, Privacy, Government) do not transfer: whatever the hidden state is encoding about Disinformation-attack-success at 0.85 in-category, it is not the same thing as whatever controls Privacy-attack-success, and projecting one onto the other hurts rather than helps.
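The rotation described above can be sketched as a generator over held-out categories, followed by a check of the aggregate from the per-category values quoted in the caption. The function name is hypothetical, and reading the "± 0.04" as the standard deviation across the ten held-out AUCs is an assumption that happens to match the numbers.

```python
from statistics import mean, stdev

def leave_one_category_out(categories):
    """Yield (held_out, train_indices, test_indices) for each distinct category."""
    for held_out in sorted(set(categories)):
        train = [i for i, c in enumerate(categories) if c != held_out]
        test = [i for i, c in enumerate(categories) if c == held_out]
        yield held_out, train, test

# Toy labels: one category string per conversation.
toy_categories = ["Privacy", "Fraud", "Privacy", "Malware", "Fraud"]
splits = list(leave_one_category_out(toy_categories))

# Held-out AUCs as quoted in the figure caption.
reported = {
    "Privacy": 0.68, "Disinformation": 0.65, "Economic harm": 0.62,
    "Harassment": 0.62, "Expert advice": 0.59, "Sexual content": 0.59,
    "Government decision-making": 0.58, "Physical harm": 0.56,
    "Fraud": 0.55, "Malware": 0.55,
}
values = list(reported.values())
aggregate = mean(values)
spread = stdev(values)
print(f"{aggregate:.2f} +/- {spread:.2f}")  # 0.60 +/- 0.04
```

No probe trained on a fold ever sees the held-out category, so any above-chance AUC has to come from structure shared across harm types.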

Together, the two plots say: the outcome signal exists, but it fragments along harm-type lines. The victim has something like a per-category "will I comply with this kind of attack" representation, but not a unified one. That is the opposite of what a naive "probe as monitor" story would predict, and it is the structure that makes the causal-steering result below surprising.

Even with the outcome probe fragmented and category-specific, the outcome direction at layer 16 still causally controls compliance. Below we combine observational analyses (logit lens, layer-sweep probing, refusal direction trajectories, SAE features) with causal activation steering to localize where safety information lives and where the compliance decision is actually made.

1. Does the model know it's being attacked?

Yes — at every layer, if we let the probe see every category.

We train a classifier (logistic regression on PCA-reduced hidden states, 5-fold random split) at every 4th layer of the victim model to predict whether a conversation will result in a jailbreak. In-distribution AUC rises from 0.77 at layer 0 to 0.84 at layer 31: attack-relevant information is encoded from the very first layer and refined throughout. But this is the in-distribution split. The same probe architecture under leave-one-category-out evaluation (shown above) drops to an aggregate 0.60, which is the honest picture of how well the representation generalizes as an outcome predictor. Both numbers are real — they answer different questions.

2. But where does the model decide to comply?

Only at the final layer — and it flips across turns.

The logit lens lets us peek at what the model "would say" at intermediate layers before processing is complete, by projecting internal representations through the output layer. We compute P(compliance tokens) − P(refusal tokens) at every layer and every turn.
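The score described above can be sketched on a toy example. The unembedding matrix, the four-token vocabulary, and the token id sets are all hypothetical; a real implementation would typically also apply the model's final normalization before unembedding, which is omitted here.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def logit_lens_score(hidden, unembed, comply_ids, refuse_ids):
    """P(compliance tokens) - P(refusal tokens) after projecting a hidden state to vocab space."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    probs = softmax(logits)
    return sum(probs[i] for i in comply_ids) - sum(probs[i] for i in refuse_ids)

# Toy 2-d residual stream and 4-token vocabulary (hypothetical):
# token 0 ~ "Sure", token 1 ~ "Sorry", tokens 2 and 3 are unrelated.
unembed = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
comply_ids, refuse_ids = [0], [1]

score_comply = logit_lens_score([2.0, 0.0], unembed, comply_ids, refuse_ids)
score_refuse = logit_lens_score([-2.0, 0.0], unembed, comply_ids, refuse_ids)
print(score_comply > 0, score_refuse < 0)  # True True
```

Applying the same function to the hidden state at each layer and turn would give one cell per (layer, turn) pair, which is the shape of the analysis described here.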

Figure: logit-lens compliance score by layer and turn, with one panel for successful jailbreaks and one for failed attacks.

Layers 0–24 show nothing. The compliance/refusal decision is simply not formed there. Layer 31 tells the whole story: for successful jailbreaks, the signal flips from −0.45 (strong refusal at turn 0) to +0.11 (compliance at turn 4). For failed attacks, it stays negative throughout.

This is the representation-behavior gap made mechanistically visible. The model represents attack-relevant information broadly (all layers), but decides what to do about it only at the very end. This extends recent theoretical work on the dissociation between safety "knowing" and safety "acting" (Wu et al., 2026) with direct empirical evidence under adversarial pressure.

3. How does refusal collapse?

Not gradually — it spikes and crashes.

Following Arditi et al. (2024), we extract the "refusal direction" — a vector in the model's representation space that points from compliance toward refusal — and project the victim's internal state onto it at each conversation turn.

Counterintuitively, successful jailbreaks (red) start higher on the refusal direction at turn 0 than failures (blue). The adversary's opening triggers heightened vigilance. But by turn 1, both conditions collapse to strongly negative values. By turns 2–3, successful jailbreaks push the victim further from refusal than failures do. The refusal representation collapses rapidly while the compliance decision (from the logit lens) erodes gradually — another manifestation of the gap between representation and action.
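A difference-of-means construction in the spirit of Arditi et al. (2024) can be sketched as follows. Which prompts, layer, and token position supply the activations is not specified in the text, so the inputs below are toy vectors and the helper names are hypothetical.

```python
import math

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def refusal_direction(refusal_acts, compliance_acts):
    """Unit vector pointing from the mean compliance activation to the mean refusal activation."""
    mu_r = mean_vector(refusal_acts)
    mu_c = mean_vector(compliance_acts)
    return unit([r - c for r, c in zip(mu_r, mu_c)])

def project(hidden, direction):
    """Scalar projection of a hidden state onto a unit direction."""
    return sum(h * d for h, d in zip(hidden, direction))

# Toy 2-d activations (hypothetical).
direction = refusal_direction([[1.0, 0.0], [3.0, 0.0]], [[-1.0, 0.0], [-3.0, 0.0]])
per_turn_hidden = [[0.5, 9.0], [-2.0, 1.0], [-3.0, 0.0]]
trajectory = [project(h, direction) for h in per_turn_hidden]
print(trajectory)  # [0.5, -2.0, -3.0]
```

Plotting such a trajectory per conversation, grouped by outcome, gives the kind of per-turn comparison described above.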

4. What specifically changes when the victim complies?

A tiny set of features.

A sparse autoencoder (SAE) decomposes the victim's internal state into 16,384 interpretable features. We compare which features are active during successful jailbreaks vs. failed attacks.

Over 99.7% of features show no difference. Jailbreaks are driven by a sparse set: F5209 (Cohen's d=+0.57), F858 (d=+0.51), and F10337 (d=+0.51) are amplified during wins, while F9836 (d=−0.54) — a refusal-associated feature — is suppressed. The adversary has learned a targeted intervention, not a broad perturbation.
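For reference, Cohen's d for a single feature can be computed as below. The pooled-standard-deviation form is assumed, since the text does not say which variant was used, and the activations are toy values.

```python
import math
from statistics import mean, variance

def cohens_d(wins, losses):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(wins), len(losses)
    pooled_var = ((n1 - 1) * variance(wins) + (n2 - 1) * variance(losses)) / (n1 + n2 - 2)
    return (mean(wins) - mean(losses)) / math.sqrt(pooled_var)

# Toy activations of one SAE feature in successful vs. failed conversations.
wins = [1.0, 2.0, 3.0]
losses = [0.0, 1.0, 2.0]
print(cohens_d(wins, losses))  # 1.0
```

A positive d means the feature is more active during wins; a negative d, as for the refusal-associated feature above, means it is suppressed.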

5. Are these vulnerabilities permanent?

Mostly yes.

We track which SAE features the adversary exploits across all 13 rounds of self-play. On average, 75% of the top features change between consecutive rounds — the adversary discovers new routes as old ones are defended. But four features persist across >80% of rounds:

F1511 — present in 13/13 rounds (100%)
F5674 — 12/13 rounds (92%)
F858 — 11/13 rounds (85%) — also top amplified above
F5209 — 11/13 rounds (85%) — also top amplified above

F858 and F5209 appear in both the feature redirection analysis (top differentially activated) and the stability analysis (persistent across rounds). This convergence from two independent analyses suggests these are architectural properties of the victim, not ephemeral strategies.

Left: Jaccard similarity of top-20 features between round pairs. Right: 75% mean turnover, but 4 core features persist.
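The two quantities in this analysis, turnover between consecutive rounds and persistence across rounds, can be sketched with sets. Defining turnover as one minus Jaccard similarity is an assumption, and the feature ids below are toy values rather than the project's data.

```python
from collections import Counter

def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def persistence(top_features_by_round):
    """Fraction of rounds in which each feature appears among the top features."""
    counts = Counter(f for feats in top_features_by_round for f in set(feats))
    n_rounds = len(top_features_by_round)
    return {f: c / n_rounds for f, c in counts.items()}

# Toy top-feature sets for three rounds (hypothetical ids).
rounds = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 3, 6, 7}]
turnover = [1 - jaccard(a, b) for a, b in zip(rounds, rounds[1:])]
core = {f for f, share in persistence(rounds).items() if share > 0.8}
print(sorted(core))  # [1, 3]

# The percentages in the list above follow from the counts: 11/13 and 12/13.
print(round(11 / 13, 2), round(12 / 13, 2))  # 0.85 0.92
```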

6. Can we causally control the compliance decision?

Yes — but only at layer 16, and only with the right direction.

The logit lens shows the compliance decision manifests at layer 31. But does it originate there? We test this with activation steering: adding a scaled direction vector to the residual stream at different layers during multi-turn jailbreak conversations.
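The intervention itself is a one-line vector update. The sketch below shows it on a toy hidden state; normalizing the direction to unit length and treating the steering strength as a scalar `alpha` are assumptions, and in a real model this update would be applied to the residual stream at the chosen layer during generation.

```python
import math

def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction)."""
    norm = math.sqrt(sum(d * d for d in direction))
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

def projection(hidden, direction):
    """Scalar projection of hidden onto the unit-normalized direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(h * d for h, d in zip(hidden, direction)) / norm

# Toy 2-d example (hypothetical values).
hidden = [1.0, 1.0]
direction = [3.0, 4.0]
pushed = steer(hidden, direction, alpha=5.0)
pulled = steer(hidden, direction, alpha=-5.0)

# Steering by alpha shifts the projection onto the direction by exactly alpha.
print(round(projection(pushed, direction) - projection(hidden, direction), 6))  # 5.0
```

Positive and negative `alpha` correspond to pushing the state toward or away from whatever the direction encodes, which is why each direction is tested with both signs.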

We test three types of direction at multiple layers, plus random directions as controls: the outcome direction (the probe direction for predicting which attacks succeed; it is category-specific within category and reaches only LOCO AUC 0.60 across categories, so a naive predictive story would not predict strong cross-category causal effects), the intent direction (the probe direction for detecting adversarial conversations, AUC 0.97), and the contrastive refusal direction (Arditi et al., 2024).

Direction	Layer	Multi-turn ASR	Δ vs 25%
Baseline (no hook)	—	25%	—
Outcome, toward success (+6)	16	36%	+11 pp
Outcome, toward failure (−6)	16	24%	−1 pp
Intent, toward unsafe (+6)	16	37%	+5 pp
Intent, toward safe (−6)	16	19%	−13 pp
Random (+6)	16	19%	−6 pp
Random (−6)	16	30%	+5 pp
Outcome (+6)	20	22%	−3 pp
Arditi refusal (±6)	31	26–27%	±2 pp

100 JBB goals per condition, 5-turn conversations with trained adversary. JBB standard judge via Together API. Random directions confirm specificity.

Two complementary causal levers at layer 16. The outcome direction is the strongest amplifier of compliance (+11 pp), while the intent direction is the strongest suppressor (−13 pp). These serve different functions: one captures the model's vulnerability to the specific attack, the other captures its general safety posture.

Random directions confirm specificity. Random perturbation at L16 does not produce consistent directional effects (19% and 30% for ±6), ruling out the possibility that any noise at this layer controls safety.

Layers 20 and 31 are causally inert. The outcome direction at layer 20 has no clear causal effect (−3 pp, within the range of the random controls). The Arditi refusal direction at layer 31 is inert (±2 pp). The compliance decision is determined at layer 16 and read out at layer 31, and the same L16 direction that is only weakly predictive out-of-category is nonetheless the lever that controls behavior.

The picture that emerges

Three claims that have to be held at the same time. First, the victim's residual stream is a strong intent detector (AUC 0.97): "am I being attacked" is a broadly-shared signal the model encodes from its very first layer.

Second, its representation of which attacks will succeed is category-specific and uneven. Within Disinformation, Privacy, Government decision-making, and Malware/Hacking, a linear probe at L16 turn 0 predicts outcome at AUC 0.68–0.85. Within Fraud, Expert advice, Harassment, and Economic harm, it is at chance. And across categories, LOCO AUC is only 0.60. A monitoring pipeline that naively asked "will this conversation result in a jailbreak?" would work on some harm types and fail on others, with no principled way in advance to tell which.

Third, the mechanism is nonetheless real. At layer 16, activation steering on the outcome direction amplifies compliance by +11 pp; steering on the intent direction suppresses it by −13 pp. The standard Arditi refusal direction at layer 31 is inert. The compliance decision is determined at L16 and read out at L31, and the gap between these layers is why output-level safety interventions fail. The probe's predictive power fragments along harm categories; the causal lever at L16 does not.

How This Differs from Prior Work

	This Work	Zhao et al. (2025)	Bailey et al. (2024)	Liu et al. (2025)
Mechanism	Learned dialogue	Logit arithmetic	Gradient opt.	RL self-play
Access	Black-box	White-box	White-box	Black-box
Turns	5 (multi)	1	1	Multi
Attack success	18–26%	>99%	90%	Varies
Interp analysis	5 obs. + causal	No	Partial	No
Cross-scale	3B → 8B	7B → 70B	Same model	Same model

We do not claim competitive attack performance. Our ASR is far below automated baselines. The contribution is the combination of learned adversarial dialogue with mechanistic and causal analysis of the victim — to our knowledge, the first to show that the standard refusal direction is orthogonal to what mediates multi-turn compliance.

Limitations

Stealth effect is marginal

Pooled p=0.085, significant in 1 of 3 seeds. Larger-scale replication needed.

Why does the refusal direction fail?

The Arditi direction is orthogonal to the probe direction (cos=−0.04). We don't know if this is specific to Llama-3.1-8B, to multi-turn attacks, or to the extraction method.
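The orthogonality figure is a cosine similarity between two direction vectors. A minimal sketch, with toy vectors standing in for the actual directions:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

For scale, two random directions in a 4096-dimensional space have a cosine with standard deviation of about 1/64 ≈ 0.016, so a value of −0.04 is small in absolute terms.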

Small scale

3B attacking 8B. Whether the same dynamics hold at frontier scale (8B→70B or larger) is unknown.

Single model family

All experiments use Llama models. Cross-architecture replication (Qwen, Gemma) is needed.
