AI Safety Research

Adversarial Robustness:
From Reinforcement Learning to LLM Safety

I build adversarial systems that stress-test AI models, then look inside those models to understand why they fail. My research spans multi-agent RL games, automated LLM red-teaming, and mechanistic detection of deceptive behavior, unified by a single question: when does adversarial pressure make systems safer, and when does it make them more dangerous?

The Research Program

Detection

Find the failures

Audit LLM-generated code for silent error suppression. Build adversarial agents that independently discover real jailbreak strategies.

Mechanism

Understand why

Use sparse autoencoders to probe model internals. Test whether models "know" they're misbehaving before they do it.

Theory

Predict when it helps

Characterize the conditions under which adversarial training improves robustness vs. when it backfires. Not all pressure is equal.

Cross-Scale Adversarial Steering

The Victim Knows It's Being Attacked. It Complies Anyway.

A DPO-trained 3B adversary learns to jailbreak an 8B victim through 5-turn conversations, evaluated on JailbreakBench-100. A linear probe on the victim's hidden states detects the attack with AUC 0.97, yet the attack still succeeds 26% of the time against a frozen victim. Five mechanistic interpretability analyses reveal why: safety representations are distributed across all layers, but the compliance decision lives exclusively at the final layer, where it erodes gradually across conversation turns.

26% ASR (frozen victim)
0.97 probe AUC (per-turn)
+3.1pp stealth advantage
4/16k persistent SAE features

Distributed knowledge, localized action

Probes detect jailbreak intent at every layer (AUC 0.77–0.84). But the logit lens shows the compliance decision manifests only at layer 31, where the refusal signal flips from −0.45 to +0.11 across turns. The model knows broadly but acts only at the end.
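
The AUC figures quoted here can be read as the probability that a probe scores a randomly chosen attack turn above a randomly chosen benign turn. A minimal sketch of that pairwise definition follows. The function name roc_auc and the toy scores are illustrative assumptions, not the probe or the data from these experiments.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Labels are 1 (attack) or 0 (benign).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy probe outputs (hypothetical, for illustration only).
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
print(roc_auc(toy_scores, toy_labels))  # 0.75
```

The pairwise loop is quadratic in the number of examples, which is fine for a sanity check. A rank-based implementation would be the usual choice at scale.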

Architectural vulnerabilities persist

4 of 16,384 SAE features appear in >80% of adversarial rounds. Two of these (F858, F5209) are also the top differentially-activated features between wins and losses — convergent evidence from independent analyses pointing to structural weaknesses.
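
The persistence count above amounts to asking which features are active in more than 80% of rounds. A minimal sketch of that bookkeeping, assuming each round has already been reduced to a set of active feature ids. How "active" is thresholded is not specified here, and the function name and toy ids are hypothetical.

```python
from collections import Counter


def persistent_features(active_by_round, threshold=0.8):
    """Return sorted feature ids active in more than `threshold` of rounds."""
    if not active_by_round:
        return []
    counts = Counter()
    for active in active_by_round:
        # set() guards against a feature being listed twice in one round.
        counts.update(set(active))
    n_rounds = len(active_by_round)
    return sorted(f for f, c in counts.items() if c > threshold * n_rounds)


# Toy data: feature 1 fires in 5 of 5 rounds, feature 2 in 4 of 5.
toy_rounds = [{1, 2, 9}, {1, 2}, {1, 7}, {1, 2, 3}, {1, 2}]
print(persistent_features(toy_rounds))  # [1]
```

Feature 2 is excluded because 4 of 5 rounds is exactly 80%, and the criterion is strictly greater than 80%.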

Rapid refusal collapse

Projection onto the refusal direction (Arditi et al., 2024) shows the victim starts more defensive for conversations that will succeed as jailbreaks — then collapses by turn 1. The adversary's opening triggers heightened vigilance that subsequent turns overwhelm.
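
For readers unfamiliar with the technique: Arditi et al. estimate a refusal direction as the difference between mean activations on harmful and harmless prompts, and a hidden state's projection onto that direction serves as a scalar refusal signal. The sketch below is a simplified illustration with toy 2-D vectors. It assumes a single layer and token position, and all names and numbers are hypothetical rather than taken from these experiments.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("harmful and harmless means coincide")
    return [x / norm for x in diff]


def project(activation, direction):
    """Scalar projection of one activation onto the unit direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy activations (hypothetical): higher projection = more refusal-like.
toy_harmful = [[2.0, 0.0], [4.0, 0.0]]
toy_harmless = [[0.0, 1.0], [0.0, -1.0]]
direction = refusal_direction(toy_harmful, toy_harmless)

# One hidden state per conversation turn; the projection falls over turns.
per_turn_states = [[3.0, 0.2], [1.0, 0.1], [-0.5, 0.4]]
trajectory = [project(h, direction) for h in per_turn_states]
print(trajectory)
```

Tracking the trajectory per conversation and then splitting by outcome is one way to compare successful and failed jailbreaks turn by turn.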

Defense ultimately wins

Against a hardening victim, stealth and control ASR both converge to ~15% from round 10 onward. The stealth advantage is marginal (+3.1pp, p = 0.061) and strongest in early rounds, before defenses saturate. The asymmetry favors defense.

Code Safety Audit

Silent Killers: Error Suppression in LLM-Generated Code

An audit of 25 models across 7 provider families reveals that LLMs routinely produce exception handlers that silently swallow errors, yielding code that appears to work but produces corrupted results. This is Goodhart's Law applied to code generation: models optimized to produce code that runs learn to hide failures instead of surfacing them. The paper is under review at COLM 2026.

790 unsafe handlers found
25 models audited
87% unsafe rate (complex tasks)
5 failure modes identified
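
One way to picture what flagging an unsafe handler could involve is a small static check over Python source. The sketch below is not the audit's actual methodology. It is a deliberately crude heuristic, assumed for illustration, that flags any except block which neither re-raises nor makes a call (such as logging). The function name and the sample snippet are hypothetical.

```python
import ast


def find_silent_handlers(source):
    """Return line numbers of except blocks that neither raise nor call anything."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # Collect every AST node inside the handler body.
        body_nodes = [n for stmt in node.body for n in ast.walk(stmt)]
        reraises = any(isinstance(n, ast.Raise) for n in body_nodes)
        calls = any(isinstance(n, ast.Call) for n in body_nodes)
        if not reraises and not calls:
            flagged.append(node.lineno)
    return sorted(flagged)


sample = '''
try:
    x = compute()
except:
    x = 0
try:
    y = compute()
except ValueError:
    log.warning("failed")
    raise
'''
print(find_silent_handlers(sample))  # [4]
```

A heuristic like this produces both false positives (a handler that legitimately returns a documented default) and false negatives (a handler that calls something unrelated to error reporting), so it is a triage aid rather than a verdict.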
DeepSeek V3: Sobol index zeroed on failure
try:
    Si = sobol.analyze(self.problem, Y, calc_second_order=False)
    S1[i, j, :] = Si['S1']
    ST[i, j, :] = Si['ST']
except:
    S1[i, j, :] = 0   # "no influence" ≠ "analysis failed"
    ST[i, j, :] = 0

A Sobol index of 0 means "this parameter has no influence" — the exact opposite of what a failed analysis should communicate. Standard benchmarks (pass@k, execution accuracy) don't catch this. The code runs. It just silently lies.
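
For contrast, a handler can keep the loop running without disguising the failure. The sketch below is one possible pattern, not a fix taken from the audit. It uses a hypothetical analyze callable in place of the real Sobol routine, catches only assumed exception types, logs the error, and stores NaN so that downstream code cannot mistake a failed analysis for zero influence.

```python
import logging
import math

logger = logging.getLogger(__name__)


def sobol_or_nan(analyze, n_params):
    """Run `analyze`; on failure return NaN-filled indices plus the error.

    `analyze` is a stand-in callable returning a dict with 'S1' and 'ST'.
    The caught exception types are an assumption, not the library's contract.
    """
    try:
        result = analyze()
        return list(result["S1"]), list(result["ST"]), None
    except (ValueError, ArithmeticError) as exc:
        logger.warning("Sobol analysis failed: %s", exc)
        nan_row = [math.nan] * n_params
        return nan_row, list(nan_row), exc


def failing_analyze():
    # Hypothetical failure used only to exercise the error path.
    raise ValueError("degenerate sample")


s1, st, error = sobol_or_nan(failing_analyze, n_params=3)
print(s1, st, error)
```

NaN propagates through later arithmetic and shows up in plots and summaries, whereas a zero blends in with legitimate results. Returning the exception alongside the values lets the caller decide whether to abort or continue.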

Multi-Agent RL

Custom Adversarial RL Testbed

A continuous-action pursuit-evasion environment I built to stress-test hypotheses about adversarial training dynamics. Two agents — seeker and hider — learn emergent strategies in arenas with obstacles, safe zones, and 36-ray vision.

600+ training runs across PPO & SAC
3,000+ steps/sec across 64 parallel envs
7B total training steps, multi-seed, HPO-tuned

Cheap enough to rigorously test and falsify ideas — the same adversarial engineering approach I now apply to language models.
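
To give a sense of what a 36-ray observation can look like, here is a minimal ray-casting sketch. It is not the testbed's code. It assumes rays are spread evenly over 360 degrees, obstacles are circles given as (x, y, radius), the agent sits outside every obstacle, and distances are capped at a hypothetical max_range.

```python
import math


def cast_rays(origin, heading, obstacles, n_rays=36, max_range=10.0):
    """Return n_rays distances to the nearest circular obstacle, capped at max_range."""
    ox, oy = origin
    distances = []
    for k in range(n_rays):
        angle = heading + 2.0 * math.pi * k / n_rays
        dx, dy = math.cos(angle), math.sin(angle)
        nearest = max_range
        for cx, cy, radius in obstacles:
            # Solve |o + t*d - c|^2 = r^2 for t, with d a unit vector:
            # t^2 + 2*b*t + c = 0, so the near root is t = -b - sqrt(b^2 - c).
            fx, fy = ox - cx, oy - cy
            b = fx * dx + fy * dy
            c = fx * fx + fy * fy - radius * radius
            disc = b * b - c
            if disc < 0.0:
                continue  # the ray's line misses this circle
            t = -b - math.sqrt(disc)
            if 0.0 <= t < nearest:
                nearest = t
        distances.append(nearest)
    return distances


# One obstacle of radius 1 centered 5 units ahead of the agent.
rays = cast_rays(origin=(0.0, 0.0), heading=0.0, obstacles=[(5.0, 0.0, 1.0)])
print(rays[0], rays[18])  # 4.0 10.0
```

Ray 0 points straight at the obstacle and reports the distance to its near edge. Ray 18 points directly away and reports the range cap.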

Best Seeker vs Best Hider: SAC agents in pursuit-evasion, hider nearly uncatchable

What I Think About

A disagreement

Alignment is not a fixed target

The prevailing sentiment in alignment is that there exists some high-dimensional alignment vector we want to push models toward. But human values are conflicted, evolving, and context-dependent. Behavior considered aligned 50 years ago would be unacceptable today. Without this perspective, we risk locking AI systems into 2026 values, which could become a barrier to moral progress.

Constitutional AI enforces static rules. In human society, a constitution is made meaningful by a rich appeals process where grey areas get ironed out. Without an analogous process, AI safeguards will either be overly restrictive or have exploitable gaps — and since frontier labs have strong economic incentives against over-restriction, I expect the gaps will persist. RLHF optimizes toward a snapshot of preferences, not a consistent moral framework.

Instead, I think guardrails should be built on causal modeling of potential harm. If an AI has a fundamental drive to understand consequences before acting, with spelled-out logical steps that make its reasoning auditable, failure modes become detectable. Yes, causal modeling is gameable — but at least requiring explicit reasoning creates a surface for scrutiny.

A core problem

We can't distinguish genuine alignment from sophisticated proxy satisfaction

The main limitation of current AI is trustworthiness. We lack methods that reliably distinguish models that are genuinely aligned from those engaged in complex proxy-factor optimization — and this problem gets harder as the technology matures.

My Silent Killers audit shows this concretely: models optimized to avoid coding errors learn to silently swallow them instead. Is this mismatched pattern-matching or something closer to intentional deception? The line is blurry, and we don't have good tools to tell the two apart.

LLM safety refusals feel similar. In Turnstile, I found that a linear probe on the victim's hidden states detects jailbreaks at AUC 0.97 — yet the victim complies anyway. The logit lens shows the compliance decision lives exclusively at the final transformer layer, where it erodes across conversation turns, while safety representations remain intact at all layers. The model knows it's being attacked but doesn't act on that knowledge.

This representation-behavior gap is a fundamental challenge. Detection alone is insufficient — we need to understand why detection doesn't translate into refusal, and whether the architectural vulnerabilities we've identified (persistent SAE features exploited across 13 rounds of self-play) can be patched.

Background

I spent a decade building robust systems for wind energy — adversarial controllers that survive sensor attacks (18x more robust, 0% alignment tax), stochastic optimizers for 1,200+ turbine layouts, and validation frameworks that predict where systems will fail before they do. I'm now applying the same adversarial engineering mindset to AI systems.

What I bring from engineering

  • Adversarial robustness under real-world constraints
  • Rigorous experimentation (2,800+ runs, multi-seed, cross-domain)
  • Shipping open-source tools used by others (WindIO, TOPFARM)
  • Finding failure modes before deployment, not after

What I'm building toward

  • Scalable adversarial evaluation for frontier models
  • Mechanistic detection of deceptive behavior (SAE-based)
  • Theory connecting adversarial pressure to training outcomes
  • Practical tools that safety teams can actually use

Explore the Work