AI Safety Research
I build adversarial systems that stress-test AI models, then look inside those models to understand why they fail. My research spans multi-agent RL games, automated LLM red-teaming, and mechanistic detection of deceptive behavior, unified by a single question: when does adversarial pressure make systems safer, and when does it make them more dangerous?
Detection
Find the failures
Audit LLM-generated code for silent error suppression. Build adversarial agents that independently discover real jailbreak strategies.
Mechanism
Understand why
Use sparse autoencoders to probe model internals. Test whether models "know" they're misbehaving before they do it.
Theory
Predict when it helps
Characterize the conditions under which adversarial training improves robustness vs. when it backfires. Not all pressure is equal.
Cross-Scale Adversarial Steering
A DPO-trained 3B adversary learns to jailbreak an 8B victim through 5-turn conversations, evaluated on JailbreakBench-100. A linear probe detects the attack at AUC 0.97, yet the victim still complies 26% of the time. Five mechanistic interpretability analyses reveal why: safety representations are distributed across all layers, but the compliance decision lives exclusively at the final layer, where it erodes gradually across conversation turns.
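For readers unfamiliar with DPO, here is a minimal sketch of the standard per-pair objective, not the training code from this project. The function and argument names are hypothetical, and treating a successful jailbreak conversation as the "chosen" sample and a failed one as the "rejected" sample is an assumption about how an adversary might be trained.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))).

    Each argument is the summed log-probability of a full response under the
    policy being trained or under the frozen reference model.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical log-probabilities: the policy favors the chosen sample more than
# the reference does, so the loss falls below log(2), its value at zero margin.
example_loss = dpo_loss(-10.0, -14.0, -12.0, -12.0, beta=0.1)
print(f"loss = {example_loss:.4f}")
```

The loss only rewards the policy for widening the chosen-vs-rejected gap relative to the reference model, which is why `beta` controls how far the adversary can drift from its starting point.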
Distributed knowledge, localized action
Probes detect jailbreak intent at every layer (AUC 0.77–0.84). But the logit lens shows the compliance decision manifests only at layer 31, where the refusal signal flips from −0.45 to +0.11 across turns. The model knows broadly but acts only at the end.
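To make the probing setup concrete, here is a minimal sketch of a linear probe scored by AUC. It is an illustration under stated assumptions, not the probe used in this project: the "hidden states" are synthetic Gaussian vectors, and the probe is a simple difference-of-means direction rather than a trained logistic-regression classifier. All names are hypothetical.

```python
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_mean_difference_probe(pos, neg):
    """Linear probe direction: mean of positive states minus mean of negative states."""
    return [a - b for a, b in zip(mean_vector(pos), mean_vector(neg))]

def probe_score(direction, h):
    """Dot product of a hidden state with the probe direction."""
    return sum(w * x for w, x in zip(direction, h))

def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def make_toy_states(n, dim, shift, rng):
    """Synthetic stand-in for hidden states: unit Gaussian noise, shifted on dimension 0."""
    return [[rng.gauss(shift if d == 0 else 0.0, 1.0) for d in range(dim)]
            for _ in range(n)]

rng = random.Random(0)
train_pos = make_toy_states(200, 8, 1.5, rng)  # stand-in for "attack" states
train_neg = make_toy_states(200, 8, 0.0, rng)  # stand-in for "benign" states
test_pos = make_toy_states(200, 8, 1.5, rng)
test_neg = make_toy_states(200, 8, 0.0, rng)

direction = fit_mean_difference_probe(train_pos, train_neg)
held_out_auc = auc([probe_score(direction, h) for h in test_pos],
                   [probe_score(direction, h) for h in test_neg])
print(f"held-out AUC: {held_out_auc:.2f}")
```

In a layer-wise analysis, the same fit-and-score loop would be repeated on the activations extracted at each layer, giving one AUC per layer.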
Architectural vulnerabilities persist
Four of the 16,384 SAE features appear in >80% of adversarial rounds. Two of these (F858, F5209) are also the top differentially activated features between wins and losses: convergent evidence from independent analyses that points to structural weaknesses.
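The persistence count above can be sketched as follows. This is a hedged illustration, not the project's analysis code: the function name is hypothetical, and the per-round activation sets are invented toy data that merely reuse the feature ids mentioned in the text.

```python
from collections import Counter

def persistent_features(active_by_round, threshold=0.8):
    """Return the sorted ids of features active in more than `threshold` of rounds."""
    counts = Counter()
    for active in active_by_round:
        counts.update(set(active))  # count each feature at most once per round
    n_rounds = len(active_by_round)
    return sorted(f for f, c in counts.items() if c / n_rounds > threshold)

# Invented toy data: each set lists the SAE features active in one round.
rounds = [
    {858, 5209, 12},
    {858, 5209, 40},
    {858, 5209},
    {858, 5209, 77},
    {858, 5209, 12},
]
print(persistent_features(rounds))  # prints [858, 5209]
```

What counts as "active" (for example, an activation threshold per feature) is left out here and would need to be fixed before running this on real SAE outputs.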
Rapid refusal collapse
Projection onto the refusal direction (Arditi et al., 2024) shows that the victim starts out more defensive in conversations that will end as successful jailbreaks, then collapses by turn 1. The adversary's opening triggers heightened vigilance that subsequent turns overwhelm.
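As a rough sketch of the measurement, the refusal direction can be estimated as a normalized difference of mean activations on harmful versus harmless prompts, and each turn's hidden state is then projected onto it. The two-dimensional vectors below are invented toy data, and the function names are hypothetical; real activations would come from a chosen layer and token position of the victim model.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along mean(harmful) - mean(harmless)."""
    diff = [a - b for a, b in zip(mean_vector(harmful_acts),
                                  mean_vector(harmless_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("harmful and harmless means coincide; no direction defined")
    return [x / norm for x in diff]

def projection_by_turn(turn_acts, direction):
    """Scalar projection of each turn's activation onto the direction."""
    return [sum(h_i * d_i for h_i, d_i in zip(h, direction)) for h in turn_acts]

# Invented toy activations (2-D for readability).
harmful = [[2.0, 0.0], [2.0, 2.0]]
harmless = [[0.0, 0.0], [0.0, 2.0]]
direction = refusal_direction(harmful, harmless)

# One hypothetical conversation: one activation per turn.
turns = [[0.9, 0.3], [0.2, 0.1], [-0.1, 0.4]]
projections = projection_by_turn(turns, direction)
print([round(p, 2) for p in projections])
```

With this sign convention, a larger projection means a stronger refusal signal, so a falling sequence across turns corresponds to the erosion described above.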
Defense ultimately wins
Against a hardening victim, the attack success rate (ASR) of both the stealth and control conditions converges to ~15% from round 10 onward. The stealth advantage is marginal (+3.1pp, p=0.061) and strongest in early rounds, before defenses saturate. The asymmetry favors defense.
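For context on what a comparison like this involves, here is a sketch of a two-sided two-proportion z-test on attack success counts. This is an assumption for illustration only: the text does not say which test produced the reported p-value, the counts below are invented, and the function name is hypothetical.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (rate difference, z statistic, two-sided p-value) for two success rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value under the normal approximation: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value

# Invented counts: successful attacks out of total attempts in each condition.
diff, z, p = two_proportion_z_test(60, 100, 40, 100)
print(f"difference = {diff:+.3f}, z = {z:.2f}, p = {p:.4f}")
```

The normal approximation is only reasonable when both conditions have enough successes and failures; with small counts an exact test would be the safer choice.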
Code Safety Audit
An audit of 25 models across 7 provider families reveals that LLMs routinely produce exception handlers that silently swallow errors — code that appears to work but produces corrupted results. Under review at COLM 2026. This is Goodhart's Law applied to code generation: models optimized to produce code that runs learn to hide failures instead of surfacing them.
try:
    Si = sobol.analyze(self.problem, Y, calc_second_order=False)
    S1[i, j, :] = Si['S1']
    ST[i, j, :] = Si['ST']
except:
    S1[i, j, :] = 0  # "no influence" ≠ "analysis failed"
    ST[i, j, :] = 0
A Sobol index of 0 means "this parameter has no influence" — the exact opposite of what a failed analysis should communicate. Standard benchmarks (pass@k, execution accuracy) don't catch this. The code runs. It just silently lies.
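A pattern like this can be flagged statically. The sketch below uses Python's standard `ast` module and a deliberately crude heuristic: a handler is reported if its body neither re-raises nor calls anything (so nothing is logged or escalated). This is an illustrative assumption, not the detection method used in the audit, and the function name and sample are hypothetical.

```python
import ast

def find_silent_handlers(source):
    """Return (line number, reason) for except handlers that look like silent suppression.

    Heuristic: a handler "surfaces" the error only if its body contains a
    `raise` or any function call (for example, a logging call).
    """
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        surfaces = any(
            isinstance(inner, (ast.Raise, ast.Call))
            for stmt in node.body
            for inner in ast.walk(stmt)
        )
        if surfaces:
            continue
        reason = ("bare except" if node.type is None
                  else "handler neither raises nor calls anything")
        findings.append((node.lineno, reason))
    return findings

# Trimmed version of the snippet above; it is only parsed, never executed.
SAMPLE = '''
try:
    Si = sobol.analyze(problem, Y, calc_second_order=False)
    S1[i, j, :] = Si["S1"]
except:
    S1[i, j, :] = 0
'''

print(find_silent_handlers(SAMPLE))  # prints [(5, 'bare except')]
```

The heuristic has obvious blind spots: a handler that calls a function to compute a fallback value is not reported, and a handler that legitimately ignores an expected exception is. It is a starting filter for manual review rather than a verdict.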
Multi-Agent RL
A continuous-action pursuit-evasion environment I built to stress-test hypotheses about adversarial training dynamics. Two agents — seeker and hider — learn emergent strategies in arenas with obstacles, safe zones, and 36-ray vision.
The environment is cheap enough to run that ideas can be rigorously tested and falsified. It is the same adversarial engineering approach I now apply to language models.
GitHub
A disagreement
Alignment is not a fixed target
The prevailing sentiment in alignment is that there exists some high-dimensional alignment vector we want to push models toward. But human values are conflicted, evolving, and context-dependent. Behavior considered aligned 50 years ago would be unacceptable today. Without this perspective, we risk locking AI systems into 2026 values, which could become a barrier to moral progress.
Constitutional AI enforces static rules. In human society, a constitution is made meaningful by a rich appeals process where grey areas get ironed out. Without an analogous process, AI safeguards will either be overly restrictive or have exploitable gaps — and since frontier labs have strong economic incentives against over-restriction, I expect the gaps will persist. RLHF optimizes toward a snapshot of preferences, not a consistent moral framework.
Instead, I think guardrails should be built on causal modeling of potential harm. If an AI has a fundamental drive to understand consequences before acting, with spelled-out logical steps that make its reasoning auditable, failure modes become detectable. Yes, causal modeling is gameable — but at least requiring explicit reasoning creates a surface for scrutiny.
A core problem
We can't distinguish genuine alignment from sophisticated proxy satisfaction
The main limitation of current AI is trustworthiness. We lack methods that reliably distinguish models that are genuinely aligned from those engaged in complex proxy-factor optimization — and this problem gets harder as the technology matures.
My Silent Killers audit shows this concretely: by optimizing to avoid coding errors, LLMs learn to silently swallow them instead. Is this mismatched pattern-matching or something closer to intentional deception? The line is blurry, and we don't have good tools to distinguish.
LLM safety refusals feel similar. In Turnstile, I found that a linear probe on the victim's hidden states detects jailbreaks at AUC 0.97 — yet the victim complies anyway. The logit lens shows the compliance decision lives exclusively at the final transformer layer, where it erodes across conversation turns, while safety representations remain intact at all layers. The model knows it's being attacked but doesn't act on that knowledge.
This representation-behavior gap is a fundamental challenge. Detection alone is insufficient — we need to understand why detection doesn't translate into refusal, and whether the architectural vulnerabilities we've identified (persistent SAE features exploited across 13 rounds of self-play) can be patched.
What I bring from engineering
I spent a decade building robust systems for wind energy: adversarial controllers that survive sensor attacks (18x more robust, 0% alignment tax), stochastic optimizers for 1,200+ turbine layouts, and validation frameworks that predict where systems will fail before they do. I'm now applying the same adversarial engineering mindset to AI systems.
What I'm building toward