Julian Quick

I design systems that validate whether AI is trustworthy by finding where complex systems break and by explicitly questioning assumptions.

Current Research

Mechanistic interpretability, adversarial fortification, and auditing frontier LLMs

Figure: Outcome-probe within-category AUC by JailbreakBench harm category. Disinformation (0.85), Privacy (0.78), Government (0.70), and Malware (0.68) are learnable; Fraud, Expert advice, Harassment, and Economic harm are at chance level.
LLM Red Teaming Mechanistic Interpretability

Turnstile

The outcome signal is category-specific — and real.

A 3B-parameter attacker attempts to jailbreak an 8B-parameter victim over 5-turn dialogues on JailbreakBench (JBB) tasks, and larger models judge whether each jailbreak succeeded. I compare the geometry and steering effects of layer-specific residual-stream directions derived in two ways: from linear probes and from differences of class means. I also compare directions derived from JBB success with directions derived from perceived harm added to the world.

0.85 within-cat AUC (Disinformation)
+11pp L16 outcome steering
−13pp L16 intent steering
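
The two ways of deriving a direction mentioned above (a linear probe and a difference of class means) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the activations are synthetic Gaussian data with a planted direction, the probe is a small logistic regression fit by plain gradient descent, and names such as `mean_diff_direction`, `probe_direction`, and `steer` are hypothetical. It is not the Turnstile pipeline and does not reproduce its numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for residual-stream activations at one layer.
# d_model, n, and the planted direction are illustrative assumptions.
d_model, n = 64, 400
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
labels = rng.integers(0, 2, size=n)  # 1 = outcome judged positive
acts = rng.normal(size=(n, d_model)) + 2.0 * labels[:, None] * true_dir


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def mean_diff_direction(x, y):
    """Difference of the two class means, normalized."""
    return unit(x[y == 1].mean(axis=0) - x[y == 0].mean(axis=0))


def probe_direction(x, y, lr=0.1, steps=500, l2=1e-2):
    """Normalized weight vector of a logistic-regression probe (gradient descent)."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        w -= lr * (x.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return unit(w)


def steer(x, direction, alpha):
    """Add alpha * direction to every activation vector."""
    return x + alpha * direction


md = mean_diff_direction(acts, labels)
pr = probe_direction(acts, labels)

# Geometry: how closely the two derived directions agree.
cosine = float(md @ pr)

# Steering: adding alpha * md moves the mean projection onto md by alpha.
shift = float((steer(acts, md, 3.0) @ md).mean() - (acts @ md).mean())
print(f"cosine(mean-diff, probe) = {cosine:.2f}, projection shift = {shift:.2f}")
```

In a real model the steering vector would be added to the residual stream at the chosen layer during generation, and the effect would be measured on judged outcomes rather than on projections.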
LLM Safety COLM 2026

Silent Killers

LLM Exception Handling Audit

27 frontier and open-weight models audited for overly broad try-except blocks in ambiguous scientific coding tasks. Several frontier models tend to write code that silently swallows errors.

27 models audited
1,620 code samples
7 model families
# similar fails on 100% of 20 seeds
try:
    Si = sobol.analyze(problem, Y)
    S1_maps[:, i, j] = Si['S1']
    ST_maps[:, i, j] = Si['ST']
except Exception as e:
    S1_maps[:, i, j] = 0  # silent fabrication
    ST_maps[:, i, j] = 0  # "no influence"
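
One way to flag handlers like the one above automatically is a small static check. The sketch below is an assumption about how such a check could look, not the audit's actual tooling: it uses Python's standard `ast` module to report `except` clauses that are bare or catch `Exception`/`BaseException` without re-raising. The function name `broad_handlers` and the sample source are hypothetical.

```python
import ast

# Exception types treated as "overly broad" in this sketch.
BROAD = {"Exception", "BaseException"}


def broad_handlers(source: str) -> list[int]:
    """Return line numbers of broad `except` clauses that never re-raise."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        if node.type is None:
            is_broad = True  # bare `except:`
        else:
            caught = node.type
            elts = caught.elts if isinstance(caught, ast.Tuple) else [caught]
            is_broad = any(
                isinstance(e, ast.Name) and e.id in BROAD for e in elts
            )
        # A handler that re-raises does not swallow the error.
        reraises = any(isinstance(n, ast.Raise) for n in ast.walk(node))
        if is_broad and not reraises:
            hits.append(node.lineno)
    return sorted(hits)


sample = '''
try:
    Si = sobol.analyze(problem, Y)
except Exception as e:
    S1 = 0
try:
    risky()
except ValueError:
    S1 = float("nan")
'''

print(broad_handlers(sample))
```

The check is purely syntactic: it parses the code without running it, so it cannot tell whether a fallback value such as `0` is a fabricated result or a legitimate default. That judgment still needs a human or a second pass.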
Self-play adversarial training
Adversarial RL Torque 2026

Adversarial Robustness via Self-Play

Sensor Corruption in Safety-Critical Control

RL adversaries inject worst-case sensor corruption into fleet control systems. Self-play training recovers from a 39% power loss to a 7.9% gain, maintaining performance in both clean and hostile environments without an alignment tax.

+7.9% worst-case recovery
3 training paradigms

Wind Energy Systems Research

A decade of work on safety-critical systems at NREL and DTU Wind.

Additional Research