Mechanistic interpretability, adversarial fortification, and auditing frontier LLMs
The outcome signal is category-specific — and real.
A 3B-parameter attacker model attempts to jailbreak an 8B-parameter victim over 5-turn dialogues on JailbreakBench (JBB) tasks. Larger models judge jailbreak success. I compare the geometry and steering effects of layer-specific residual-stream directions derived from linear probes and from differences of class means. I also compare directions derived from JBB success with directions derived from perceived harm added to the world.
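The difference-of-means direction and the steering step can be sketched as follows. This is a minimal, dependency-free illustration, not the pipeline used in the project: the function names, the toy 2-dimensional "activations", and the steering strength `alpha` are all hypothetical, and real residual-stream activations would be high-dimensional tensors captured at a chosen layer.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diff_of_means_direction(pos_acts, neg_acts):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    mu_pos, mu_neg = mean_vector(pos_acts), mean_vector(neg_acts)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; direction is undefined")
    return [d / norm for d in diff]


def cosine(u, v):
    """Cosine similarity, used here to compare two candidate directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def steer(activation, direction, alpha):
    """Add alpha times the direction to one activation vector."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy data (hypothetical): activations from successful vs. failed dialogues.
success_acts = [[2.0, 0.0], [4.0, 0.0]]
failure_acts = [[0.0, 0.0], [0.0, 0.0]]

success_dir = diff_of_means_direction(success_acts, failure_acts)
other_dir = [0.0, 1.0]  # stand-in for a direction derived from a different label

print(success_dir)
print(cosine(success_dir, other_dir))
print(steer([1.0, 1.0], success_dir, alpha=2.0))
```

A cosine near zero between two such directions would suggest that the two labels are encoded along nearly orthogonal axes at that layer; a probe-derived direction can be compared in the same way by passing its normalized weight vector to `cosine`.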
LLM Exception Handling Audit
27 frontier and open-weight models audited for overly broad try-except blocks in ambiguous scientific coding tasks. Several frontier models tend to write code that silently swallows errors.
# similar fails on 100% of 20 seeds
try:
    Si = sobol.analyze(problem, Y)
    S1_maps[:, i, j] = Si['S1']
    ST_maps[:, i, j] = Si['ST']
except Exception as e:
    S1_maps[:, i, j] = 0  # silent fabrication
    ST_maps[:, i, j] = 0  # "no influence"
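For contrast, a narrower handler might look like the sketch below. It is one possible alternative under stated assumptions, not the audit's reference solution: the helper name `analyze_or_nan`, the choice of which exception types count as "expected", and the stand-in analysis functions are all hypothetical. The idea is to catch only anticipated failures, mark the cell as missing with NaN rather than 0, and record the failure so it stays visible.

```python
import math


def analyze_or_nan(analyze, problem, Y, n_params, failures, cell):
    """Run one sensitivity analysis; on an expected failure return NaNs and log it.

    `analyze` is any callable with the shape analyze(problem, Y) -> dict
    containing "S1" and "ST". Which exceptions are "expected" is an
    assumption made for this sketch; anything else propagates to the caller.
    """
    try:
        Si = analyze(problem, Y)
    except (ValueError, ZeroDivisionError) as exc:
        failures.append((cell, repr(exc)))  # keep the failure visible
        nan_row = [math.nan] * n_params     # NaN means "missing", not "no influence"
        return nan_row, list(nan_row)
    return list(Si["S1"]), list(Si["ST"])


# Stand-in analysis functions (hypothetical, for illustration only).
def failing_analysis(problem, Y):
    raise ValueError("degenerate sample")


def working_analysis(problem, Y):
    return {"S1": [0.1, 0.2], "ST": [0.3, 0.4]}


failures = []
print(analyze_or_nan(failing_analysis, {}, [], 2, failures, cell=(0, 0)))
print(analyze_or_nan(working_analysis, {}, [], 2, failures, cell=(0, 1)))
print(failures)
```

Because NaN propagates through later arithmetic and shows up as gaps in plots, a failed cell cannot be mistaken for a genuine zero sensitivity index.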
Sensor Corruption in Safety-Critical Control
RL adversaries inject worst-case sensor corruption into fleet control systems. Self-play training recovers from a 39% power loss to a 7.9% power gain, maintaining performance in both clean and hostile environments without an alignment tax.
A decade of safety-critical systems at NREL and DTU Wind.
Novel stochastic gradient descent for 1,200+ unit fleet optimization. 95% faster than legacy methods.
Machine-readable ontology adopted by five major platforms. Standardized data exchange for energy systems.
Optimization under uncertainty reducing extreme loads by 47%. Cited in 100+ papers.
Physics-constrained GNO enabling zero-shot generalization. Geometric deep learning for flow modeling.
Separating epistemic from aleatoric uncertainty to define safe operating boundaries. Expanded certified safe domain by 72%.
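One common way to separate the two kinds of uncertainty is an ensemble decomposition based on the law of total variance: the average of the members' predicted variances approximates aleatoric (data) noise, and the spread of the members' predicted means approximates epistemic (model) uncertainty. The sketch below assumes that setup; the source does not state which method was used, and the function names and the `max_epistemic` threshold are hypothetical.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(member_means, member_variances):
    """Split predictive variance at one input into aleatoric and epistemic parts.

    member_means:     predicted mean from each ensemble member
    member_variances: predicted noise variance from each ensemble member
    """
    aleatoric = fmean(member_variances)  # expected noise variance
    epistemic = pvariance(member_means)  # disagreement between members
    return {
        "aleatoric": aleatoric,
        "epistemic": epistemic,
        "total": aleatoric + epistemic,
    }


def within_safe_domain(member_means, member_variances, max_epistemic):
    """Treat a point as inside the safe domain if model disagreement is small."""
    parts = decompose_uncertainty(member_means, member_variances)
    return parts["epistemic"] <= max_epistemic


# Three ensemble members evaluated at one operating point (toy numbers).
parts = decompose_uncertainty([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(parts)
print(within_safe_domain([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], max_epistemic=0.1))
```

Thresholding on the epistemic term alone reflects the idea that irreducible noise can be planned for, whereas high model disagreement signals an operating point the model has not learned well.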
Anomaly detection in safety-critical physical systems. Signal processing meets machine learning.
A pursuit-evasion environment with pillars, vision rays, and safe zones. Seeker and hider agents learn tactics through 36-ray line-of-sight and obstacle geometry. 150 Optuna-tuned runs, 2,500-matchup gauntlet evaluation.
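A 36-ray line-of-sight observation can be sketched with a ray-circle intersection test. This is an illustrative reconstruction, not the environment's actual code: it assumes circular pillars, an agent positioned outside every pillar, evenly spaced rays, and a hypothetical `max_range` sensor limit.

```python
import math


def ray_circle_distance(origin, angle, center, radius, max_range):
    """Distance along a ray to a circular pillar, or max_range if it misses.

    Assumes the origin lies outside the circle.
    """
    ox, oy = origin
    cx, cy = center
    dx, dy = math.cos(angle), math.sin(angle)
    fx, fy = cx - ox, cy - oy
    t_closest = fx * dx + fy * dy                 # projection of the center onto the ray
    d2 = fx * fx + fy * fy - t_closest * t_closest  # squared perpendicular distance
    if d2 > radius * radius:
        return max_range                          # ray passes beside the pillar
    half_chord = math.sqrt(max(radius * radius - d2, 0.0))
    t_hit = t_closest - half_chord                # nearer of the two intersections
    if t_hit < 0.0:
        return max_range                          # pillar is behind the agent
    return min(t_hit, max_range)


def cast_rays(origin, heading, pillars, n_rays=36, max_range=10.0):
    """Return one distance per ray, spaced evenly over a full circle."""
    distances = []
    for k in range(n_rays):
        angle = heading + 2.0 * math.pi * k / n_rays
        nearest = min(
            (ray_circle_distance(origin, angle, c, r, max_range) for c, r in pillars),
            default=max_range,
        )
        distances.append(nearest)
    return distances


# One pillar of radius 1 centered 5 units ahead of the agent.
observation = cast_rays(origin=(0.0, 0.0), heading=0.0, pillars=[((5.0, 0.0), 1.0)])
print(len(observation), observation[0], observation[18])
```

The resulting distance vector is the kind of fixed-length observation a seeker or hider policy could consume; visibility of another agent can be checked by comparing its distance with the ray distance in its direction.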
A single transformer-based policy generalizes across unseen physical topologies without retraining. 66.6% zero-shot transfer, 2.3x faster fine-tuning.
A 33K-parameter CNN predicts robot manipulation failure under camera corruption — and generalizes to unseen corruption types. Mean AUROC 0.922 across 6 OOD conditions.
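A mean AUROC over several out-of-distribution conditions can be computed as sketched below. The pairwise formulation is the standard rank-based definition of AUROC (the probability that a randomly chosen failure scores higher than a randomly chosen success, with ties counted as half); the condition names and scores are made-up toy values, not results from the project.

```python
from statistics import fmean


def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_auroc(per_condition):
    """Average AUROC over a dict of condition -> (scores, labels)."""
    return fmean(auroc(scores, labels) for scores, labels in per_condition.values())


# Toy failure-prediction scores for two hypothetical corruption types.
conditions = {
    "condition_a": ([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]),
    "condition_b": ([0.9, 0.2, 0.3, 0.1], [1, 1, 0, 0]),
}
print(mean_auroc(conditions))
```

Averaging per-condition AUROCs, rather than pooling all samples, keeps one easy corruption type with many samples from masking poor performance on a harder one.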