Multi-Agent RL & Adversarial Training
In adversarial training, should an agent always face its latest opponent (self-play), or sample from a zoo of historical opponents? The answer depends on a prerequisite most researchers overlook: catastrophic forgetting.
The A parameter is the probability that, for a given training episode, the opponent is sampled from a historical zoo instead of being the latest opponent. It defines a continuous spectrum between two extremes:
A = 0
Pure Self-Play
Always face the latest opponent. Maximum co-evolutionary pressure, but vulnerable to catastrophic forgetting.
0 < A < 1
Hybrid
Mix of latest opponent and historical zoo. Balances co-evolutionary pressure with curriculum diversity.
A → 1
Full Zoo Sampling
Almost always sample from zoo of historical opponents. Maximum diversity, but stale training signal.
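The spectrum above can be sketched as a single sampling rule. This is a minimal illustration, not the code used in the experiments: the function name `sample_opponent`, the uniform draw over zoo members, and the fallback to the latest opponent when the zoo is empty are all assumptions.

```python
import random


def sample_opponent(latest, zoo, a, rng=random):
    """Return a zoo member with probability `a`, otherwise the latest opponent.

    `latest` and the items in `zoo` can be any objects (policies, checkpoint
    paths, ids). All names here are hypothetical.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    # With an empty zoo there is nothing historical to sample, so fall back.
    if zoo and rng.random() < a:
        # Uniform sampling over historical opponents is an assumption.
        return rng.choice(zoo)
    return latest
```

With `a=0.0` the function always returns the latest opponent (pure self-play); as `a` approaches 1, almost every draw comes from the zoo.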
The conventional wisdom is that opponent diversity helps — it prevents overfitting to one opponent. But across four domains and 2,800+ training runs, I found this is only true when a specific prerequisite holds.
The A-parameter study was motivated by a comparison of three discrete training curricula for adversarial robustness in wind farm control, where pure self-play decisively beat Synthetic Self-Play (SSP), a zoo-based protocol. That result raised the question: when does opponent diversity actually help?
Three training topologies: Arms Race (sequential replacement), Synthetic Self-Play (opponent zoo), and Self-Play (concurrent co-evolution). Self-play's dominance in the wind farm study motivated the A-parameter investigation.
Arms Race
Protagonist n trains only against adversary n−1. Each generation discards all prior opponents. Not a value of A — a separate sequential protocol.
Synthetic Self-Play (SSP)
Protagonist trains against a zoo of all prior adversaries, sampled uniformly. A discrete generational protocol — distinct from the A parameter. Provides diversity but no co-evolutionary pressure.
Self-Play
Protagonist and adversary co-evolve simultaneously. Maximum adaptive pressure but risks catastrophic forgetting.
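The three protocols differ mainly in which adversary checkpoints the protagonist is allowed to face. The sketch below encodes only that difference. `opponent_pool`, the protocol strings, and the indexing of self-play by a generation number are hypothetical conveniences, since self-play as described co-evolves continuously and has no generations.

```python
def opponent_pool(protocol, generation, adversaries):
    """Adversary checkpoints a protagonist may face at `generation`.

    `adversaries[k]` is adversary checkpoint k. Names are illustrative only.
    """
    prior = adversaries[:generation]  # adversaries 0 .. generation-1
    if protocol == "arms_race":
        # Only the immediately preceding adversary; older ones are discarded.
        return prior[-1:]
    if protocol == "ssp":
        # Every prior adversary; the caller samples uniformly from this list.
        return prior
    if protocol == "self_play":
        # The adversary currently being trained alongside the protagonist.
        return adversaries[generation:generation + 1]
    raise ValueError(f"unknown protocol: {protocol}")
```

For example, at generation 3 the arms-race pool holds one checkpoint, the SSP pool holds three, and the self-play pool holds the concurrent adversary.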
Zoo sampling only helps when catastrophic forgetting is present.
Specifically, the zoo helps only in games where past adversaries perform better against later protagonists than later adversaries do. This is the signature of catastrophic forgetting: the protagonist "forgets" how to handle old strategies as it specializes against the current opponent.
The Structural Test
Does the game's Nash equilibrium coincide with the best response to weak/random opponents?
Yes → Zoo helps
Training against weak historical opponents still pushes toward the equilibrium. Diversity provides a historical anchor that prevents cycling.
No → Zoo hurts
Best response to weak opponents is exploitative, not Nash-seeking. Zoo of weak opponents teaches the wrong strategy entirely.
This test predicts whether adversarial diversity is a feature or a bug — before you run a single experiment.
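For a small matrix game the structural test can be checked directly. The sketch below is an illustration under stated assumptions: the game is a two-player zero-sum matrix game, the "weak/random" opponent is modelled as a fixed mixed strategy, and the helper names (`expected_payoff`, `nash_is_best_response`) are hypothetical.

```python
# Row player's payoff in Rock-Paper-Scissors; rows and columns are R, P, S.
RPS = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def expected_payoff(payoff, row_strategy, col_strategy):
    """Expected row-player payoff for two mixed strategies."""
    return sum(
        p * q * payoff[i][j]
        for i, p in enumerate(row_strategy)
        for j, q in enumerate(col_strategy)
    )


def nash_is_best_response(payoff, nash, weak_opponent, tol=1e-9):
    """True if `nash` earns the best-response value against `weak_opponent`."""
    n = len(payoff)
    pure_values = [
        expected_payoff(payoff, [1.0 if k == i else 0.0 for k in range(n)], weak_opponent)
        for i in range(n)
    ]
    return expected_payoff(payoff, nash, weak_opponent) >= max(pure_values) - tol
```

Against a uniformly random RPS opponent every strategy earns zero, so the uniform Nash strategy passes the check, which matches the "Yes" outcome described for RPS. The result depends on how the weak opponent is modelled: against a deterministic always-Rock opponent the same check returns `False`.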
I tested the A parameter across three domains spanning discrete games and continuous control. The results split cleanly along the forgetting prerequisite, and the contrast between Rock-Paper-Scissors (RPS) and Kuhn Poker is what motivates Forgetting Regret as a predictive metric.
Self-play in RPS causes strategy cycling (Rock → Paper → Scissors → Rock) with no convergence to Nash equilibrium. The gauntlet matrix below shows each checkpoint played against every other — the banded structure reveals cycling where early checkpoints periodically beat later ones.
RPS gauntlet matrix (A=0). Each cell shows protagonist checkpoint (row) win rate vs adversary checkpoint (column). The banded red/blue structure shows strategy cycling — later checkpoints periodically lose to earlier ones. This is the signature of catastrophic forgetting in RPS.
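A gauntlet matrix of this kind can be sketched for RPS when each checkpoint is represented by a mixed strategy. This is an illustrative reconstruction rather than the evaluation code behind the figure: cells hold expected payoff in [-1, 1] where the figure uses win rate, and `gauntlet` and `late_loss_fraction` are hypothetical names.

```python
# Row player's payoff in Rock-Paper-Scissors; rows and columns are R, P, S.
RPS_PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def expected_payoff(row_strategy, col_strategy, payoff=RPS_PAYOFF):
    """Expected row-player payoff for two mixed strategies."""
    return sum(
        p * q * payoff[i][j]
        for i, p in enumerate(row_strategy)
        for j, q in enumerate(col_strategy)
    )


def gauntlet(protagonists, adversaries, payoff=RPS_PAYOFF):
    """W[i][j] = protagonist checkpoint i's expected payoff vs adversary checkpoint j."""
    return [[expected_payoff(p, q, payoff) for q in adversaries] for p in protagonists]


def late_loss_fraction(W):
    """Share of cells where a later protagonist (i > j) loses to an earlier adversary."""
    cells = [W[i][j] for i in range(len(W)) for j in range(i)]
    return sum(v < 0 for v in cells) / len(cells) if cells else 0.0
```

If adversaries cycle through Rock, Paper, Scissors and each protagonist simply best-responds to its own adversary, the matrix is banded and `late_loss_fraction` is well above zero; a matrix in which later checkpoints are monotonically stronger gives zero.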
Exploitability vs A. PPO (memoryless, blue) shows high variance and degradation at extreme A values. Buffered (replay buffer, orange) maintains low exploitability across all A values. The replay buffer acts as internal memory, reducing dependence on environmental diversity.
Strategy trajectories on the simplex. At A=0 (self-play), PPO cycles wildly around Nash. As A increases, trajectories tighten around the Nash equilibrium (center). By A=0.7–0.9, strategies converge tightly. Dark dots = later training, + = Nash.
Key finding
PPO's A-curve is steeper than Buffered's: memoryless algorithms need more environmental diversity. At moderately high A (0.7–0.8), PPO achieves near-zero exploitability. Buffered stays low regardless, because its replay buffer provides the historical anchor internally.
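For readers unfamiliar with the metric, exploitability in RPS can be computed in a few lines. The sketch uses one common definition for a symmetric two-player zero-sum matrix game with game value 0 (the payoff a best-responding opponent earns against the strategy). Whether the plots above use exactly this definition is an assumption.

```python
# Row player's payoff in Rock-Paper-Scissors; rows and columns are R, P, S.
RPS_PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def exploitability(strategy, payoff=RPS_PAYOFF):
    """Payoff a best-responding opponent earns against `strategy` (0 at Nash)."""
    n_cols = len(payoff[0])
    # Row player's expected payoff against each pure column response.
    row_values = [
        sum(p * payoff[i][j] for i, p in enumerate(strategy))
        for j in range(n_cols)
    ]
    # Zero-sum: the opponent's gain is the negative of the row player's payoff.
    return max(-v for v in row_values)
```

The uniform strategy has exploitability 0, always-Rock has exploitability 1, and a strategy that cycles around the simplex without settling keeps a positive value.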
The critical negative result. In Kuhn Poker, there is no catastrophic forgetting — the gauntlet matrix shows a smooth gradient where later checkpoints consistently beat earlier ones. No cycling, no collapse. Contrast this with the banded RPS gauntlet above.
Kuhn Poker gauntlet (A=0, 5-seed average). A smooth gradient from green (lower-left: later agents beat earlier opponents) to red (upper-right: earlier agents lose to later opponents). No banding, no cycling — later checkpoints are monotonically stronger. This is the signature of no catastrophic forgetting.
Monotonic convergence. Exploitability decreases steadily for all A values. Self-play (A=0, gray) converges fastest. Higher A values slow convergence by diluting the co-evolutionary signal with weak historical opponents.
Kuhn Poker A-curve. Exploitability rises monotonically with A for both PPO and Buffered. A=0 (self-play) achieves the lowest exploitability. At A=0.9, exploitability nearly triples. PPO and Buffered are identical — memory capacity is irrelevant when forgetting isn't the bottleneck.
Every RPS finding inverts
Where RPS showed PPO steeper than Buffered (memory matters), Kuhn Poker shows them identical (memory irrelevant). Where RPS showed an optimal A (written A*) above 0, Kuhn Poker shows A* = 0. The prerequisite, catastrophic forgetting, determines which regime you're in.
This contrast motivates Forgetting Regret
RPS has forgetting and zoo helps. Kuhn Poker has no forgetting and zoo hurts. If we could measure forgetting from a single baseline run, we could predict whether zoo sampling is worth trying — without running an expensive A-sweep. That metric is Forgetting Regret (FR).
Tag is a continuous-action game in which a seeker chases a hider in a bounded arena with obstacles. An initial study with default hyperparameters (2,800 runs, 20 configurations) suggested that zoo sampling helped in 18 of 20 configurations. A follow-up hyperparameter-optimization (HPO) study with Optuna-tuned hyperparameters (150 runs: 5 reward presets × 2 algorithms × 5 A values × 3 seeds, plus a 2,500-matchup cross-evaluation gauntlet) overturned this result.
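The run count of the follow-up study can be checked by enumerating the grid. The preset names, the specific A values, and the seed numbers below are placeholders, because the text gives only the counts.

```python
from itertools import product

# Placeholder labels: the source states counts only, not names or values.
reward_presets = [f"preset_{k}" for k in range(5)]
algorithms = ["PPO", "SAC"]
a_values = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical; only "5 A values" is given
seeds = [0, 1, 2]

runs = list(product(reward_presets, algorithms, a_values, seeds))
configs = list(product(reward_presets, algorithms, a_values))

print(len(runs))     # 5 * 2 * 5 * 3 training runs
print(len(configs))  # distinct configurations before seeding
```

The grid yields 150 runs over 50 seed-free configurations. The 2,500 gauntlet matchups equal 50 × 50, which would be consistent with every configuration meeting every other, though the text does not say how the matchups were formed.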
Tag gameplay. The seeker (red) chases the hider (blue) in a bounded arena with obstacles and a central safe zone.
A has no effect on agent strength or forgetting
With Optuna-tuned hyperparameters, pure self-play (A=0) produces agents just as strong as full zoo training (A=1) for both PPO and SAC. The initial positive result was a confound: default hyperparameters held back PPO, making zoo sampling appear helpful when it was actually compensating for bad optimization.
Algorithm choice is the real signal
SAC dominates PPO 95-to-2 in cross-algorithm play, regardless of A value. SAC agents appear to "fail" during training (15% seeker win rate, oscillating) but produce dramatically stronger transferable policies. Training seeker win rate (SWR) is a poor proxy for agent quality.
Massive forgetting, but zoo doesn't help
SAC exhibits FR=0.357 (100% of runs show substantial forgetting), yet zoo sampling has no effect. Self-play oscillation in SAC already creates sufficient behavioral diversity naturally — the zoo's historical anchor is redundant.
Hypothesis: A* is inversely proportional to an algorithm's effective memory capacity — conditional on catastrophic forgetting being present.
Adversarial robustness requires a historical anchor — exposure to past opponent strategies that prevents catastrophic forgetting. This anchor can live in the environment's sampling (A > 0) or the algorithm's memory (replay buffers, cumulative datasets). More internal memory → less environmental sampling needed → A* moves toward 0.
Status: Falsified
The Tag HPO study provides a decisive counterexample. SAC exhibits massive catastrophic forgetting (FR=0.357, 100% of runs) yet A has zero effect on agent strength or forgetting regret. The forgetting prerequisite is necessary but not sufficient. Self-play oscillation in SAC creates enough behavioral diversity internally that the zoo's historical anchor is redundant. The initial positive result in Tag was a hyperparameter confound, not evidence for the theory.
The hypothesis held in RPS (a toy domain where forgetting manifests as simple cycling) and correctly predicted the Kuhn Poker negative result (no forgetting → zoo hurts). But it fails in Tag, where forgetting is present yet zoo sampling offers no benefit. The likely explanation: algorithm architecture matters more than training curriculum. SAC's replay buffer and off-policy learning create sufficient internal diversity that external zoo sampling is redundant.
Memoryless
PPO
Learns only from trajectories it just collected. In forgetting-prone games, A* is strictly > 0. PPO requires the environment to provide the historical anchor it lacks internally.
Replay Buffer
SAC / Buffered
Replay buffer provides short-term memory — a shock absorber, not a solution. Smoother A-curve than PPO. Still degrades over long horizons due to FIFO eviction and off-policy decay.
Cumulative Dataset
LLMs (LoRA fine-tuning)
Successful attacks appended to a capped FIFO training set (e.g., 200 examples). Same treadmill dynamics as a replay buffer — early successes get evicted as the victim hardens.
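The eviction behaviour of a capped FIFO training set is easy to see with Python's `collections.deque`. The cap of 200 comes from the example in the text; the integer attack ids and the helper name `add_attacks` are placeholders, not the actual fine-tuning pipeline.

```python
from collections import deque

CAP = 200  # example cap from the text


def add_attacks(dataset, attacks):
    """Append successful attacks; the deque silently evicts the oldest beyond its cap."""
    dataset.extend(attacks)
    return dataset


# Placeholder data: integer ids stand in for attack examples.
train_set = add_attacks(deque(maxlen=CAP), range(250))

print(len(train_set))  # size never exceeds the cap
print(train_set[0])    # oldest surviving example
```

After 250 appends the set holds ids 50 through 249: the 50 earliest successes are gone, which is the treadmill effect described above.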
| Domain | Forgetting? | Nash = BR(weak)? | Zoo helps? | PPO = Buffered? |
|---|---|---|---|---|
| RPS | Yes (cycling) | Yes | Yes | No (PPO steeper) |
| Kuhn Poker | No | No | No (2.6x worse) | Yes (identical) |
| Tag (HPO) | Yes (SAC FR=0.36) | — | No (A has no effect) | No (SAC >> PPO) |
Rather than running expensive grid searches over A, I developed Forgetting Regret (FR) — a metric computed from a single self-play baseline (A=0) that predicts whether zoo sampling will help.
How it works
Given a gauntlet matrix W where W[i,j] = protagonist checkpoint i's performance vs adversary checkpoint j:
What it measures
FR directly measures the value of the zoo's best member. For any historical adversary j, M[i,j] is the performance you could recover by pulling the best checkpoint from the zoo. FR is the gap between that counterfactual and what self-play actually delivers.
FR = 0 → no forgetting → A* ≈ 0
FR >> 0 → strong forgetting → high A*
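The following sketch is one reading of the description above, not a confirmed definition. It assumes that M[i,j] is the best score any protagonist checkpoint up to i achieved against adversary j, and that FR averages M[i,j] − W[i,j] over historical adversaries (j ≤ i). The function name and the averaging choice are assumptions.

```python
def forgetting_regret(W):
    """Mean gap between the best earlier protagonist and the current one.

    Assumed definition (not confirmed by the source):
      M[i][j] = max over k <= i of W[k][j]
      FR      = mean over all j <= i of (M[i][j] - W[i][j])
    `W` is a square list of lists: protagonist checkpoint (row) vs adversary (column).
    """
    n = len(W)
    total, count = 0.0, 0
    for j in range(n):
        best = float("-inf")
        for i in range(n):
            best = max(best, W[i][j])  # running max gives M[i][j]
            if j <= i:  # only adversaries already seen by checkpoint i
                total += best - W[i][j]
                count += 1
    return total / count if count else 0.0
```

Under this reading, a gauntlet in which later protagonists are monotonically stronger gives FR = 0, while a banded, cycling gauntlet gives a clearly positive value.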
The A-parameter investigation spanned four domains and thousands of training runs. While the central hypothesis was falsified, the cross-domain comparison produced several durable insights:
Algorithm architecture > training curriculum
SAC dominates PPO 95-to-2 in Tag regardless of zoo mixing. The replay buffer and off-policy updates create sufficient internal diversity that environmental curriculum design is secondary. This suggests that improving the learning algorithm is a more reliable path to robustness than diversifying the training data.
Training metrics can be misleading
SAC's 15% training seeker win rate masks stronger transferable policies, while PPO's balanced 40% win rate masks weaker agents. This may matter for LLM safety: a model that appears to behave well during training may be weaker than one that oscillates.
Forgetting is necessary but not sufficient
The forgetting prerequisite correctly predicted the Kuhn Poker result (no forgetting → zoo hurts) but failed in Tag (forgetting present → zoo still irrelevant). Catastrophic forgetting is a symptom, not the root cause. The underlying question is whether the algorithm can generate diversity internally.
Hyperparameter confounds are real
The initial 2,800-run study showed zoo helping in 18/20 Tag configurations. The HPO study with Optuna-tuned parameters overturned this completely. Without hyperparameter optimization, curriculum effects can be artifacts of bad defaults.
Does FR predict A* across the Tag configuration grid?
Answered: No. With Optuna-tuned hyperparameters, A has no effect on agent strength for either PPO or SAC. There is no A* to predict. The earlier correlation between FR and zoo benefit was a hyperparameter confound.
Is catastrophic forgetting sufficient for zoo to help?
Answered: No. SAC in Tag has FR=0.357 (massive forgetting) but zoo has zero effect. The forgetting prerequisite is necessary (Kuhn Poker) but not sufficient (Tag). Self-play oscillation can create enough diversity internally.
What survives from this research program?
The Forgetting Regret metric, the forgetting prerequisite as a necessary condition, the RPS/Kuhn Poker contrast, and the methodological lesson that hyperparameter confounds can masquerade as curriculum effects. The negative result is itself a contribution.