Multi-Agent RL & Adversarial Training

The A Parameter:
When Does Adversarial Diversity Help?

In adversarial training, should an agent always face its latest opponent (self-play), or sample from a zoo of historical opponents? The answer depends on a prerequisite most researchers overlook: catastrophic forgetting.

The Core Question

The A parameter controls the probability of sampling an opponent from a historical zoo rather than always playing the latest opponent. It defines a continuous spectrum between two extremes:

A = 0

Pure Self-Play

Always face the latest opponent. Maximum co-evolutionary pressure, but vulnerable to catastrophic forgetting.

0 < A < 1

Hybrid

Mix of latest opponent and historical zoo. Balances co-evolutionary pressure with curriculum diversity.

A → 1

Full Zoo Sampling

Almost always sample from zoo of historical opponents. Maximum diversity, but stale training signal.
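The spectrum above comes down to one random draw per episode. The following is a minimal sketch, not the study's code: the function name `sample_opponent`, the list-based zoo, and uniform sampling within the zoo are all assumptions made for illustration.

```python
import random

def sample_opponent(latest, zoo, a, rng=random):
    """Return a zoo member with probability `a`, otherwise the latest opponent."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    # An empty zoo (start of training) always falls back to the latest opponent.
    if zoo and rng.random() < a:
        return rng.choice(zoo)   # historical opponent, uniform over the zoo (assumed)
    return latest                # current co-evolving opponent
```

With `a=0.0` the draw never selects the zoo, which recovers pure self-play. With `a=1.0` and a non-empty zoo it always does.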

The conventional wisdom is that opponent diversity helps — it prevents overfitting to one opponent. But across four domains and 2,800+ training runs, I found this is only true when a specific prerequisite holds.

Training Topologies

The A-parameter study was motivated by comparing three discrete training curricula for adversarial robustness in wind farm control, where pure self-play decisively beat Synthetic Self-Play (SSP), the zoo-based curriculum. That result raised the question: when does opponent diversity actually help?

Comparison of Arms Race, Synthetic Self-Play, and Self-Play training architectures

Three training topologies: Arms Race (sequential replacement), Synthetic Self-Play (opponent zoo), and Self-Play (concurrent co-evolution). Self-play's dominance in the wind farm study motivated the A-parameter investigation.

Arms Race

Protagonist n trains only against adversary n−1. Each generation discards all prior opponents. Not a value of A — a separate sequential protocol.

Synthetic Self-Play (SSP)

Protagonist trains against a zoo of all prior adversaries, sampled uniformly. A discrete generational protocol — distinct from the A parameter. Provides diversity but no co-evolutionary pressure.

Self-Play

Protagonist and adversary co-evolve simultaneously. Maximum adaptive pressure but risks catastrophic forgetting.
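The three topologies differ mainly in which opponents the protagonist is allowed to see. The sketch below makes that explicit. The function name, the topology strings, and the list-of-checkpoints layout are hypothetical and chosen only for illustration.

```python
def opponent_pool(topology, generation, adversaries, live_adversary=None):
    """Opponents available to the protagonist of `generation` (1-indexed).

    adversaries[k] is the frozen adversary from generation k (hypothetical layout).
    """
    if topology == "self_play":
        # Co-evolution: the only opponent is the adversary currently training.
        return [live_adversary]
    if generation < 1 or generation > len(adversaries):
        raise ValueError("need at least one prior adversary for this generation")
    if topology == "arms_race":
        # Protagonist n faces adversary n-1 only; older ones are discarded.
        return [adversaries[generation - 1]]
    if topology == "ssp":
        # Zoo of all prior adversaries; the caller samples uniformly from it.
        return list(adversaries[:generation])
    raise ValueError(f"unknown topology: {topology}")
```

For example, with three frozen adversaries, the generation-3 protagonist sees one opponent under Arms Race and all three under SSP.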

The Prerequisite: Catastrophic Forgetting

Zoo sampling only helps when catastrophic forgetting is present.

Specifically, the zoo helps only in games where past adversaries perform better against later protagonists than later adversaries do. This is the signature of catastrophic forgetting: the protagonist "forgets" how to handle old strategies as it specializes against the current opponent.

The Structural Test

Does the game's Nash equilibrium coincide with the best response to weak/random opponents?

Yes → Zoo helps

Training against weak historical opponents still pushes toward the equilibrium. Diversity provides a historical anchor that prevents cycling.

No → Zoo hurts

Best response to weak opponents is exploitative, not Nash-seeking. Zoo of weak opponents teaches the wrong strategy entirely.

This test predicts whether adversarial diversity is a feature or a bug — before you run a single experiment.
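For a one-shot matrix game the structural test can be checked numerically. The sketch below does so for RPS, assuming the standard +1/0/−1 payoffs. The helper `is_best_response` is a name introduced here, and an extensive-form game such as Kuhn Poker would need a different representation.

```python
import numpy as np

# Row player's payoff in RPS; rows and columns ordered Rock, Paper, Scissors.
RPS = np.array([[0, -1, 1],
                [1, 0, -1],
                [-1, 1, 0]], dtype=float)

def is_best_response(payoff, strategy, opponent, tol=1e-9):
    """True if `strategy` earns the maximum achievable payoff against `opponent`."""
    action_values = payoff @ opponent   # expected payoff of each pure action
    return bool(strategy @ action_values >= action_values.max() - tol)

nash = np.full(3, 1 / 3)                 # uniform mixing is the RPS Nash equilibrium
random_opponent = np.full(3, 1 / 3)      # a uniformly random opponent
always_rock = np.array([1.0, 0.0, 0.0])  # a weak, fully predictable opponent
```

Against the uniformly random opponent every action earns zero, so the Nash strategy is among the best responses. Against `always_rock` the best response is pure Paper, and the Nash strategy is no longer a best response, which illustrates how a weak opponent can pull training toward exploitation.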

Cross-Domain Evidence

I tested the A parameter across three domains spanning discrete games and continuous control: Rock-Paper-Scissors (RPS), Kuhn Poker, and Tag. The RPS and Kuhn Poker results split cleanly along the forgetting prerequisite, and the contrast between them is what motivates Forgetting Regret as a predictive metric.

Rock-Paper-Scissors — Zoo Helps

Self-play in RPS causes strategy cycling (Rock → Paper → Scissors → Rock) with no convergence to Nash equilibrium. The gauntlet matrix below shows each checkpoint played against every other — the banded structure reveals cycling where early checkpoints periodically beat later ones.

RPS gauntlet matrix showing cycling pattern under self-play

RPS gauntlet matrix (A=0). Each cell shows protagonist checkpoint (row) win rate vs adversary checkpoint (column). The banded red/blue structure shows strategy cycling — later checkpoints periodically lose to earlier ones. This is the signature of catastrophic forgetting in RPS.
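A gauntlet matrix like the one described above can be assembled by evaluating every saved protagonist checkpoint against every saved adversary checkpoint. This is a sketch only: `evaluate` stands in for whatever match-playing routine a given setup provides and is not a function from the study.

```python
import numpy as np

def gauntlet_matrix(protagonists, adversaries, evaluate):
    """W[i, j] = evaluate(protagonists[i], adversaries[j]), e.g. a win rate."""
    W = np.empty((len(protagonists), len(adversaries)))
    for i, protagonist in enumerate(protagonists):
        for j, adversary in enumerate(adversaries):
            W[i, j] = evaluate(protagonist, adversary)
    return W
```

Rows are ordered by protagonist training time and columns by adversary training time, so banding across rows shows up as later checkpoints losing to earlier ones.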

RPS: Exploitability and entropy vs A parameter

Exploitability vs A. PPO (memoryless, blue) shows high variance and degradation at extreme A values. Buffered (replay buffer, orange) maintains low exploitability across all A values. The replay buffer acts as internal memory, reducing dependence on environmental diversity.

RPS strategy trajectories on simplex at different A values

Strategy trajectories on the simplex. At A=0 (self-play), PPO cycles wildly around Nash. As A increases, trajectories tighten around the Nash equilibrium (center). By A=0.7–0.9, strategies converge tightly. Dark dots = later training, + = Nash.

Key finding

PPO's A-curve is steeper than Buffered's — memoryless algorithms need more environmental diversity. At moderate A (0.7–0.8), PPO achieves near-zero exploitability. Buffered stays low regardless, because its replay buffer provides the historical anchor internally.
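For readers who want to reproduce the exploitability axis in RPS, one common definition is the payoff a best-responding opponent obtains against the agent's mixed strategy. The sketch below uses that definition as an assumption, since the exact measure used in the plots is not spelled out here.

```python
import numpy as np

# Row player's payoff in RPS; rows and columns ordered Rock, Paper, Scissors.
RPS = np.array([[0, -1, 1],
                [1, 0, -1],
                [-1, 1, 0]], dtype=float)

def exploitability(strategy, payoff=RPS):
    """Best payoff any pure opponent action achieves against `strategy`."""
    strategy = np.asarray(strategy, dtype=float)
    # The game is symmetric with value 0, so the opponent's best payoff is the gap to Nash.
    return float((payoff @ strategy).max())
```

The uniform strategy has exploitability 0, a pure strategy has exploitability 1, and a strategy that over-plays Rock at 50% has exploitability 0.25.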

Kuhn Poker — Zoo Hurts (2.6x Worse)

The critical negative result. In Kuhn Poker, there is no catastrophic forgetting — the gauntlet matrix shows a smooth gradient where later checkpoints consistently beat earlier ones. No cycling, no collapse. Contrast this with the banded RPS gauntlet above.

Kuhn Poker gauntlet matrix showing monotonic improvement (no forgetting)

Kuhn Poker gauntlet (A=0, 5-seed average). A smooth gradient from green (lower-left: later agents beat earlier opponents) to red (upper-right: earlier agents lose to later opponents). No banding, no cycling — later checkpoints are monotonically stronger. This is the signature of no catastrophic forgetting.

Kuhn Poker: Exploitability decreases monotonically over training

Monotonic convergence. Exploitability decreases steadily for all A values. Self-play (A=0, gray) converges fastest. Higher A values slow convergence by diluting the co-evolutionary signal with weak historical opponents.

Kuhn Poker: Exploitability increases monotonically with A

Kuhn Poker A-curve. Exploitability rises monotonically with A for both PPO and Buffered. A=0 (self-play) achieves the lowest exploitability. At A=0.9, exploitability nearly triples. PPO and Buffered are identical — memory capacity is irrelevant when forgetting isn't the bottleneck.

Every RPS finding inverts

Where RPS showed PPO steeper than Buffered (memory matters), Kuhn Poker shows them identical (memory irrelevant). Where RPS showed an optimal mixing probability A* > 0, Kuhn Poker shows A* = 0. The prerequisite, catastrophic forgetting, determines which regime you're in.

This contrast motivates Forgetting Regret

RPS has forgetting and zoo helps. Kuhn Poker has no forgetting and zoo hurts. If we could measure forgetting from a single baseline run, we could predict whether zoo sampling is worth trying — without running an expensive A-sweep. That metric is Forgetting Regret (FR).

Tag (HPO study, 150 runs) — Zoo Has No Effect

A continuous-action tag game where a seeker chases a hider in a bounded arena with obstacles. An initial study with default hyperparameters (2,800 runs, 20 configurations) suggested zoo sampling helped in 18/20 configs. A follow-up HPO study with Optuna-optimized hyperparameters (150 runs: 5 reward presets × 2 algorithms × 5 A values × 3 seeds, plus a 2,500-matchup cross-evaluation gauntlet) overturned this result.

Animated tag gameplay: seeker (red) chasing hider (blue) around obstacles

Tag gameplay. The seeker (red) chases the hider (blue) in a bounded arena with obstacles and a central safe zone.

A has no effect on agent strength or forgetting

With Optuna-tuned hyperparameters, pure self-play (A=0) produces agents just as strong as full zoo training (A=1) for both PPO and SAC. The initial positive result was a confound: default hyperparameters held back PPO, making zoo sampling appear helpful when it was actually compensating for bad optimization.

Algorithm choice is the real signal

SAC dominates PPO 95-to-2 in cross-algorithm play, regardless of A value. SAC agents appear to "fail" during training (15% seeker win rate, oscillating) but produce dramatically stronger transferable policies. Training seeker win rate (SWR) is a poor proxy for agent quality.

Massive forgetting, but zoo doesn't help

SAC exhibits FR=0.357 (100% of runs show substantial forgetting), yet zoo sampling has no effect. Self-play oscillation in SAC already creates sufficient behavioral diversity naturally — the zoo's historical anchor is redundant.

The Formal Hypothesis — and Its Falsification

Hypothesis: A* is inversely proportional to an algorithm's effective memory capacity — conditional on catastrophic forgetting being present.

Adversarial robustness requires a historical anchor — exposure to past opponent strategies that prevents catastrophic forgetting. This anchor can live in the environment's sampling (A > 0) or the algorithm's memory (replay buffers, cumulative datasets). More internal memory → less environmental sampling needed → A* moves toward 0.

Status: Falsified

The Tag HPO study provides a decisive counterexample. SAC exhibits massive catastrophic forgetting (FR=0.357, 100% of runs) yet A has zero effect on agent strength or forgetting regret. The forgetting prerequisite is necessary but not sufficient. Self-play oscillation in SAC creates enough behavioral diversity internally that the zoo's historical anchor is redundant. The initial positive result in Tag was a hyperparameter confound, not evidence for the theory.

The hypothesis held in RPS (a toy domain where forgetting manifests as simple cycling) and correctly predicted the Kuhn Poker negative result (no forgetting → zoo hurts). But it fails in Tag, where forgetting is present yet zoo sampling offers no benefit. The likely explanation: algorithm architecture matters more than training curriculum. SAC's replay buffer and off-policy learning create sufficient internal diversity that external zoo sampling is redundant.

Memoryless

PPO

Learns only from trajectories it just collected. In forgetting-prone games, A* is strictly > 0. PPO requires the environment to provide the historical anchor it lacks internally.

Replay Buffer

SAC / Buffered

Replay buffer provides short-term memory — a shock absorber, not a solution. Smoother A-curve than PPO. Still degrades over long horizons due to FIFO eviction and off-policy decay.

Cumulative Dataset

LLMs (LoRA fine-tuning)

Successful attacks appended to a capped FIFO training set (e.g., 200 examples). Same treadmill dynamics as a replay buffer — early successes get evicted as the victim hardens.
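The treadmill effect of a capped FIFO set is easy to see with a bounded queue. In this sketch the cap of 200 comes from the example above, while the 250 integer "attacks" are placeholders rather than real data.

```python
from collections import deque

CAP = 200  # cap taken from the example in the text
training_set = deque(maxlen=CAP)

# 250 placeholder attacks, appended in the order they were discovered.
for attack_id in range(250):
    training_set.append(attack_id)  # once full, the oldest entry is dropped

oldest_kept = training_set[0]
```

After 250 appends the set still holds 200 items, and the first 50 attacks are gone, so the victim is no longer trained against them.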

Domain | Forgetting? | Nash = BR(weak)? | Zoo helps? | PPO = Buffered?
RPS | Yes (cycling) | Yes | Yes | No (PPO steeper)
Kuhn Poker | No | No | No (2.6x worse) | Yes (identical)
Tag (HPO) | Yes (SAC FR=0.36) | (not given) | No (A has no effect) | No (SAC >> PPO)

Forgetting Regret: Predicting When Zoo Helps

Rather than running expensive grid searches over A, I developed Forgetting Regret (FR) — a metric computed from a single self-play baseline (A=0) that predicts whether zoo sampling will help.

How it works

Given a gauntlet matrix W where W[i,j] = protagonist checkpoint i's performance vs adversary checkpoint j:

  1. Compute the running maximum along columns: M[i,j] = max of all W[k,j] for k ≤ i
  2. Pointwise forgetting regret: fr[i,j] = M[i,j] − W[i,j]
  3. FR = mean of all fr[i,j] across the matrix
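The three steps map directly onto a few NumPy calls. This is a minimal sketch assuming W is a 2-D array with rows ordered by protagonist training time and higher values meaning better protagonist performance. The function name is chosen here for illustration.

```python
import numpy as np

def forgetting_regret(W):
    """Mean gap between the best-so-far and current performance per adversary."""
    W = np.asarray(W, dtype=float)
    # Step 1: running max down each column, i.e. over protagonist checkpoints k <= i.
    M = np.maximum.accumulate(W, axis=0)
    # Step 2: pointwise regret, always >= 0 by construction.
    fr = M - W
    # Step 3: average over the whole matrix.
    return float(fr.mean())
```

A matrix whose columns never decrease, as in the Kuhn Poker gauntlet, gives FR = 0. A column such as 0.9, 0.1, 0.9 contributes a regret of 0.8 at the middle checkpoint.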

What it measures

FR directly measures the value of the zoo's best member. For any historical adversary j, M[i,j] is the best performance any protagonist checkpoint up to i achieved against j, that is, the performance you could recover by pulling the best checkpoint from the zoo. FR is the gap, averaged over the matrix, between that counterfactual and what self-play actually delivers.

FR = 0 → no forgetting → A* ≈ 0

FR >> 0 → strong forgetting → high A*
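Written out, the definition above reads as follows. The symbols n and m (the number of protagonist and adversary checkpoints) are introduced here and are not named in the original description.

```latex
M_{ij} = \max_{k \le i} W_{kj}, \qquad
\mathrm{fr}_{ij} = M_{ij} - W_{ij} \;\ge\; 0, \qquad
\mathrm{FR} = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{fr}_{ij}
```

Because each term is non-negative, FR = 0 holds exactly when every column of W is non-decreasing in i, meaning no protagonist checkpoint ever does worse against an adversary than an earlier checkpoint did.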

What the Negative Result Teaches

The A-parameter investigation spanned four domains and thousands of training runs. While the central hypothesis was falsified, the cross-domain comparison produced several durable insights:

Algorithm architecture > training curriculum

SAC dominates PPO 95-to-2 in Tag regardless of zoo mixing. The replay buffer and off-policy updates create sufficient internal diversity that environmental curriculum design is secondary. This suggests that improving the learning algorithm is a more reliable path to robustness than diversifying the training data.

Training metrics can be misleading

SAC's 15% training win rate masks stronger transferable policies. PPO's balanced 40% win rate masks weaker agents. The same caution may apply to LLM safety: a model that appears to behave well during training may be weaker than one that oscillates.

Forgetting is necessary but not sufficient

The forgetting prerequisite correctly predicted the Kuhn Poker result (no forgetting → zoo hurts) but failed in Tag (forgetting present → zoo still irrelevant). Catastrophic forgetting is a symptom, not the root cause. The underlying question is whether the algorithm can generate diversity internally.

Hyperparameter confounds are real

The initial 2,800-run study showed zoo helping in 18/20 Tag configurations. The HPO study with Optuna-tuned parameters overturned this completely. Without hyperparameter optimization, curriculum effects can be artifacts of bad defaults.

Status

Does FR predict A* across the Tag configuration grid?

Answered: No. With Optuna-tuned hyperparameters, A has no effect on agent strength for either PPO or SAC. There is no A* to predict. The earlier correlation between FR and zoo benefit was a hyperparameter confound.

Is catastrophic forgetting sufficient for zoo to help?

Answered: No. SAC in Tag has FR=0.357 (massive forgetting) but zoo has zero effect. The forgetting prerequisite is necessary (Kuhn Poker) but not sufficient (Tag). Self-play oscillation can create enough diversity internally.

What survives from this research program?

The Forgetting Regret metric, the forgetting prerequisite as a necessary condition, the RPS/Kuhn Poker contrast, and the methodological lesson that hyperparameter confounds can masquerade as curriculum effects. The negative result is itself a contribution.