Latest Version:
- Gymnasium 1.2.1 (migrated from OpenAI Gym)
- Stable-Baselines3 2.7.0 (PPO as base algorithm)
- Stable-Retro 0.9.5 (for game emulation)
- PyTorch for deep learning
- Nash equilibrium computation using scipy/ecos
FightLadder is a comprehensive benchmark for competitive multi-agent reinforcement learning, built on Street Fighter II: Special Champion Edition. It provides implementations of various multi-agent RL algorithms including IPPO, League training, PSRO, FSP, and Best Response methods, enabling advanced research in competitive gaming environments.
Platform: Linux
Python: 3.11+ (Required)
conda create -n fightladder python=3.11
conda activate fightladder
# Reinforcement Learning frameworks
pip install gymnasium==1.2.1
pip install stable-retro==0.9.5
pip install stable-baselines3==2.7.0
# Deep learning backend
pip install torch torchvision torchaudio
# Scientific computing and Nash equilibrium solver
pip install numpy scipy ecos
# Image and video processing
pip install pillow av pyglet
For training monitoring and visualization:
pip install tensorboard matplotlib pandas tqdm
For system monitoring:
pip install gpustat nvidia-ml-py psutil
For hyperparameter optimization:
pip install optuna scikit-learn
Alternatively, install the main dependencies in one command:
pip install gymnasium==1.2.1 stable-retro==0.9.5 stable-baselines3==2.7.0 \
torch torchvision torchaudio \
numpy scipy ecos pillow av pyglet \
tensorboard matplotlib pandas tqdm
Note: You can install optional dependencies as needed:
- For system monitoring:
gpustat nvidia-ml-py psutil
- For hyperparameter optimization:
optuna scikit-learn
Note:
- scipy is required for Nash equilibrium computation in main/common/nash.py
- ecos is the solver used for computing Nash equilibrium strategies
- av is needed for video encoding/decoding functionality
import os
import retro
retro_directory = os.path.dirname(retro.__file__)
game_dir = "data/stable/StreetFighterIISpecialChampionEdition-Genesis"
print(os.path.join(retro_directory, game_dir))
ROM Setup Instructions:
- Find the stable-retro game directory:
python3 -c "import os, retro; print(os.path.join(os.path.dirname(retro.__file__), 'data/stable/StreetFighterIISpecialChampionEdition-Genesis'))"- Copy your legally obtained ROM file to that directory as
rom.md - Copy state files from
data/sf/StreetFighterIISpecialChampionEdition-Genesis/to the retro directory - Verify the installation:
python3 -c "import retro; retro.make('StreetFighterIISpecialChampionEdition-Genesis')"Required State Files:
- Champion.Level1.RyuVsGuile.state: Single-player training state
- Champion.RyuVsRyu.2Player.align.state: Two-player training state
- Champion.Select1P.Left.state and Champion.Select1P.Right.state: Character selection states
A scripted version of the copy-and-verify steps is sketched below.
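The sketch below is a convenience only: it assumes it is run from the repository root, that your legally obtained ROM sits there as rom.md, and that the state files live under data/sf/ as described above; adjust the paths to your setup.

```python
import os
import shutil

import retro

# Locate the stable-retro game directory (same path printed by the snippet above).
game_dir = os.path.join(
    os.path.dirname(retro.__file__),
    "data/stable/StreetFighterIISpecialChampionEdition-Genesis",
)

# Assumption: the ROM you legally own is in the current directory as rom.md.
shutil.copy("rom.md", os.path.join(game_dir, "rom.md"))

# Copy the repo's top-level .state files into the same directory.
src_states = "data/sf/StreetFighterIISpecialChampionEdition-Genesis"
for name in os.listdir(src_states):
    if name.endswith(".state"):
        shutil.copy(os.path.join(src_states, name), game_dir)

# Verify the installation.
retro.make("StreetFighterIISpecialChampionEdition-Genesis")
print("ROM and state files installed.")
```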
Disclaimer: We are unable to provide you with any game ROMs. It is the user's own legal responsibility to acquire a game ROM for emulation. This library should only be used for non-commercial research purposes.
The environment is specified in main/common/retro_wrappers.py. It tracks the inner state of the game and is compatible with the Gymnasium interface and popular RL packages such as stable-baselines3.
Algorithms are implemented in main/common/algorithms.py and main/common/league.py. Specifically, the IPPO class in algorithms.py implements the IPPO and 2Timescale methods, while League, PSRO, and FSP are implemented in league.py. We use PPO from stable-baselines3 as the backbone algorithm for all of these implementations. The League implementation adapts the pseudocode in main/common/pseudocode, which comes from the earlier AlphaStar work.
Environment Wrapper (main/common/retro_wrappers.py)
SFWrapper: Main environment wrapper that adapts Retro for RL training (see the usage sketch after this list)
- Handles frame stacking (default: 12 frames with 4-step frames)
- Custom action space with combo support
- Reward shaping with aggressive and dense rewards
- Supports both single-agent and multi-agent modes
- Compatible with Gymnasium interface
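A minimal interaction sketch with the wrapped environment is shown below. The SFWrapper constructor arguments here are illustrative assumptions (check main/common/retro_wrappers.py for the real signature); the Gymnasium-style reset/step loop is the interface the wrapper exposes.

```python
import retro
from common.retro_wrappers import SFWrapper  # assumes you run this from main/

# Assumption: SFWrapper wraps a raw retro env and takes reset/side options;
# the actual constructor may differ, see retro_wrappers.py.
raw_env = retro.make(
    "StreetFighterIISpecialChampionEdition-Genesis",
    state="Champion.Level1.RyuVsGuile",
    players=1,
)
env = SFWrapper(raw_env, reset_type="round", side="left")

obs, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # random policy, for illustration only
    obs, reward, terminated, truncated, info = env.step(action)
env.close()
```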
Algorithms (main/common/algorithms.py)
- IPPO: Independent PPO for multi-agent training with asymmetric learning rates
- LeaguePPO: PPO variant for League training with historical checkpoint management
- Both extend Stable-Baselines3 PPO with custom training loops
League Training (main/common/league.py)
- LeagueManager: Core League training logic from the AlphaStar pseudocode
- Payoff: Tracks win/loss statistics between policies
- NashEquilibriumECOSSolver: Computes Nash equilibria using the ECOS solver
- Supports PSRO (Policy-Space Response Oracles) and FSP (Fictitious Self-Play) variants
Training Scripts:
- train.py: Single-agent training vs built-in CPU
- ippo.py: Multi-agent IPPO/2Timescale training
- train_ma.py: League/PSRO/FSP training
- best_response.py: Exploiter training against a fixed opponent
- finetune.py: Curriculum learning
- evaluate_elo.py: ELO rating system for policy evaluation
Utilities (main/common/utils.py):
- SubprocVecEnv2P: Custom vectorized environment for multi-agent training
- VecTransposeImage2P: Image transposition for 2-player observations
- linear_schedule: Learning rate scheduling (see the sketch after this list)
- AnnealDenseCallback/AnnealAgressiveCallback: Reward shaping
- get_agent_enemy_hp: HP tracking for both perspectives
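For reference, linear_schedule in Stable-Baselines3-style code is usually a closure over the remaining training progress. The sketch below shows that common pattern; the repo's version in main/common/utils.py may differ in detail.

```python
from typing import Callable

def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Decay a hyperparameter linearly from initial_value to 0.

    Stable-Baselines3 calls the schedule with progress_remaining, which
    goes from 1.0 at the start of training to 0.0 at the end.
    """
    def schedule(progress_remaining: float) -> float:
        return progress_remaining * initial_value
    return schedule

# Illustrative usage:
# model = PPO("CnnPolicy", env, learning_rate=linear_schedule(2.5e-4))
```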
Action Space:
- Base: 12 discrete actions (directions + attacks)
- With combos: 15 actions (12 + 3 combo bits)
- transform-action=True: Converts to a MultiDiscrete space (recommended)
- Combos are encoded as binary sequences that get mapped to button presses (see the encoding sketch below)
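To make the encoding concrete, the sketch below shows the general idea: a MultiDiscrete action carries 12 binary button slots plus 3 combo bits, and each set combo bit expands into a scripted sequence of button-press frames. The button indices and combo table are placeholders, not the wrapper's actual tables.

```python
import numpy as np
from gymnasium.spaces import MultiDiscrete

# Assumption: 12 binary button slots + 3 combo bits, as described above.
action_space = MultiDiscrete([2] * 15)

# Hypothetical combo table: each combo bit expands into a short sequence of
# 12-dim button-press arrays executed over consecutive emulator frames.
COMBOS = [
    [np.eye(12, dtype=np.int8)[i] for i in (5, 7, 0)],  # combo bit 0 (placeholder)
    [np.eye(12, dtype=np.int8)[i] for i in (5, 6, 0)],  # combo bit 1 (placeholder)
    [np.eye(12, dtype=np.int8)[i] for i in (4, 4, 1)],  # combo bit 2 (placeholder)
]

def to_button_presses(action: np.ndarray) -> list[np.ndarray]:
    """Expand a 15-dim MultiDiscrete action into per-frame button presses."""
    presses = [action[:12].astype(np.int8)]
    for bit, sequence in zip(action[12:], COMBOS):
        if bit:
            presses.extend(sequence)
    return presses

frames = to_button_presses(np.asarray(action_space.sample()))
```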
Observation Space:
- Stacked frames: Default 12 frames, downsampled by factor of 2
- Frame skipping: 8 steps by default
- Shape: (100, 128, 6) after stacking and downsampling
Reward Structure:
- Base: Dense reward from damage dealt and distance management
- Aggressive bonus: Encourages forward movement and attacks
- Rewards anneal over training to transition to sparse rewards (see the callback sketch after this list)
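The annealing is implemented by the AnnealDenseCallback/AnnealAgressiveCallback callbacks listed under Utilities. The sketch below captures the idea only; set_dense_coef is a hypothetical setter standing in for whatever hook the real wrapper exposes.

```python
from stable_baselines3.common.callbacks import BaseCallback

class AnnealRewardCallback(BaseCallback):
    """Linearly anneal a shaped-reward coefficient toward 0 (sparse reward)."""

    def __init__(self, total_timesteps: int, start_coef: float = 1.0):
        super().__init__()
        self.total_timesteps = total_timesteps
        self.start_coef = start_coef

    def _on_step(self) -> bool:
        progress = min(self.num_timesteps / self.total_timesteps, 1.0)
        coef = self.start_coef * (1.0 - progress)
        # Hypothetical setter on the wrapped envs; the real callback in
        # main/common/utils.py uses the wrapper's own API.
        self.training_env.env_method("set_dense_coef", coef)
        return True
```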
Reset Types:
- round: Reset after each round (fastest)
- match: Reset after a 2-round match
- game: Reset after full game completion
Game state files are stored in data/sf/StreetFighterIISpecialChampionEdition-Genesis/:
- .state files: Game state snapshots for consistent training
- stars/: Sub-directory for star-based difficulty states
- curriculum/: Curriculum learning state files
Required state files:
- Champion.Level1.RyuVsGuile.state: Single-player state
- Champion.RyuVsRyu.2Player.align.state: Two-player training state
- Champion.Select1P.Left.state and Champion.Select1P.Right.state: Character selection
Generate and refresh state files:
# Generate star state files
python generate_star_states.py
# Refresh star state files
python refresh_star_states.py
All training commands below are run from the main/ directory:
cd main
The commands organize training outputs by algorithm type for better clarity:
main/
├── trained_models/
│ ├── ppo_single_agent/ # Single-agent PPO vs CPU
│ │ ├── ppo_ryu_left_star1/
│ │ ├── ppo_ryu_left_star8/
│ │ └── ppo_ryu_right_star8/
│ ├── curriculum/ # Curriculum learning
│ ├── ippo/ # IPPO and 2Timescale
│ │ ├── ippo_ryu_2p_scale_1_0/
│ │ └── ippo_ryu_2p_scale_0_5/
│ ├── league/ # League training
│ ├── psro/ # PSRO training
│ ├── fsp/ # FSP training
│ └── best_response/ # Best response / exploiter
├── logs/ # Same structure as trained_models/
├── videos/ # Same structure as trained_models/
└── finetune/ # Same structure as trained_models/
This organization makes it easier to:
- 🔍 Compare different algorithms
- 📊 Organize experiments systematically
- 🚀 Scale to multiple runs with different seeds
# level: arcade opponent (1-15)
# star: CPU difficulty (1-8)
# side: left or right
python train.py --reset=round \
--level=${level} \
--star=${star} \
--side=${side} \
--model-name-prefix=ppo_ryu_${side}_L${level}_S${star} \
--save-dir=trained_models/ppo_single_agent/ppo_ryu_${side}_L${level}_S${star} \
--log-dir=logs/ppo_single_agent/ppo_ryu_${side}_L${level}_S${star} \
--video-dir=videos/ppo_single_agent/ppo_ryu_${side}_L${level}_S${star} \
--num-epoch=50 \
--enable-combo --null-combo --transform-action
Example (train the left agent on level 3, star 2):
python train.py --reset=round \
--level=3 \
--star=2 \
--side=left \
--model-name-prefix=ppo_ryu_left_L3_S2 \
--save-dir=trained_models/ppo_single_agent/ppo_ryu_left_L3_S2 \
--log-dir=logs/ppo_single_agent/ppo_ryu_left_L3_S2 \
--video-dir=videos/ppo_single_agent/ppo_ryu_left_L3_S2 \
--num-epoch=50 \
--num-env=32 \
--enable-combo --null-combo --transform-action
You can still pass --state manually for custom checkpoints. When --level is provided, the script auto-resolves the appropriate state file under data/sf/StreetFighterIISpecialChampionEdition-Genesis/stars/.
python finetune.py --reset=round \
--model-name-prefix=ppo_ryu_curriculum \
--save-dir=trained_models/curriculum/ppo_ryu_curriculum \
--log-dir=logs/curriculum/ppo_ryu_curriculum \
--video-dir=videos/curriculum/ppo_ryu_curriculum \
--finetune-dir=finetune/curriculum/ppo_ryu_curriculum \
--num-epoch=25 \
--enable-combo --null-combo --transform-action
# Replace ${task}, ${scale}, and ${seed} with actual values
# task: round, match, or game
# scale: 1 (IPPO) or other values for 2Timescale (e.g., 0.5, 2.0)
# seed: random seed (e.g., 0, 1, 2)
python ippo.py --reset=round \
--model-name-prefix=ippo_ryu_2p_scale_${scale}_${seed} \
--save-dir=trained_models/ippo/ippo_ryu_2p_scale_${scale}_${seed} \
--log-dir=logs/ippo/ippo_ryu_2p_scale_${scale}_${seed} \
--video-dir=videos/ippo/ippo_ryu_2p_scale_${scale}_${seed} \
--finetune-dir=finetune/ippo/ippo_ryu_2p_scale_${scale}_${seed} \
--num-epoch=50 \
--enable-combo --null-combo --transform-action \
--other-timescale=${scale} \
--seed=${seed}
Example (IPPO with scale=1):
python ippo.py --reset=round \
--model-name-prefix=ippo_ryu_2p_scale_1_0 \
--save-dir=trained_models/ippo/ippo_ryu_2p_scale_1_0 \
--log-dir=logs/ippo/ippo_ryu_2p_scale_1_0 \
--video-dir=videos/ippo/ippo_ryu_2p_scale_1_0 \
--finetune-dir=finetune/ippo/ippo_ryu_2p_scale_1_0 \
--num-epoch=50 \
--num-env=16 \
--enable-combo --null-combo --transform-action \
--other-timescale=1 \
--seed=0
# Basic League training (requires pre-trained initial policies)
python train_ma.py --reset=round \
--save-dir=trained_models/league/league_ryu_seed_0 \
--log-dir=logs/league/league_ryu_seed_0 \
--left-model-file=trained_models/ppo_single_agent/ppo_ryu_left_star8/ppo_ryu_left_star8_final_steps \
--right-model-file=trained_models/ppo_single_agent/ppo_ryu_right_star8/ppo_ryu_right_star8_final_steps \
--enable-combo --null-combo --transform-action \
--seed=0
# For PSRO: add --psro-league flag
python train_ma.py --reset=round \
--save-dir=trained_models/psro/psro_ryu_seed_0 \
--log-dir=logs/psro/psro_ryu_seed_0 \
--left-model-file=trained_models/ppo_single_agent/ppo_ryu_left_star8/ppo_ryu_left_star8_final_steps \
--right-model-file=trained_models/ppo_single_agent/ppo_ryu_right_star8/ppo_ryu_right_star8_final_steps \
--enable-combo --null-combo --transform-action \
--psro-league \
--seed=0
# For FSP: add --fsp-league flag
python train_ma.py --reset=round \
--save-dir=trained_models/fsp/fsp_ryu_seed_0 \
--log-dir=logs/fsp/fsp_ryu_seed_0 \
--left-model-file=trained_models/ppo_single_agent/ppo_ryu_left_star8/ppo_ryu_left_star8_final_steps \
--right-model-file=trained_models/ppo_single_agent/ppo_ryu_right_star8/ppo_ryu_right_star8_final_steps \
--enable-combo --null-combo --transform-action \
--fsp-league \
--seed=0
# Train an exploiter against a fixed opponent policy
# Use --update-right=0 to freeze the right policy (exploit it)
# Use --update-left=0 to freeze the left policy (exploit it)
python best_response.py --reset=round \
--model-name-prefix=br_opponent1_seed_0 \
--save-dir=trained_models/best_response/opponent1/seed_0 \
--log-dir=logs/best_response/opponent1/seed_0 \
--video-dir=videos/best_response/opponent1/seed_0 \
--finetune-dir=finetune/best_response/opponent1/seed_0 \
--model-file=path/to/opponent_model \
--num-epoch=50 \
--enable-combo --null-combo --transform-action \
--update-right=0 \
--seed=0
# Alternative: Load separate left and right policies
python best_response.py --reset=round \
--model-name-prefix=br_mixed_seed_0 \
--save-dir=trained_models/best_response/mixed/seed_0 \
--log-dir=logs/best_response/mixed/seed_0 \
--left-model-file=path/to/left_model \
--right-model-file=path/to/right_model \
--num-epoch=50 \
--enable-combo --null-combo --transform-action \
--update-right=0 \
--seed=0
# Interactive play mode
# Edit MODEL_PATH in play_with_ai.py before running
# Key mappings are defined in common/interactive.py
python play_with_ai.py
Note: You can integrate your own games by implementing a wrapper environment similar to main/common/retro_wrappers.py.
Common Arguments:
- --reset: Determines when to reset the environment
  - round: Reset after each round (fastest training)
  - match: Reset after a 2-round match
  - game: Reset after full game completion
- --enable-combo: Enables special move combos in the action space
- --null-combo: Adds a null combo action (no special move)
- --transform-action: Transforms the action space to MultiDiscrete (recommended)
Model Management:
- Models are saved periodically during training
- Final model: {save-dir}/{model-name-prefix}_final_steps
- Use --model-file to resume training from a checkpoint
Multi-Agent Specific:
- --other-timescale: Learning rate scale for the second agent (2Timescale method)
  - Value 1.0 = IPPO (both agents learn at the same rate)
  - Value < 1.0 = Second agent learns slower
  - Value > 1.0 = Second agent learns faster
- --update-left / --update-right: Control which agent to train (0 = freeze, 1 = train)
The IPPO class supports asymmetric learning rates:
- update_left/update_right: Boolean flags to freeze specific agents
- other_learning_rate: Learning rate scale for the second agent (see the two-learner sketch after this list)
- Scale = 1.0: Standard IPPO (symmetric learning)
- Scale < 1.0: Second agent learns slower
- Scale > 1.0: Second agent learns faster
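Conceptually, the 2Timescale setting just gives the two learners different learning rates. The repo does this inside its IPPO class; the sketch below illustrates the idea with two plain Stable-Baselines3 PPO learners on a toy environment and is not the repo's actual training loop.

```python
import gymnasium as gym
from stable_baselines3 import PPO

base_lr = 2.5e-4
scale = 0.5  # --other-timescale: the second learner trains at half the rate

# Toy single-agent envs stand in for the two sides of the 2-player wrapper.
left_env = gym.make("CartPole-v1")
right_env = gym.make("CartPole-v1")

left = PPO("MlpPolicy", left_env, learning_rate=base_lr)
right = PPO("MlpPolicy", right_env, learning_rate=base_lr * scale)

# In IPPO both learners are updated on their own trajectories each iteration;
# scale = 1.0 recovers symmetric IPPO, any other value gives 2Timescale.
```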
League implementation follows the AlphaStar pseudocode in main/common/pseudocode/:
- alphastar.py: Core League algorithm pseudocode
- multiagent.py: Multi-agent extensions
- rl.py: RL-specific utilities
- supervised.py: Supervised learning components
Training logs are saved to the --log-dir directory and can be visualized with TensorBoard:
tensorboard --logdir=logs/
Videos of agent performance are saved to --video-dir during evaluation phases.
- Single-Agent vs CPU (50 epochs): ~4-8 hours on modern GPU
- IPPO (50 epochs): ~8-12 hours on modern GPU
- League Training: Varies significantly based on configuration
- Minimum: NVIDIA GTX 1060 (6GB VRAM)
- Recommended: NVIDIA RTX 3060 or better (12GB+ VRAM)
- For League training: RTX 3080 or better recommended
- Reduce --num-env if running out of memory
- Frame stacking and frame skipping reduce per-step computation
- Use transform-action=True for more efficient action processing
- Each agent trains independently with PPO
- Supports asymmetric learning rates via other_timescale
- Environment is single-environment, multi-agent (not separate envs)
- Maintains a league of policies (main, league, and historical ancestors)
- Uses Nash equilibrium for match selection
- Payoff matrix tracks win rates between all policies (a minimal tracker is sketched after this list)
- Periodically adds new policies based on performance
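For intuition, a minimal win-rate tracker in the spirit of the Payoff class might look like the sketch below; the real class in main/common/league.py keeps more bookkeeping and feeds its matrix to the Nash solver.

```python
from collections import defaultdict

class SimplePayoff:
    """Track pairwise results and expose empirical win rates."""

    def __init__(self):
        self.wins = defaultdict(int)
        self.games = defaultdict(int)

    def update(self, player: str, opponent: str, player_won: bool) -> None:
        self.games[(player, opponent)] += 1
        self.wins[(player, opponent)] += int(player_won)

    def winrate(self, player: str, opponent: str) -> float:
        n = self.games[(player, opponent)]
        return self.wins[(player, opponent)] / n if n else 0.5  # even prior

payoff = SimplePayoff()
payoff.update("main_v3", "exploiter_v1", player_won=True)
print(payoff.winrate("main_v3", "exploiter_v1"))  # 1.0 after one game
```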
- Extension of IPPO with different learning rates
- Can have one agent learn faster than the other
- Useful for asymmetric game scenarios
- Trains an agent to exploit a fixed opponent policy
- Use --update-left=0 or --update-right=0 to freeze opponents
- Supports loading separate left/right policies
- compute_nash(): Nash equilibrium computation using the ECOS solver (an LP-based sketch follows this list)
- Used in League training for strategy selection
- Handles payoff matrices from multi-agent interactions
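The repo's compute_nash() relies on ECOS; the same quantity can also be obtained from a plain linear program. The sketch below computes the row player's Nash (maxmin) strategy for a two-player zero-sum payoff matrix with scipy.optimize.linprog, as an illustrative alternative rather than the repo's solver.

```python
import numpy as np
from scipy.optimize import linprog

def zero_sum_nash(payoff: np.ndarray) -> np.ndarray:
    """Row player's maxmin (Nash) strategy for a zero-sum payoff matrix.

    payoff[i, j] is the row player's expected payoff when playing row i
    against column j (e.g. a win-rate matrix re-centred around 0).
    """
    n_rows, n_cols = payoff.shape
    # Variables: x_1..x_n (mixed strategy), v (game value). Minimise -v.
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0
    # For every opponent column j: v - sum_i payoff[i, j] * x_i <= 0.
    A_ub = np.hstack([-payoff.T, np.ones((n_cols, 1))])
    b_ub = np.zeros(n_cols)
    # Probabilities sum to one.
    A_eq = np.append(np.ones(n_rows), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_rows + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:-1]

# Rock-paper-scissors sanity check: the Nash strategy is uniform.
rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
print(zero_sum_nash(rps))  # ~[1/3, 1/3, 1/3]
```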
python train.py --model-file=path/to/existing/model \
--save-dir=new/save/directory ...
- Models are saved as PyTorch .zip files
- Compatible with stable_baselines3.PPO.load()
- Use evaluate() functions for policy evaluation
Final model format: {save-dir}/{model-name-prefix}_final_steps
Models are saved at regular intervals (controlled by --save-freq in callbacks).
from stable_baselines3 import PPO
model = PPO.load("path/to/model.zip")
obs, info = env.reset()
action, _ = model.predict(obs, deterministic=True)
Q: ImportError: No module named 'gym'
A: This project now uses Gymnasium. Make sure you have installed gymnasium==1.2.1 instead of the old gym package.
Q: Environment reset returns tuple instead of observation
A: This is expected behavior in Gymnasium. The code handles both formats automatically.
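If you ever mix older Gym-style environments with Gymnasium ones in your own code, a small shim like the one below (a generic pattern, not code from this repo) normalizes the reset signature.

```python
def reset_compat(env, **kwargs):
    """Return (obs, info) whether env follows the old Gym or the Gymnasium API."""
    result = env.reset(**kwargs)
    if isinstance(result, tuple) and len(result) == 2:
        return result   # Gymnasium: (observation, info)
    return result, {}   # Old Gym: observation only
```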
Q: Render mode errors / TypeError: render() got an unexpected keyword argument 'mode'
A: The render API has been updated. The code now correctly uses render_mode during environment creation. Make sure you're using the latest version of the code.
Q: Python version compatibility
A: Python 3.11+ is required. The project is configured to work with Python 3.11 and later versions.
Q: ROM not found error
A: Make sure you have properly set up the ROM file as rom.md in the stable-retro game directory. Follow the ROM setup instructions in the Setup section above.
Q: CUDA/GPU not detected
A: Install PyTorch with CUDA support:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Q: Training is very slow
A:
- Ensure the GPU is being used (check with nvidia-smi)
- Reduce --num-env if running out of memory
- Consider using --reset=round for faster training iterations
Q: Model files are very large
A: This is normal. PPO models with CNN backbones can be several hundred MB. Consider cleaning up intermediate checkpoints if disk space is limited.
Q: Video recording hangs
A: Video recording can slow down training. Set record=False in evaluation for faster testing.
Q: Migration from old code
A: The project has migrated from OpenAI Gym to Gymnasium. Key changes include:
- env.reset() now returns an (observation, info) tuple
- env.step() returns (observation, reward, terminated, truncated, info)
- Use the render_mode parameter during environment creation instead of mode
Q: How should I organize my training results?
A: We recommend using the algorithm-based directory structure as shown in the "Directory Structure for Training Results" section above. This organizes models, logs, and videos by algorithm type (ppo_single_agent, ippo, league, etc.).
Q: Gymnasium API differences from OpenAI Gym
A: The project now uses the Gymnasium API:
# Old (Gym)
obs = env.reset()
obs, reward, done, info = env.step(action)
# New (Gymnasium)
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated
If you find our repo useful, please consider citing our work:
@inproceedings{lifightladder,
title={FightLadder: A Benchmark for Competitive Multi-Agent Reinforcement Learning},
author={Li, Wenzhe and Ding, Zihan and Karten, Seth and Jin, Chi},
booktitle={Forty-first International Conference on Machine Learning}
}
This architecture enables comprehensive multi-agent RL research in competitive gaming environments, with flexible training modes and thorough evaluation capabilities. FightLadder provides:
- Modular Design: Each component (environment, algorithms, training scripts) is independently extensible
- Multiple Training Paradigms: From simple single-agent training to complex League-based approaches
- Research-Ready: Nash equilibrium computation, ELO rating system, and comprehensive logging
- Production-Ready: Robust error handling, memory optimization, and scalable training workflows
Whether you're conducting academic research or developing AI for competitive gaming, FightLadder provides the tools and infrastructure needed for advanced multi-agent reinforcement learning experiments.