Latest Version:
- Gymnasium 1.2.1 (migrated from OpenAI Gym)
- Stable-Baselines3 2.7.0 (PPO as base algorithm)
- Stable-Retro 0.9.5 (for game emulation)
- PyTorch for deep learning
- Nash equilibrium computation using scipy/ecos
FightLadder is a comprehensive benchmark for competitive multi-agent reinforcement learning, built on Street Fighter II: Special Champion Edition. It provides implementations of various multi-agent RL algorithms including IPPO, League training, PSRO, FSP, and Best Response methods, enabling advanced research in competitive gaming environments.
Platform: Linux
Python: 3.11+ (Required)
conda create -n fightladder python=3.11
conda activate fightladder
# Reinforcement Learning frameworks
pip install gymnasium==1.2.1
pip install stable-retro==0.9.5
pip install stable-baselines3==2.7.0
# Deep learning backend
pip install torch torchvision torchaudio
# Scientific computing and Nash equilibrium solver
pip install numpy scipy ecos
# Image and video processing
pip install pillow av pyglet
For training monitoring and visualization:
pip install tensorboard matplotlib pandas tqdm
For system monitoring:
pip install gpustat nvidia-ml-py psutil
For hyperparameter optimization:
pip install optuna scikit-learn
Alternatively, install the main dependencies in one command:
pip install gymnasium==1.2.1 stable-retro==0.9.5 stable-baselines3==2.7.0 \
torch torchvision torchaudio \
numpy scipy ecos pillow av pyglet \
tensorboard matplotlib pandas tqdm
Note: You can install optional dependencies as needed:
- For system monitoring:
gpustat nvidia-ml-py psutil
- For hyperparameter optimization:
optuna scikit-learn
Note:
- scipy is required for Nash equilibrium computation in main/common/nash.py
- ecos is the solver used for computing Nash equilibrium strategies
- av is needed for video encoding/decoding functionality
import os
import retro
retro_directory = os.path.dirname(retro.__file__)
game_dir = "data/stable/StreetFighterIISpecialChampionEdition-Genesis"
print(os.path.join(retro_directory, game_dir))
ROM Setup Instructions:
- Find the stable-retro game directory:
python3 -c "import os, retro; print(os.path.join(os.path.dirname(retro.__file__), 'data/stable/StreetFighterIISpecialChampionEdition-Genesis'))"- Copy your legally obtained ROM file to that directory as
rom.md - Copy state files from
data/sf/StreetFighterIISpecialChampionEdition-Genesis/to the retro directory - Verify the installation:
python3 -c "import retro; retro.make('StreetFighterIISpecialChampionEdition-Genesis')"Required State Files:
- Champion.Level1.RyuVsGuile.state: Single-player training state
- Champion.RyuVsRyu.2Player.align.state: Two-player training state
- Champion.Select1P.Left.state and Champion.Select1P.Right.state: Character selection states
A scripted version of the copy-and-verify steps is sketched below.
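The sketch below is a convenience only: it assumes it is run from the repository root, that your legally obtained ROM sits there as rom.md, and that the state files live under data/sf/ as described above; adjust the paths to your setup.

```python
import os
import shutil

import retro

# Locate the stable-retro game directory (same path printed by the snippet above).
game_dir = os.path.join(
    os.path.dirname(retro.__file__),
    "data/stable/StreetFighterIISpecialChampionEdition-Genesis",
)

# Assumption: the ROM you legally own is in the current directory as rom.md.
shutil.copy("rom.md", os.path.join(game_dir, "rom.md"))

# Copy the repo's top-level .state files into the same directory.
src_states = "data/sf/StreetFighterIISpecialChampionEdition-Genesis"
for name in os.listdir(src_states):
    if name.endswith(".state"):
        shutil.copy(os.path.join(src_states, name), game_dir)

# Verify the installation.
retro.make("StreetFighterIISpecialChampionEdition-Genesis")
print("ROM and state files installed.")
```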
Disclaimer: We are unable to provide you with any game ROMs. It is the user's own legal responsibility to acquire a game ROM for emulation. This library should only be used for non-commercial research purposes.
The environment is specified in main/common/retro_wrappers.py. It tracks the inner state of the game and is compatible with the Gymnasium interface and popular RL packages such as stable-baselines3.
Algorithms are implemented in main/common/algorithms.py and main/common/league.py. Specifically, the IPPO class in algorithms.py implements the IPPO and 2Timescale methods, while League, PSRO, and FSP are implemented in league.py. We use PPO from stable-baselines3 as the backbone algorithm for all of these implementations. The League implementation adapts the pseudocode in main/common/pseudocode, which comes from the earlier AlphaStar work.
Environment Wrapper (main/common/retro_wrappers.py)
SFWrapper: Main environment wrapper that adapts Retro for RL training (see the usage sketch after this list)
- Handles frame stacking (default: 12 frames with 4-step frames)
- Custom action space with combo support
- Reward shaping with aggressive and dense rewards
- Supports both single-agent and multi-agent modes
- Compatible with Gymnasium interface
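A minimal interaction sketch with the wrapped environment is shown below. The SFWrapper constructor arguments here are illustrative assumptions (check main/common/retro_wrappers.py for the real signature); the Gymnasium-style reset/step loop is the interface the wrapper exposes.

```python
import retro
from common.retro_wrappers import SFWrapper  # assumes you run this from main/

# Assumption: SFWrapper wraps a raw retro env and takes reset/side options;
# the actual constructor may differ, see retro_wrappers.py.
raw_env = retro.make(
    "StreetFighterIISpecialChampionEdition-Genesis",
    state="Champion.Level1.RyuVsGuile",
    players=1,
)
env = SFWrapper(raw_env, reset_type="round", side="left")

obs, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # random policy, for illustration only
    obs, reward, terminated, truncated, info = env.step(action)
env.close()
```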
Algorithms (main/common/algorithms.py)
- IPPO: Independent PPO for multi-agent training with asymmetric learning rates
- LeaguePPO: PPO variant for League training with historical checkpoint management
- Both extend Stable-Baselines3 PPO with custom training loops
League Training (main/common/league.py)
- LeagueManager: Core League training logic from the AlphaStar pseudocode
- Payoff: Tracks win/loss statistics between policies
- NashEquilibriumECOSSolver: Computes Nash equilibria using the ECOS solver
- Supports PSRO (Policy-Space Response Oracles) and FSP (Fictitious Self-Play) variants
Training Scripts:
- train.py: Single-agent training vs built-in CPU
- ippo.py: Multi-agent IPPO/2Timescale training
- train_ma.py: League/PSRO/FSP training
- best_response.py: Exploiter training against a fixed opponent
- finetune.py: Curriculum learning
- evaluate_elo.py: ELO rating system for policy evaluation
Utilities (main/common/utils.py):
- SubprocVecEnv2P: Custom vectorized environment for multi-agent training
- VecTransposeImage2P: Image transposition for 2-player observations
- linear_schedule: Learning rate scheduling (see the sketch after this list)
- AnnealDenseCallback/AnnealAgressiveCallback: Reward shaping
- get_agent_enemy_hp: HP tracking for both perspectives
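For reference, linear_schedule in Stable-Baselines3-style code is usually a closure over the remaining training progress. The sketch below shows that common pattern; the repo's version in main/common/utils.py may differ in detail.

```python
from typing import Callable

def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Decay a hyperparameter linearly from initial_value to 0.

    Stable-Baselines3 calls the schedule with progress_remaining, which
    goes from 1.0 at the start of training to 0.0 at the end.
    """
    def schedule(progress_remaining: float) -> float:
        return progress_remaining * initial_value
    return schedule

# Illustrative usage:
# model = PPO("CnnPolicy", env, learning_rate=linear_schedule(2.5e-4))
```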
Action Space:
- Base: 12 discrete actions (directions + attacks)
- With combos: 15 actions (12 + 3 combo bits)
- transform-action=True: Converts to a MultiDiscrete space (recommended)
- Combos are encoded as binary sequences that get mapped to button presses (see the encoding sketch below)
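To make the encoding concrete, the sketch below shows the general idea: a MultiDiscrete action carries 12 binary button slots plus 3 combo bits, and each set combo bit expands into a scripted sequence of button-press frames. The button indices and combo table are placeholders, not the wrapper's actual tables.

```python
import numpy as np
from gymnasium.spaces import MultiDiscrete

# Assumption: 12 binary button slots + 3 combo bits, as described above.
action_space = MultiDiscrete([2] * 15)

# Hypothetical combo table: each combo bit expands into a short sequence of
# 12-dim button-press arrays executed over consecutive emulator frames.
COMBOS = [
    [np.eye(12, dtype=np.int8)[i] for i in (5, 7, 0)],  # combo bit 0 (placeholder)
    [np.eye(12, dtype=np.int8)[i] for i in (5, 6, 0)],  # combo bit 1 (placeholder)
    [np.eye(12, dtype=np.int8)[i] for i in (4, 4, 1)],  # combo bit 2 (placeholder)
]

def to_button_presses(action: np.ndarray) -> list[np.ndarray]:
    """Expand a 15-dim MultiDiscrete action into per-frame button presses."""
    presses = [action[:12].astype(np.int8)]
    for bit, sequence in zip(action[12:], COMBOS):
        if bit:
            presses.extend(sequence)
    return presses

frames = to_button_presses(np.asarray(action_space.sample()))
```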
Observation Space:
- Stacked frames: Default 12 frames, downsampled by factor of 2
- Frame skipping: 8 steps by default
- Shape: (100, 128, 6) after stacking and downsampling
Reward Structure:
- Base: Dense reward from damage dealt and distance management
- Aggressive bonus: Encourages forward movement and attacks
- Rewards anneal over training to transition to sparse rewards (see the callback sketch after this list)
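The annealing is implemented by the AnnealDenseCallback/AnnealAgressiveCallback callbacks listed under Utilities. The sketch below captures the idea only; set_dense_coef is a hypothetical setter standing in for whatever hook the real wrapper exposes.

```python
from stable_baselines3.common.callbacks import BaseCallback

class AnnealRewardCallback(BaseCallback):
    """Linearly anneal a shaped-reward coefficient toward 0 (sparse reward)."""

    def __init__(self, total_timesteps: int, start_coef: float = 1.0):
        super().__init__()
        self.total_timesteps = total_timesteps
        self.start_coef = start_coef

    def _on_step(self) -> bool:
        progress = min(self.num_timesteps / self.total_timesteps, 1.0)
        coef = self.start_coef * (1.0 - progress)
        # Hypothetical setter on the wrapped envs; the real callback in
        # main/common/utils.py uses the wrapper's own API.
        self.training_env.env_method("set_dense_coef", coef)
        return True
```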
Reset Types:
- round: Reset after each round (fastest)
- match: Reset after a 2-round match
- game: Reset after full game completion
Game state files are stored in data/sf/StreetFighterIISpecialChampionEdition-Genesis/:
- .state files: Game state snapshots for consistent training
- stars/: Sub-directory for star-based difficulty states
- curriculum/: Curriculum learning state files
Required state files:
- Champion.Level1.RyuVsGuile.state: Single-player state
- Champion.RyuVsRyu.2Player.align.state: Two-player training state
- Champion.Select1P.Left.state and Champion.Select1P.Right.state: Character selection
Generate and refresh state files:
# Generate star state files
python generate_star_states.py
# Refresh star state files
python refresh_star_states.py
All training commands below are run from the main/ directory:
cd main
The commands organize training outputs by algorithm type for better clarity:
main/
├── trained_models/
│ ├── ppo_single_agent/ # Single-agent PPO vs CPU
│ │ ├── ppo_ryu_left_star1/
│ │ ├── ppo_ryu_left_star8/
│ │ └── ppo_ryu_right_star8/
│ ├── curriculum/ # Curriculum learning
│ ├── ippo/ # IPPO and 2Timescale
│ │ ├── ippo_ryu_2p_scale_1_0/
│ │ └── ippo_ryu_2p_scale_0_5/
│ ├── league/ # League training
│ ├── psro/ # PSRO training
│ ├── fsp/ # FSP training
│ └── best_response/ # Best response / exploiter
├── logs/ # Same structure as trained_models/
├── videos/ # Same structure as trained_models/
└── finetune/ # Same structure as trained_models/
This organization makes it easier to:
- 🔍 Compare different algorithms
- 📊 Organize experiments systematically
- 🚀 Scale to multiple runs with different seeds
# level: arcade opponent (1-15)
# star: CPU difficulty (1-8)
# side: left or right
python train.py --reset=round \
--level=${level} \
--star=${star} \
--side=${side} \
--model-name-prefix=ppo_ryu_${side}_L${level}_S${star} \
--save-dir=trained_models/ppo_single_agent/ppo_ryu_${side}_L${level}_S${star} \
--log-dir=logs/ppo_single_agent/ppo_ryu_${side}_L${level}_S${star} \
--video-dir=videos/ppo_single_agent/ppo_ryu_${side}_L${level}_S${star} \
--num-epoch=50 \
--enable-combo --null-combo --transform-action
Example (train the left agent on level 3, star 2):
python train.py --reset=round \
--level=3 \
--star=2 \
--side=left \
--model-name-prefix=ppo_ryu_left_L3_S2 \
--save-dir=trained_models/ppo_single_agent/ppo_ryu_left_L3_S2 \
--log-dir=logs/ppo_single_agent/ppo_ryu_left_L3_S2 \
--video-dir=videos/ppo_single_agent/ppo_ryu_left_L3_S2 \
--num-epoch=50 \
--num-env=32 \
--enable-combo --null-combo --transform-action
You can still pass --state manually for custom checkpoints. When --level is provided, the script auto-resolves the appropriate state file under data/sf/StreetFighterIISpecialChampionEdition-Genesis/stars/.
python finetune.py --reset=round \
--model-name-prefix=ppo_ryu_curriculum \
--save-dir=trained_models/curriculum/ppo_ryu_curriculum \
--log-dir=logs/curriculum/ppo_ryu_curriculum \
--video-dir=videos/curriculum/ppo_ryu_curriculum \
--finetune-dir=finetune/curriculum/ppo_ryu_curriculum \
--num-epoch=25 \
--enable-combo --null-combo --transform-action
# Replace ${task}, ${scale}, and ${seed} with actual values
# task: round, match, or game
# scale: 1 (IPPO) or other values for 2Timescale (e.g., 0.5, 2.0)
# seed: random seed (e.g., 0, 1, 2)
python ippo.py --reset=round \
--model-name-prefix=ippo_ryu_2p_scale_${scale}_${seed} \
--save-dir=trained_models/ippo/ippo_ryu_2p_scale_${scale}_${seed} \
--log-dir=logs/ippo/ippo_ryu_2p_scale_${scale}_${seed} \
--video-dir=videos/ippo/ippo_ryu_2p_scale_${scale}_${seed} \
--finetune-dir=finetune/ippo/ippo_ryu_2p_scale_${scale}_${seed} \
--num-epoch=50 \
--enable-combo --null-combo --transform-action \
--other-timescale=${scale} \
--seed=${seed}
Example (IPPO with scale=1):
python ippo.py --reset=round \
--model-name-prefix=ippo_ryu_2p_scale_1_0 \
--save-dir=trained_models/ippo/ippo_ryu_2p_scale_1_0 \
--log-dir=logs/ippo/ippo_ryu_2p_scale_1_0 \
--video-dir=videos/ippo/ippo_ryu_2p_scale_1_0 \
--finetune-dir=finetune/ippo/ippo_ryu_2p_scale_1_0 \
--num-epoch=50 \
--num-env=16 \
--enable-combo --null-combo --transform-action \
--other-timescale=1 \
--seed=0
# Basic League training (requires pre-trained initial policies)
python train_ma.py --reset=round \
--save-dir=trained_models/league/league_ryu_seed_0 \
--log-dir=logs/league/league_ryu_seed_0 \
--left-model-file=trained_models/ppo_single_agent/ppo_ryu_left_star8/ppo_ryu_left_star8_final_steps \
--right-model-file=trained_models/ppo_single_agent/ppo_ryu_right_star8/ppo_ryu_right_star8_final_steps \
--enable-combo --null-combo --transform-action \
--seed=0
# For PSRO: add --psro-league flag
python train_ma.py --reset=round \
--save-dir=trained_models/psro/psro_ryu_seed_0 \
--log-dir=logs/psro/psro_ryu_seed_0 \
--left-model-file=trained_models/ppo_single_agent/ppo_ryu_left_star8/ppo_ryu_left_star8_final_steps \
--right-model-file=trained_models/ppo_single_agent/ppo_ryu_right_star8/ppo_ryu_right_star8_final_steps \
--enable-combo --null-combo --transform-action \
--psro-league \
--seed=0
# For FSP: add --fsp-league flag
python train_ma.py --reset=round \
--save-dir=trained_models/fsp/fsp_ryu_seed_0 \
--log-dir=logs/fsp/fsp_ryu_seed_0 \
--left-model-file=trained_models/ppo_single_agent/ppo_ryu_left_star8/ppo_ryu_left_star8_final_steps \
--right-model-file=trained_models/ppo_single_agent/ppo_ryu_right_star8/ppo_ryu_right_star8_final_steps \
--enable-combo --null-combo --transform-action \
--fsp-league \
--seed=0
# Train an exploiter against a fixed opponent policy
# Use --update-right=0 to freeze the right policy (exploit it)
# Use --update-left=0 to freeze the left policy (exploit it)
python best_response.py --reset=round \
--model-name-prefix=br_opponent1_seed_0 \
--save-dir=trained_models/best_response/opponent1/seed_0 \
--log-dir=logs/best_response/opponent1/seed_0 \
--video-dir=videos/best_response/opponent1/seed_0 \
--finetune-dir=finetune/best_response/opponent1/seed_0 \
--model-file=path/to/opponent_model \
--num-epoch=50 \
--enable-combo --null-combo --transform-action \
--update-right=0 \
--seed=0
# Alternative: Load separate left and right policies
python best_response.py --reset=round \
--model-name-prefix=br_mixed_seed_0 \
--save-dir=trained_models/best_response/mixed/seed_0 \
--log-dir=logs/best_response/mixed/seed_0 \
--left-model-file=path/to/left_model \
--right-model-file=path/to/right_model \
--num-epoch=50 \
--enable-combo --null-combo --transform-action \
--update-right=0 \
--seed=0
# Interactive play mode
# Edit MODEL_PATH in play_with_ai.py before running
# Key mappings are defined in common/interactive.py
python play_with_ai.py
Note: You can integrate your own games by implementing a wrapper environment similar to main/common/retro_wrappers.py.
Common Arguments:
- --reset: Determines when to reset the environment
  - round: Reset after each round (fastest training)
  - match: Reset after a 2-round match
  - game: Reset after full game completion
- --enable-combo: Enables special move combos in the action space
- --null-combo: Adds a null combo action (no special move)
- --transform-action: Transforms the action space to MultiDiscrete (recommended)
Model Management:
- Models are saved periodically during training
- Final model: {save-dir}/{model-name-prefix}_final_steps
- Use --model-file to resume training from a checkpoint
Multi-Agent Specific:
- --other-timescale: Learning rate scale for the second agent (2Timescale method)
  - Value 1.0 = IPPO (both agents learn at the same rate)
  - Value < 1.0 = Second agent learns slower
  - Value > 1.0 = Second agent learns faster
- --update-left / --update-right: Control which agent to train (0 = freeze, 1 = train)
The IPPO class supports asymmetric learning rates:
- update_left/update_right: Boolean flags to freeze specific agents
- other_learning_rate: Learning rate scale for the second agent (see the two-learner sketch after this list)
- Scale = 1.0: Standard IPPO (symmetric learning)
- Scale < 1.0: Second agent learns slower
- Scale > 1.0: Second agent learns faster
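Conceptually, the 2Timescale setting just gives the two learners different learning rates. The repo does this inside its IPPO class; the sketch below illustrates the idea with two plain Stable-Baselines3 PPO learners on a toy environment and is not the repo's actual training loop.

```python
import gymnasium as gym
from stable_baselines3 import PPO

base_lr = 2.5e-4
scale = 0.5  # --other-timescale: the second learner trains at half the rate

# Toy single-agent envs stand in for the two sides of the 2-player wrapper.
left_env = gym.make("CartPole-v1")
right_env = gym.make("CartPole-v1")

left = PPO("MlpPolicy", left_env, learning_rate=base_lr)
right = PPO("MlpPolicy", right_env, learning_rate=base_lr * scale)

# In IPPO both learners are updated on their own trajectories each iteration;
# scale = 1.0 recovers symmetric IPPO, any other value gives 2Timescale.
```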
League implementation follows the AlphaStar pseudocode in main/common/pseudocode/:
- alphastar.py: Core League algorithm pseudocode
- multiagent.py: Multi-agent extensions
- rl.py: RL-specific utilities
- supervised.py: Supervised learning components
Training logs are saved to the --log-dir directory and can be visualized with TensorBoard:
tensorboard --logdir=logs/
Videos of agent performance are saved to --video-dir during evaluation phases.
- Single-Agent vs CPU (50 epochs): ~4-8 hours on modern GPU
- IPPO (50 epochs): ~8-12 hours on modern GPU
- League Training: Varies significantly based on configuration
- Minimum: NVIDIA GTX 1060 (6GB VRAM)
- Recommended: NVIDIA RTX 3060 or better (12GB+ VRAM)
- For League training: RTX 3080 or better recommended
- Reduce --num-env if running out of memory
- Frame stacking and frame skipping reduce per-step computation
- Use transform-action=True for more efficient action processing
- Each agent trains independently with PPO
- Supports asymmetric learning rates via other_timescale
- Environment is single-environment, multi-agent (not separate envs)
- Maintains a league of policies (main, league, and historical ancestors)
- Uses Nash equilibrium for match selection
- Payoff matrix tracks win rates between all policies (a minimal tracker is sketched after this list)
- Periodically adds new policies based on performance
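For intuition, a minimal win-rate tracker in the spirit of the Payoff class might look like the sketch below; the real class in main/common/league.py keeps more bookkeeping and feeds its matrix to the Nash solver.

```python
from collections import defaultdict

class SimplePayoff:
    """Track pairwise results and expose empirical win rates."""

    def __init__(self):
        self.wins = defaultdict(int)
        self.games = defaultdict(int)

    def update(self, player: str, opponent: str, player_won: bool) -> None:
        self.games[(player, opponent)] += 1
        self.wins[(player, opponent)] += int(player_won)

    def winrate(self, player: str, opponent: str) -> float:
        n = self.games[(player, opponent)]
        return self.wins[(player, opponent)] / n if n else 0.5  # even prior

payoff = SimplePayoff()
payoff.update("main_v3", "exploiter_v1", player_won=True)
print(payoff.winrate("main_v3", "exploiter_v1"))  # 1.0 after one game
```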
- Extension of IPPO with different learning rates
- Can have one agent learn faster than the other
- Useful for asymmetric game scenarios
- Trains an agent to exploit a fixed opponent policy
- Use --update-left=0 or --update-right=0 to freeze opponents
- Supports loading separate left/right policies
- compute_nash(): Nash equilibrium computation using the ECOS solver (an LP-based sketch follows this list)
- Used in League training for strategy selection
- Handles payoff matrices from multi-agent interactions
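The repo's compute_nash() relies on ECOS; the same quantity can also be obtained from a plain linear program. The sketch below computes the row player's Nash (maxmin) strategy for a two-player zero-sum payoff matrix with scipy.optimize.linprog, as an illustrative alternative rather than the repo's solver.

```python
import numpy as np
from scipy.optimize import linprog

def zero_sum_nash(payoff: np.ndarray) -> np.ndarray:
    """Row player's maxmin (Nash) strategy for a zero-sum payoff matrix.

    payoff[i, j] is the row player's expected payoff when playing row i
    against column j (e.g. a win-rate matrix re-centred around 0).
    """
    n_rows, n_cols = payoff.shape
    # Variables: x_1..x_n (mixed strategy), v (game value). Minimise -v.
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0
    # For every opponent column j: v - sum_i payoff[i, j] * x_i <= 0.
    A_ub = np.hstack([-payoff.T, np.ones((n_cols, 1))])
    b_ub = np.zeros(n_cols)
    # Probabilities sum to one.
    A_eq = np.append(np.ones(n_rows), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_rows + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:-1]

# Rock-paper-scissors sanity check: the Nash strategy is uniform.
rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
print(zero_sum_nash(rps))  # ~[1/3, 1/3, 1/3]
```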
python train.py --model-file=path/to/existing/model \
--save-dir=new/save/directory ...
- Models are saved as PyTorch .zip files
- Compatible with stable_baselines3.PPO.load()
- Use evaluate() functions for policy evaluation
Final model format: {save-dir}/{model-name-prefix}_final_steps
Models are saved at regular intervals (controlled by --save-freq in callbacks).
from stable_baselines3 import PPO
model = PPO.load("path/to/model.zip")
obs, info = env.reset()
action, _ = model.predict(obs, deterministic=True)
Q: ImportError: No module named 'gym'
A: This project now uses Gymnasium. Make sure you have installed gymnasium==1.2.1 instead of the old gym package.
Q: Environment reset returns tuple instead of observation
A: This is expected behavior in Gymnasium. The code handles both formats automatically.
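If you ever mix older Gym-style environments with Gymnasium ones in your own code, a small shim like the one below (a generic pattern, not code from this repo) normalizes the reset signature.

```python
def reset_compat(env, **kwargs):
    """Return (obs, info) whether env follows the old Gym or the Gymnasium API."""
    result = env.reset(**kwargs)
    if isinstance(result, tuple) and len(result) == 2:
        return result   # Gymnasium: (observation, info)
    return result, {}   # Old Gym: observation only
```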
Q: Render mode errors / TypeError: render() got an unexpected keyword argument 'mode'
A: The render API has been updated. The code now correctly uses render_mode during environment creation. Make sure you're using the latest version of the code.
Q: Python version compatibility
A: Python 3.11+ is required. The project is configured to work with Python 3.11 and later versions.
Q: ROM not found error
A: Make sure you have properly set up the ROM file as rom.md in the stable-retro game directory. Follow the ROM setup instructions in the Setup section above.
Q: CUDA/GPU not detected
A: Install PyTorch with CUDA support:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Q: Training is very slow
A:
- Ensure the GPU is being used (check with nvidia-smi)
- Reduce --num-env if running out of memory
- Consider using --reset=round for faster training iterations
Q: Model files are very large
A: This is normal. PPO models with CNN backbones can be several hundred MB. Consider cleaning up intermediate checkpoints if disk space is limited.
Q: Video recording hangs
A: Video recording can slow down training. Set record=False in evaluation for faster testing.
Q: Migration from old code
A: The project has migrated from OpenAI Gym to Gymnasium. Key changes include:
- env.reset() now returns an (observation, info) tuple
- env.step() returns (observation, reward, terminated, truncated, info)
- Use the render_mode parameter during environment creation instead of mode
Q: How should I organize my training results?
A: We recommend using the algorithm-based directory structure as shown in the "Directory Structure for Training Results" section above. This organizes models, logs, and videos by algorithm type (ppo_single_agent, ippo, league, etc.).
Q: Gymnasium API differences from OpenAI Gym
A: The project now uses the Gymnasium API:
# Old (Gym)
obs = env.reset()
obs, reward, done, info = env.step(action)
# New (Gymnasium)
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated
If you find our repo useful, please consider citing our work:
@inproceedings{lifightladder,
title={FightLadder: A Benchmark for Competitive Multi-Agent Reinforcement Learning},
author={Li, Wenzhe and Ding, Zihan and Karten, Seth and Jin, Chi},
booktitle={Forty-first International Conference on Machine Learning}
}
This architecture enables comprehensive multi-agent RL research in competitive gaming environments, with flexible training modes and thorough evaluation capabilities. FightLadder provides:
- Modular Design: Each component (environment, algorithms, training scripts) is independently extensible
- Multiple Training Paradigms: From simple single-agent training to complex League-based approaches
- Research-Ready: Nash equilibrium computation, ELO rating system, and comprehensive logging
- Production-Ready: Robust error handling, memory optimization, and scalable training workflows
Whether you're conducting academic research or developing AI for competitive gaming, FightLadder provides the tools and infrastructure needed for advanced multi-agent reinforcement learning experiments.