FightLadder: A Benchmark for Competitive Multi-Agent Reinforcement Learning

Latest Version:

  • Gymnasium 1.2.1 (migrated from OpenAI Gym)
  • Stable-Baselines3 2.7.0 (PPO as base algorithm)
  • Stable-Retro 0.9.5 (for game emulation)
  • PyTorch for deep learning
  • Nash equilibrium computation using scipy/ecos

FightLadder is a comprehensive benchmark for competitive multi-agent reinforcement learning, built on Street Fighter II: Special Champion Edition. It provides implementations of various multi-agent RL algorithms including IPPO, League training, PSRO, FSP, and Best Response methods, enabling advanced research in competitive gaming environments.

Setup

Platform: Linux

Python: 3.11+ (Required)

Create environment:

conda create -n fightladder python=3.11
conda activate fightladder

Install dependencies:

Core dependencies (Required):

# Reinforcement Learning frameworks
pip install gymnasium==1.2.1
pip install stable-retro==0.9.5
pip install stable-baselines3==2.7.0

# Deep learning backend
pip install torch torchvision torchaudio

# Scientific computing and Nash equilibrium solver
pip install numpy scipy ecos

# Image and video processing
pip install pillow av pyglet

Optional dependencies (Recommended):

For training monitoring and visualization:

pip install tensorboard matplotlib pandas tqdm

For system monitoring:

pip install gpustat nvidia-ml-py psutil

For hyperparameter optimization:

pip install optuna scikit-learn

Quick install (all dependencies):

pip install gymnasium==1.2.1 stable-retro==0.9.5 stable-baselines3==2.7.0 \
    torch torchvision torchaudio \
    numpy scipy ecos pillow av pyglet \
    tensorboard matplotlib pandas tqdm

Note: You can install optional dependencies as needed:

  • For system monitoring: gpustat nvidia-ml-py psutil
  • For hyperparameter optimization: optuna scikit-learn

Note:

  • scipy is required for Nash equilibrium computation in main/common/nash.py
  • ecos is the solver used for computing Nash equilibrium strategies
  • av is needed for video encoding/decoding functionality

Locate the stable-retro game folder:

import os
import retro

retro_directory = os.path.dirname(retro.__file__)
game_dir = "data/stable/StreetFighterIISpecialChampionEdition-Genesis"
print(os.path.join(retro_directory, game_dir))

Setup ROM and State Files:

⚠️ Important: You need to legally obtain and install the Street Fighter II ROM file before running any training.

ROM Setup Instructions:

  1. Find the stable-retro game directory:
python3 -c "import os, retro; print(os.path.join(os.path.dirname(retro.__file__), 'data/stable/StreetFighterIISpecialChampionEdition-Genesis'))"
  2. Copy your legally obtained ROM file to that directory as rom.md
  3. Copy state files from data/sf/StreetFighterIISpecialChampionEdition-Genesis/ to the retro directory
  4. Verify the installation:
python3 -c "import retro; retro.make('StreetFighterIISpecialChampionEdition-Genesis')"

Required State Files:

  • Champion.Level1.RyuVsGuile.state: Single-player training state
  • Champion.RyuVsRyu.2Player.align.state: Two-player training state
  • Champion.Select1P.Left.state and Champion.Select1P.Right.state: Character selection states

Disclaimer: We are unable to provide you with any game ROMs. It is the user's own legal responsibility to acquire a game ROM for emulation. This library should only be used for non-commercial research purposes.
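
Once the ROM and state files are installed, you can also load a specific training state directly as a quick check. This is a minimal sketch; it assumes the state name is passed without the .state extension and that players=2 selects the two-player state, following the usual stable-retro conventions:

import retro

# Single-player training state (vs. the built-in CPU)
env = retro.make('StreetFighterIISpecialChampionEdition-Genesis',
                 state='Champion.Level1.RyuVsGuile')
obs, info = env.reset()
env.close()  # close before creating another emulator in the same process

# Two-player training state used for multi-agent training
env_2p = retro.make('StreetFighterIISpecialChampionEdition-Genesis',
                    state='Champion.RyuVsRyu.2Player.align',
                    players=2)
obs, info = env_2p.reset()
env_2p.close()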

Key concepts

The environment is specified in main/common/retro_wrappers.py. It tracks the internal state of the game and is compatible with the Gymnasium interface and popular RL packages such as Stable-Baselines3.

The algorithms are implemented in main/common/algorithms.py and main/common/league.py. Specifically, the IPPO class in algorithms.py implements both the IPPO and 2Timescale methods, while League, PSRO, and FSP are implemented in league.py. We use PPO from Stable-Baselines3 as the backbone algorithm for all of these implementations. The League implementation adapts the pseudocode in main/common/pseudocode, which comes from the prior work AlphaStar.

Code Architecture

Core Components

Environment Wrapper (main/common/retro_wrappers.py)

  • SFWrapper: Main environment wrapper that adapts stable-retro for RL training (a rough sketch of this wrapper pattern follows the list)
  • Handles frame stacking (default: 12 frames with a 4-frame step)
  • Custom action space with combo support
  • Reward shaping with aggressive and dense rewards
  • Supports both single-agent and multi-agent modes
  • Compatible with Gymnasium interface
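
The actual wrapper logic lives in retro_wrappers.py. As a rough, hedged illustration of the pattern it follows (frame skipping, downsampling, and stacking behind the Gymnasium API), a generic sketch might look like the following; the class name, defaults, and processing are illustrative, not taken from the repo:

import gymnasium as gym
import numpy as np
from collections import deque

class FrameStackSketch(gym.Wrapper):
    """Illustrative only: skip frames, downsample, and stack observations."""

    def __init__(self, env, num_stack=3, skip=4, downsample=2):
        super().__init__(env)
        self.num_stack, self.skip, self.downsample = num_stack, skip, downsample
        self.frames = deque(maxlen=num_stack)
        h, w, c = env.observation_space.shape
        self.observation_space = gym.spaces.Box(
            low=0, high=255,
            shape=((h + downsample - 1) // downsample,
                   (w + downsample - 1) // downsample,
                   c * num_stack),
            dtype=np.uint8)

    def _process(self, obs):
        # Naive spatial downsampling by striding.
        return obs[::self.downsample, ::self.downsample, :]

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        frame = self._process(obs)
        for _ in range(self.num_stack):
            self.frames.append(frame)
        return np.concatenate(self.frames, axis=-1), info

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.skip):
            obs, reward, terminated, truncated, info = self.env.step(action)
            total_reward += reward
            if terminated or truncated:
                break
        self.frames.append(self._process(obs))
        stacked = np.concatenate(self.frames, axis=-1)
        return stacked, total_reward, terminated, truncated, info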

Algorithms (main/common/algorithms.py)

  • IPPO: Independent PPO for multi-agent training with asymmetric learning rates
  • LeaguePPO: PPO variant for League training with historical checkpoint management
  • Both extend Stable-Baselines3 PPO with custom training loops

League Training (main/common/league.py)

  • LeagueManager: Core League training logic from AlphaStar pseudocode
  • Payoff: Tracks win/loss statistics between policies (a minimal sketch of this bookkeeping follows the list)
  • NashEquilibriumECOSSolver: Computes Nash equilibria using ECOS solver
  • Supports PSRO (Policy-Space Response Oracles) and FSP (Fictitious Self-Play) variants
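
For intuition, the kind of bookkeeping the Payoff component performs can be sketched as below; the class and method names here are illustrative, not the repo's exact interface:

from collections import defaultdict

class PayoffSketch:
    """Minimal illustration: track per-pair game counts and wins, expose win rates."""

    def __init__(self):
        self.wins = defaultdict(int)
        self.games = defaultdict(int)

    def update(self, home: str, away: str, home_won: bool) -> None:
        self.games[(home, away)] += 1
        if home_won:
            self.wins[(home, away)] += 1

    def win_rate(self, home: str, away: str) -> float:
        n = self.games[(home, away)]
        return self.wins[(home, away)] / n if n else 0.5  # neutral prior before any games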

Training Scripts:

  • train.py: Single-agent training vs built-in CPU
  • ippo.py: Multi-agent IPPO/2Timescale training
  • train_ma.py: League/PSRO/FSP training
  • best_response.py: Exploiter training against fixed opponent
  • finetune.py: Curriculum learning
  • evaluate_elo.py: Elo rating system for policy evaluation (the standard Elo update is sketched after this list)
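
For reference, evaluate_elo.py rates policies with the Elo system; the standard Elo update looks like the following sketch (the K factor and ratings are illustrative, and the script's exact constants may differ):

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    expected_b = 1.0 - expected_a
    return r_a + k * (score_a - expected_a), r_b + k * ((1.0 - score_a) - expected_b)

print(elo_update(1200.0, 1200.0, 1.0))  # the winner gains exactly 16 points at equal ratings with k=32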

Utilities (main/common/utils.py):

  • SubprocVecEnv2P: Custom vectorized environment for multi-agent training
  • VecTransposeImage2P: Image transposition for 2-player observations
  • linear_schedule: Learning-rate scheduling (sketched after this list)
  • AnnealDenseCallback / AnnealAgressiveCallback: Reward shaping
  • get_agent_enemy_hp: HP tracking for both perspectives
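
For reference, Stable-Baselines3 accepts a learning-rate schedule as a callable of the remaining training progress (1.0 at the start, 0.0 at the end). A minimal linear schedule, which the repo's linear_schedule presumably resembles, is:

from typing import Callable

def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Map remaining progress (1.0 -> 0.0) to a linearly decayed learning rate."""
    def schedule(progress_remaining: float) -> float:
        return progress_remaining * initial_value
    return schedule

# Usage (sketch): PPO("CnnPolicy", env, learning_rate=linear_schedule(2.5e-4))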

Key Implementation Details

Action Space:

  • Base: 12 discrete actions (directions + attacks)
  • With combos: 15 actions (12 + 3 combo bits)
  • --transform-action: Converts the action space to MultiDiscrete (recommended; illustrated after this list)
  • Combos are encoded as binary sequences that get mapped to button presses
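
For illustration only (the space types and sizes below are assumptions based on the description above, not read from retro_wrappers.py), the encodings can be expressed with gymnasium.spaces roughly as follows:

from gymnasium import spaces

base_buttons = spaces.MultiBinary(12)          # 12 base action bits (directions + attacks)
with_combo_bits = spaces.MultiBinary(15)       # 12 bits + 3 combo bits (--enable-combo)
transformed = spaces.MultiDiscrete([2] * 15)   # --transform-action: one {0, 1} dimension per bit

print(transformed.sample())                    # e.g. an array of 15 zeros/ones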

Observation Space:

  • Stacked frames: 12 frames by default, downsampled by a factor of 2
  • Frame skipping: 8-frame step by default
  • Shape: (100, 128, 6) after stacking and downsampling

Reward Structure:

  • Base: Dense reward from damage dealt and distance management
  • Aggressive bonus: Encourages forward movement and attacks
  • Rewards anneal over training to transition toward sparse rewards (see the sketch after this list)
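
As a rough illustration (the repo's AnnealDenseCallback / AnnealAgressiveCallback may implement this differently), a linearly annealed shaping coefficient could look like:

def anneal_coefficient(current_step: int, anneal_steps: int) -> float:
    """Decay linearly from 1.0 to 0.0 so training transitions to the sparse win/loss signal."""
    return max(0.0, 1.0 - current_step / anneal_steps)

# shaped_total = sparse_reward + anneal_coefficient(step, total_anneal_steps) * dense_reward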

Reset Types:

  • round: Reset after each round (fastest)
  • match: Reset after 2-round match
  • game: Reset after full game completion

State Files Management

Game state files are stored in data/sf/StreetFighterIISpecialChampionEdition-Genesis/:

  • .state files: Game state snapshots for consistent training
  • stars/: Sub-directory for star-based difficulty states
  • curriculum/: Curriculum learning state files

Required state files:

  • Champion.Level1.RyuVsGuile.state: Single-player state
  • Champion.RyuVsRyu.2Player.align.state: Two-player training state
  • Champion.Select1P.Left.state and Champion.Select1P.Right.state: Character selection

Generate and refresh state files:

# Generate star state files
python generate_star_states.py

# Refresh star state files
python refresh_star_states.py

Run the experiment

⚠️ Note: All commands below should be run from the main/ directory:

cd main

Directory Structure for Training Results

The commands below organize training outputs by algorithm type for clarity:

main/
├── trained_models/
│   ├── ppo_single_agent/     # Single-agent PPO vs CPU
│   │   ├── ppo_ryu_left_star1/
│   │   ├── ppo_ryu_left_star8/
│   │   └── ppo_ryu_right_star8/
│   ├── curriculum/            # Curriculum learning
│   ├── ippo/                  # IPPO and 2Timescale
│   │   ├── ippo_ryu_2p_scale_1_0/
│   │   └── ippo_ryu_2p_scale_0_5/
│   ├── league/                # League training
│   ├── psro/                  # PSRO training
│   ├── fsp/                   # FSP training
│   └── best_response/         # Best response / exploiter
├── logs/                      # Same structure as trained_models/
├── videos/                    # Same structure as trained_models/
└── finetune/                  # Same structure as trained_models/

This organization makes it easier to:

  • 🔍 Compare different algorithms
  • 📊 Organize experiments systematically
  • 🚀 Scale to multiple runs with different seeds

Single-Agent RL against built-in CPU player:

# level: arcade opponent (1-15)
# star: CPU difficulty (1-8)
# side: left or right
python train.py --reset=round \
--level=${level} \
--star=${star} \
--side=${side} \
--model-name-prefix=ppo_ryu_${side}_L${level}_S${star} \
--save-dir=trained_models/ppo_single_agent/ppo_ryu_${side}_L${level}_S${star} \
--log-dir=logs/ppo_single_agent/ppo_ryu_${side}_L${level}_S${star} \
--video-dir=videos/ppo_single_agent/ppo_ryu_${side}_L${level}_S${star} \
--num-epoch=50 \
--enable-combo --null-combo --transform-action

Example (train the left agent on level 3, star 2):

python train.py --reset=round \
--level=3 \
--star=2 \
--side=left \
--model-name-prefix=ppo_ryu_left_L3_S2 \
--save-dir=trained_models/ppo_single_agent/ppo_ryu_left_L3_S2 \
--log-dir=logs/ppo_single_agent/ppo_ryu_left_L3_S2 \
--video-dir=videos/ppo_single_agent/ppo_ryu_left_L3_S2 \
--num-epoch=50 \
--num-env=32 \
--enable-combo --null-combo --transform-action

You can still pass --state manually for custom checkpoints. When --level is provided, the script auto-resolves the appropriate state file under data/sf/StreetFighterIISpecialChampionEdition-Genesis/stars/.

Curriculum Learning:

python finetune.py --reset=round \
--model-name-prefix=ppo_ryu_curriculum \
--save-dir=trained_models/curriculum/ppo_ryu_curriculum \
--log-dir=logs/curriculum/ppo_ryu_curriculum \
--video-dir=videos/curriculum/ppo_ryu_curriculum \
--finetune-dir=finetune/curriculum/ppo_ryu_curriculum \
--num-epoch=25 \
--enable-combo --null-combo --transform-action

Multi-Agent: IPPO / 2Timescale:

# Replace ${task}, ${scale}, and ${seed} with actual values
# task: round, match, or game
# scale: 1 (IPPO) or other values for 2Timescale (e.g., 0.5, 2.0)
# seed: random seed (e.g., 0, 1, 2)

python ippo.py --reset=round \
--model-name-prefix=ippo_ryu_2p_scale_${scale}_${seed} \
--save-dir=trained_models/ippo/ippo_ryu_2p_scale_${scale}_${seed} \
--log-dir=logs/ippo/ippo_ryu_2p_scale_${scale}_${seed} \
--video-dir=videos/ippo/ippo_ryu_2p_scale_${scale}_${seed} \
--finetune-dir=finetune/ippo/ippo_ryu_2p_scale_${scale}_${seed} \
--num-epoch=50 \
--enable-combo --null-combo --transform-action \
--other-timescale=${scale} \
--seed=${seed}

Example (IPPO with scale=1):

python ippo.py --reset=round \
--model-name-prefix=ippo_ryu_2p_scale_1_0 \
--save-dir=trained_models/ippo/ippo_ryu_2p_scale_1_0 \
--log-dir=logs/ippo/ippo_ryu_2p_scale_1_0 \
--video-dir=videos/ippo/ippo_ryu_2p_scale_1_0 \
--finetune-dir=finetune/ippo/ippo_ryu_2p_scale_1_0 \
--num-epoch=50 \
--num-env=16 \
--enable-combo --null-combo --transform-action \
--other-timescale=1 \
--seed=0

Multi-Agent: League / PSRO / FSP:

# Basic League training (requires pre-trained initial policies)
python train_ma.py --reset=round \
--save-dir=trained_models/league/league_ryu_seed_0 \
--log-dir=logs/league/league_ryu_seed_0 \
--left-model-file=trained_models/ppo_single_agent/ppo_ryu_left_star8/ppo_ryu_left_star8_final_steps \
--right-model-file=trained_models/ppo_single_agent/ppo_ryu_right_star8/ppo_ryu_right_star8_final_steps \
--enable-combo --null-combo --transform-action \
--seed=0

# For PSRO: add --psro-league flag
python train_ma.py --reset=round \
--save-dir=trained_models/psro/psro_ryu_seed_0 \
--log-dir=logs/psro/psro_ryu_seed_0 \
--left-model-file=trained_models/ppo_single_agent/ppo_ryu_left_star8/ppo_ryu_left_star8_final_steps \
--right-model-file=trained_models/ppo_single_agent/ppo_ryu_right_star8/ppo_ryu_right_star8_final_steps \
--enable-combo --null-combo --transform-action \
--psro-league \
--seed=0

# For FSP: add --fsp-league flag
python train_ma.py --reset=round \
--save-dir=trained_models/fsp/fsp_ryu_seed_0 \
--log-dir=logs/fsp/fsp_ryu_seed_0 \
--left-model-file=trained_models/ppo_single_agent/ppo_ryu_left_star8/ppo_ryu_left_star8_final_steps \
--right-model-file=trained_models/ppo_single_agent/ppo_ryu_right_star8/ppo_ryu_right_star8_final_steps \
--enable-combo --null-combo --transform-action \
--fsp-league \
--seed=0

Best Response / Exploiter Training:

# Train an exploiter against a fixed opponent policy
# Use --update-right=0 to freeze the right policy (exploit it)
# Use --update-left=0 to freeze the left policy (exploit it)

python best_response.py --reset=round \
--model-name-prefix=br_opponent1_seed_0 \
--save-dir=trained_models/best_response/opponent1/seed_0 \
--log-dir=logs/best_response/opponent1/seed_0 \
--video-dir=videos/best_response/opponent1/seed_0 \
--finetune-dir=finetune/best_response/opponent1/seed_0 \
--model-file=path/to/opponent_model \
--num-epoch=50 \
--enable-combo --null-combo --transform-action \
--update-right=0 \
--seed=0

# Alternative: Load separate left and right policies
python best_response.py --reset=round \
--model-name-prefix=br_mixed_seed_0 \
--save-dir=trained_models/best_response/mixed/seed_0 \
--log-dir=logs/best_response/mixed/seed_0 \
--left-model-file=path/to/left_model \
--right-model-file=path/to/right_model \
--num-epoch=50 \
--enable-combo --null-combo --transform-action \
--update-right=0 \
--seed=0

Play with Trained Policies:

# Interactive play mode
# Edit MODEL_PATH in play_with_ai.py before running
# Key mappings are defined in common/interactive.py
python play_with_ai.py

Note: You can integrate your own games by implementing a wrapper environment similar to main/common/retro_wrappers.py.

Training Tips

Understanding Arguments

Common Arguments:

  • --reset: Determines when to reset the environment

    • round: Reset after each round (fastest training)
    • match: Reset after 2-round match
    • game: Reset after full game completion
  • --enable-combo: Enables special move combos in action space

  • --null-combo: Adds null combo action (no special move)

  • --transform-action: Transforms action space to MultiDiscrete (recommended)

Model Management:

  • Models are saved periodically during training
  • Final model: {save-dir}/{model-name-prefix}_final_steps
  • Use --model-file to resume training from a checkpoint

Multi-Agent Specific:

  • --other-timescale: Learning rate scale for the second agent (2Timescale method)

    • Value 1.0 = IPPO (both agents learn at the same rate)
    • Value < 1.0 = Second agent learns slower
    • Value > 1.0 = Second agent learns faster
  • --update-left / --update-right: Control which agent to train (0 = freeze, 1 = train)

Multi-Agent Asymmetric Training

The IPPO class supports asymmetric learning rates:

  • update_left / update_right: Boolean flags to freeze specific agents
  • other_learning_rate: Learning rate scale for the second agent
  • Scale = 1.0: Standard IPPO (symmetric learning)
  • Scale < 1.0: Second agent learns slower
  • Scale > 1.0: Second agent learns faster

League Training Implementation

The League implementation follows the AlphaStar pseudocode in main/common/pseudocode/:

  • alphastar.py: Core League algorithm pseudocode
  • multiagent.py: Multi-agent extensions
  • rl.py: RL-specific utilities
  • supervised.py: Supervised learning components

Monitoring Training

Training logs are saved to the --log-dir directory and can be visualized with TensorBoard:

tensorboard --logdir=logs/

Videos of agent performance are saved to --video-dir during evaluation phases.

Expected Training Time

  • Single-Agent vs CPU (50 epochs): ~4-8 hours on modern GPU
  • IPPO (50 epochs): ~8-12 hours on modern GPU
  • League Training: Varies significantly based on configuration

GPU Recommendations

  • Minimum: NVIDIA GTX 1060 (6GB VRAM)
  • Recommended: NVIDIA RTX 3060 or better (12GB+ VRAM)
  • For League training: RTX 3080 or better recommended

Memory Optimization

  • Reduce --num-env if running out of memory
  • Frame stacking and frame skipping reduce per-step computation
  • Use --transform-action for more efficient action processing

Algorithm Details

IPPO (Independent PPO)

  • Each agent trains independently with PPO
  • Supports asymmetric learning rates via other_timescale
  • Both agents act in a single shared two-player environment (not separate per-agent environments)

League Training

  • Maintains a league of policies (main agents, league exploiters, and historical players)
  • Uses Nash equilibrium for match selection
  • Payoff matrix tracks win rates between all policies
  • Periodically adds new policies based on performance

2Timescale

  • Extension of IPPO with different learning rates
  • Can have one agent learn faster than the other
  • Useful for asymmetric game scenarios

Best Response

  • Trains an agent to exploit a fixed opponent policy
  • Use --update-left=0 or --update-right=0 to freeze opponents
  • Supports loading separate left/right policies

Nash Equilibrium Computation

  • compute_nash(): Computes Nash equilibria using the ECOS solver (a generic LP-based sketch follows this list)
  • Used in League training for strategy selection
  • Handles payoff matrices from multi-agent interactions
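
The repo's solver is built on ECOS. As a hedged, generic reference point (this is not the repo's compute_nash()), the row player's Nash strategy of a two-player zero-sum payoff matrix can also be recovered with a small linear program, here using scipy's HiGHS LP backend:

import numpy as np
from scipy.optimize import linprog

def zero_sum_nash(payoff: np.ndarray) -> np.ndarray:
    """Return the row player's maximin mixed strategy for `payoff` (row player's gains)."""
    n_rows, n_cols = payoff.shape
    # Shift payoffs so the game value is strictly positive.
    shifted = payoff + (1.0 - payoff.min())
    # With y = strategy / value: minimize sum(y) subject to shifted^T y >= 1, y >= 0.
    res = linprog(c=np.ones(n_rows),
                  A_ub=-shifted.T, b_ub=-np.ones(n_cols),
                  bounds=[(0, None)] * n_rows, method="highs")
    return res.x / res.x.sum()

print(zero_sum_nash(np.array([[1.0, -1.0], [-1.0, 1.0]])))  # matching pennies -> [0.5 0.5]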

Working with Existing Models

Resuming Training

python train.py --model-file=path/to/existing/model \
  --save-dir=new/save/directory ...

Loading Models for Evaluation

  • Models are saved as Stable-Baselines3 .zip archives (containing PyTorch weights)
  • Compatible with stable_baselines3.PPO.load()
  • Use evaluate() functions for policy evaluation

Model Checkpoint Format

Final model path: {save-dir}/{model-name-prefix}_final_steps. Intermediate checkpoints are saved at regular intervals (controlled by --save-freq in callbacks).

Model Inference

from stable_baselines3 import PPO

model = PPO.load("path/to/model.zip")
# `env` must be created beforehand, e.g. a wrapped FightLadder environment (see retro_wrappers.py)
obs, info = env.reset()  # Gymnasium API: reset() returns (observation, info)
action, _ = model.predict(obs, deterministic=True)

Troubleshooting

Common Issues

Q: ImportError: No module named 'gym'
A: This project now uses Gymnasium. Make sure you have installed gymnasium==1.2.1 instead of the old gym package.

Q: Environment reset returns tuple instead of observation
A: This is expected behavior in Gymnasium. The code handles both formats automatically.

Q: Render mode errors / TypeError: render() got an unexpected keyword argument 'mode'
A: The render API has been updated. The code now correctly uses render_mode during environment creation. Make sure you're using the latest version of the code.

Q: Python version compatibility
A: Python 3.11+ is required. The project is configured to work with Python 3.11 and later versions.

Q: ROM not found error
A: Make sure you have properly set up the ROM file as rom.md in the stable-retro game directory. Follow the ROM setup instructions in the Setup section above.

Q: CUDA/GPU not detected
A: Install PyTorch with CUDA support:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Q: Training is very slow
A:

  • Ensure GPU is being used (check with nvidia-smi)
  • Reduce --num-env if running out of memory
  • Consider using --reset=round for faster training iterations

Q: Model files are very large
A: This is normal. PPO models with CNN backbones can be several hundred MB. Consider cleaning up intermediate checkpoints if disk space is limited.

Q: Video recording hangs
A: Video recording can slow down training. Set record=False in evaluation for faster testing.

Q: Migration from old code
A: The project has migrated from OpenAI Gym to Gymnasium. Key changes include:

  • env.reset() now returns (observation, info) tuple
  • env.step() returns (observation, reward, terminated, truncated, info)
  • Use render_mode parameter during environment creation instead of mode

Q: How should I organize my training results?
A: We recommend using the algorithm-based directory structure as shown in the "Directory Structure for Training Results" section above. This organizes models, logs, and videos by algorithm type (ppo_single_agent, ippo, league, etc.).

Q: Gymnasium API differences from OpenAI Gym
A: The project now uses the Gymnasium API:

# Old (Gym)
obs = env.reset()
obs, reward, done, info = env.step(action)

# New (Gymnasium)
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated

Citation

If you find our repo useful, please consider citing our work:

@inproceedings{lifightladder,
  title={FightLadder: A Benchmark for Competitive Multi-Agent Reinforcement Learning},
  author={Li, Wenzhe and Ding, Zihan and Karten, Seth and Jin, Chi},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024}
}

Summary

This architecture enables comprehensive multi-agent RL research in competitive gaming environments, with flexible training modes and thorough evaluation capabilities. FightLadder provides:

  • Modular Design: Each component (environment, algorithms, training scripts) is independently extensible
  • Multiple Training Paradigms: From simple single-agent training to complex League-based approaches
  • Research-Ready: Nash equilibrium computation, ELO rating system, and comprehensive logging
  • Production-Ready: Robust error handling, memory optimization, and scalable training workflows

Whether you're conducting academic research or developing AI for competitive gaming, FightLadder provides the tools and infrastructure needed for advanced multi-agent reinforcement learning experiments.
