GRL: Game Reinforcement Learning for Post‑training LLMs

Game Reinforcement Learning (GRL) for post‑training large language models


GitHub | Website | Blog | arXiv | X (Twitter) | Discord

GRL (Game Reinforcement Learning) is an open‑source framework that post‑trains LLMs via multi‑turn reinforcement learning on games, yielding general gains across diverse benchmarks.

News

[2025/09/29] 🚀 Tunix integration: PPO multi‑turn training now runs on TPU via JAX, with a Sokoban PPO training example. For more details, see Tunix, a JAX‑native LLM post‑training library with TPU support.

[2025/08/27] 📢 We release GRL to reproduce the paper’s results and to demonstrate general gains across benchmarks by post‑training LLMs via reinforcement learning. Read the blog post here.

📖 Table of Contents

  • Installation
  • Training Examples
  • Hardware Configuration
  • Supported Games and Agents
  • Documentation
  • Acknowledgments
  • Citation
  • License

Installation

# clone the repo
git clone --recurse-submodules https://github.com/lmgame-org/GRL.git
cd GRL

# create a conda environment
conda create --name grl python=3.12
conda activate grl

# Submodule installation (pick ONE backend)
# EITHER Tunix (TPU/JAX):
#   bash scripts/install_submodules.sh --tunix
# OR VERL (GPU/PyTorch):
#   bash scripts/install_submodules.sh --verl
# Optional WebShop tooling only:
#   bash scripts/install_submodules.sh --webshop

# install GRL
pip install -e .

# export environment variables
export WANDB_API_KEY=your_wandb_api_key
export WANDB_ENTITY=your_wandb_entity
export HF_TOKEN=your_huggingface_token
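
Optionally, sanity-check the installation before moving on. This is an illustrative snippet: the grl module name is an assumption based on the repo name, so adjust the import if the package exposes a different one.

# verify the editable install and the exported variables (module name assumed)
python -c "import grl; print('GRL import OK')"
env | grep -E '^(WANDB_API_KEY|WANDB_ENTITY|HF_TOKEN)='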

Submodule Installation

Tunix Installation (only)

Use this if you plan to run TPU/JAX with Tunix only:

bash scripts/install_submodules.sh --tunix

VERL Installation (only)

Use this if you plan to run GPU/PyTorch with VERL only. The installation script will also install the pinned GPU dependencies (Torch and FlashAttention on Linux + CUDA) for you:

bash scripts/install_submodules.sh --verl

WebShop Installation (only)

Install WebShop tooling and prerequisites only:

bash scripts/install_submodules.sh --webshop

Optional: Install Datasets

If you want to reproduce the paper's results and validate performance on BIRD (SQL) or the full WebShop dataset, use the dataset loader script:

  • Load all supported datasets:
    bash scripts/load_dataset.sh --all
  • Load individual datasets:
    bash scripts/load_dataset.sh --bird
    bash scripts/load_dataset.sh --webshop

Notes:

  • The loader relies on scripts/load_dataset.py under the hood.
  • Set HF_TOKEN if any dataset requires authenticated downloads (see environment exports above).
  • If no dataset is selected, the script prints usage and exits.
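
A typical authenticated session might look like this (illustrative; the token value is a placeholder):

# authenticate first, then fetch the datasets individually
export HF_TOKEN=your_huggingface_token
bash scripts/load_dataset.sh --bird
bash scripts/load_dataset.sh --webshop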

Training Examples

Tunix Quick Test

Quickly run an end‑to‑end multi‑turn PPO rollout + training loop with Tunix (Qwen2.5‑0.5B‑Instruct on Sokoban). This uses minimal defaults and logs metrics to W&B.

Run the quick test (defaults to Qwen2.5-0.5B-Instruct; runs on 4 TPU v4 chips with a (2,2) mesh):

bash tunix_quick_training_example.sh
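
Before launching, you can optionally confirm that JAX sees the TPU chips (this assumes the Tunix backend from the installation step):

# list visible accelerator devices; expect 4 TPU devices on a v4 slice
python -c "import jax; print(jax.devices())"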

Adjust training hyperparameters (tunix_base.yaml)

Edit configs/tunix_base.yaml to tune training without touching code; a quick way to inspect these sections from the shell is shown after the notes below. Key sections:

  • rollout: agent grouping, validation set, filtering, reward normalization
  • ppo: PPO knobs (epochs, minibatch size, gamma, lambda, entropy, clip ratios, KL method)
  • training: optimizer (lr, betas, weight_decay), grad_accum, eval cadence, checkpointing
  • rollout_runtime: generation length and sampling (temperature, top_p/top_k)
  • model.repo_id: base model to download

Notes:

  • Set training.max_steps or training.eval_every_n_steps to positive integers to force values; use -1 to let the script compute defaults.
  • The script composes tunix_base.yaml with configs/agents.yaml via defaults and prints the merged configuration at startup.
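
For a quick orientation, you can list the config's top-level sections from the shell. This is a sketch that assumes the keys above are unindented top-level YAML entries, which may differ in the actual file:

# print the top-level section headers of the base config
grep -E '^(rollout|ppo|training|rollout_runtime|model):' configs/tunix_base.yaml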

Reproduce Training Results (VERL)

Uses VERL (PyTorch) on GPU.

Train on 6×6 (1‑box) Sokoban and evaluate transferability to Tetris, Blocksworld, and GSM8K.

bash verl_quick_training_example.sh
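
Before launching, optionally confirm that PyTorch sees your GPUs (this assumes the VERL backend installed the pinned Torch build):

# expect True followed by the number of visible NVIDIA GPUs
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"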

General gains in LLM ability from game RL training (paper-reported results):

[Table 4: Model performance on diverse tasks]

Examples of observed validation success rate curves:

[Figure: validation success rate curves]

Note: RL training results may fluctuate relative to reported results, but the overall trend and gains remain consistent.

Sokoban Agent Training:

bash examples/sokoban_ppo/qwen_7b.sh

Tetris Agent Training:

bash examples/tetris_ppo/qwen_7b.sh

Note: BirdAgent may wait on SQLite file readiness or locks; heavy SQL can stall rollouts and prolong validation.

Hardware Configuration

GRL supports both GPU and TPU training backends:

  • GPU (Torch + VERL): PyTorch-based training on NVIDIA GPUs via VERL.
  • TPU (JAX + Tunix): JAX-based training on Google TPU via Tunix.

GPU Configurations (Torch + VERL)

| GPU Type | GPUs | Agent Groups | Group Size | Total Agents | Default Model | Task |
| --- | --- | --- | --- | --- | --- | --- |
| A100 | 1 | 8 | 16 | 128 | Qwen/Qwen2.5-0.5B-Instruct | Sokoban |
| L40 | 1 | 4 | 8 | 32 | Qwen/Qwen2.5-0.5B-Instruct | Sokoban |
| A100 | 8 | 8 | 16 | 128 | Qwen/Qwen2.5-7B-Instruct | Sokoban |
| H200 | 4 | 8 | 16 | 128 | Qwen/Qwen2.5-7B-Instruct | Sokoban |
| A100 | 8 | 8 | 16 | 128 | Qwen/Qwen2.5-7B-Instruct | Tetris |

TPU Configurations (JAX + Tunix)

| TPU Type | Chips | Mesh (fsdp, tp) | Agent Groups | Group Size | Total Agents | Default Model | Task |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TPU v4 | 4 | (2,2) | 8 | 16 | 128 | Qwen/Qwen2.5-0.5B-Instruct | Sokoban |
| TPU v5p | 8 | (2,4) | 8 | 16 | 128 | Qwen/Qwen2.5-7B-Instruct | Sokoban |

Note: Total Agents = Agent Groups × Group Size. The framework scales automatically to the available hardware; adjust the parameters in the training scripts for best performance on your setup.

Supported Games and Agents

  • Sokoban: Puzzle‑solving requiring spatial reasoning (agent: sokobanAgent)
  • Tetris: Decision‑making and planning (agent: tetrisAgent)
  • GSM8K: Grade‑school math reasoning (agent: gsm8kAgent)
  • Blocksworld: Logical planning and manipulation (agent: blocksworldAgent)
  • WebShop: E‑commerce navigation and decision‑making (agent: webshopAgent)
  • BIRD (SQL): SQL query generation and database reasoning (agent: birdAgent)
  • AMC 2023: Competition math problems from AMC 2023 (agent: amc23Agent)
  • AIME 2024: Competition math problems from AIME 2024 (agent: aime24Agent)
  • AIME 2025: Competition math problems from AIME 2025 (agent: aime25Agent)
  • Minerva Math: Advanced math reasoning dataset (agent: minervamathAgent)
  • Math500: 500-problem subset of the MATH benchmark (agent: math500Agent)
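
To cross-check which agents exist in your checkout, you can grep the agent config. This is illustrative and assumes identifiers such as sokobanAgent appear literally in configs/agents.yaml:

# enumerate agent identifiers mentioned in the agents config
grep -oE '[A-Za-z0-9_]+Agent' configs/agents.yaml | sort -u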

Documentation

Acknowledgments

We gratefully acknowledge Tunix, a JAX-native LLM post-training library whose TPU support and JAX-first design enabled scalable multi-turn PPO training on TPU.

Our work is also powered by VERL, and we drew valuable insights from RAGEN that informed how we train multi-turn PPO in our experiments.

Citation

If you find this repository helpful, please cite:

@article{hu2025lmgame,
  title={lmgame-Bench: How Good are LLMs at Playing Games?},
  author={Hu, Lanxiang and Huo, Mingjia and Zhang, Yuxuan and Yu, Haoyang and Xing, Eric P and Stoica, Ion and Rosing, Tajana and Jin, Haojian and Zhang, Hao},
  journal={arXiv preprint arXiv:2505.15146},
  year={2025}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
