# Game Reinforcement Learning (GRL) for post‑training large language models
GRL is an open‑source framework that post‑trains LLMs via multi‑turn reinforcement learning on games, yielding general gains across diverse benchmarks.
## News

[2025/09/29] 🚀 Tunix integration: PPO multi‑turn training now runs on TPU via JAX, with a Sokoban PPO training example. For more details, see Tunix, a JAX‑native LLM post‑training library with TPU support.

[2025/08/27] 📢 We release GRL to reproduce the paper’s results and to demonstrate general gains across benchmarks by post‑training LLMs via reinforcement learning. Read the blog post here.
- News
- Installation
- Training Examples
- Supported Games and Agents
- Hardware Configuration
- Documentation
- Acknowledgments
- Citation
- License
## Installation

```bash
# clone the repo
git clone --recurse-submodules https://github.com/lmgame-org/GRL.git
cd GRL

# create a conda environment
conda create --name grl python=3.12
conda activate grl

# Submodule installation (pick ONE backend)
# EITHER Tunix (TPU/JAX):
# bash scripts/install_submodules.sh --tunix
# OR VERL (GPU/PyTorch):
# bash scripts/install_submodules.sh --verl
# Optional WebShop tooling only:
# bash scripts/install_submodules.sh --webshop

# install GRL
pip install -e .

# export environment variables
export WANDB_API_KEY=your_wandb_api_key
export WANDB_ENTITY=your_wandb_entity
export HF_TOKEN=your_huggingface_token
```
Use this if you plan to run TPU/JAX training with Tunix only:

```bash
bash scripts/install_submodules.sh --tunix
```
Use this if you plan to run GPU/PyTorch training with VERL only. The installation script also installs the pinned GPU dependencies (Torch and FlashAttention on Linux + CUDA):

```bash
bash scripts/install_submodules.sh --verl
```
Install WebShop tooling and prerequisites only:

```bash
bash scripts/install_submodules.sh --webshop
```
To reproduce the paper’s results and validate BIRD SQL or full‑dataset WebShop performance, use the dataset loader script:

- Load all supported datasets:

  ```bash
  bash scripts/load_dataset.sh --all
  ```

- Load individual datasets:

  ```bash
  bash scripts/load_dataset.sh --bird
  bash scripts/load_dataset.sh --webshop
  ```
Notes:

- The loader relies on `scripts/load_dataset.py` under the hood.
- Set `HF_TOKEN` if any dataset requires authenticated downloads (see the environment exports above).
- If no dataset is selected, the script prints usage and exits.
## Training Examples

Quickly run an end‑to‑end multi‑turn PPO rollout + training loop with Tunix (Qwen2.5‑0.5B‑Instruct on Sokoban). This uses minimal defaults and logs metrics to W&B.

```bash
bash tunix_quick_training_example.sh
```
Edit `configs/tunix_base.yaml` to tune training without touching code. Key sections (a sketch follows this list):

- `rollout`: agent grouping, validation set, filtering, reward normalization
- `ppo`: PPO knobs (epochs, minibatch, gamma, lambda, entropy, clip ratios, KL method)
- `training`: optimizer (lr, betas, weight_decay), grad_accum, eval cadence, checkpointing
- `rollout_runtime`: generation length and sampling (temperature, top_p/top_k)
- `model.repo_id`: base model to download
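A minimal sketch of how these sections fit together; the top‑level names come from the list above, while the keys inside each section are illustrative assumptions (check `configs/tunix_base.yaml` for the real schema and shipped defaults):

```yaml
# Illustrative layout only -- keys inside each section are assumptions,
# not the exact schema; consult configs/tunix_base.yaml for the defaults.
rollout:                    # agent grouping, validation set, filtering, reward normalization
  agent_group_num: 8
  agent_group_size: 16
ppo:                        # core PPO knobs
  gamma: 0.99               # discount factor
  lam: 0.95                 # GAE lambda
  clip_ratio: 0.2
training:                   # optimizer, grad_accum, eval cadence, checkpointing
  lr: 1.0e-6
  grad_accum: 4
  max_steps: -1             # -1 lets the script compute a default
rollout_runtime:            # generation length and sampling
  max_generation_length: 512
  temperature: 1.0
  top_p: 1.0
model:
  repo_id: Qwen/Qwen2.5-0.5B-Instruct
```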
Notes:

- Set `training.max_steps` or `training.eval_every_n_steps` to positive integers to force those values; use `-1` to let the script compute defaults.
- The script composes `tunix_base.yaml` with `configs/agents.yaml` via defaults and prints the merged configuration at startup (a hypothetical sketch of this composition follows).
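If the defaults composition follows the common Hydra‑style pattern (an assumption about the wiring, not something the repo confirms; the merged configuration printed at startup is the authoritative view), it would look roughly like:

```yaml
# Hypothetical sketch of a defaults list in tunix_base.yaml (assumption):
defaults:
  - agents     # merges in configs/agents.yaml
  - _self_     # values in this file are applied last
```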
The second quickstart uses VERL (PyTorch) on GPU: train on 6×6 (1‑box) Sokoban and evaluate transferability to Tetris, Blocksworld, and GSM8K.

```bash
bash verl_quick_training_example.sh
```

Note: RL training results may fluctuate relative to the reported results, but the overall trend and gains remain consistent.
Sokoban agent training:

```bash
bash examples/sokoban_ppo/qwen_7b.sh
```

Tetris agent training:

```bash
bash examples/tetris_ppo/qwen_7b.sh
```

Note: `birdAgent` may wait on SQLite file readiness or locks; heavy SQL can stall rollouts and prolong validation.
## Hardware Configuration

GRL supports both GPU and TPU training backends:

- GPU (Torch + VERL): PyTorch‑based training on NVIDIA GPUs via VERL.
- TPU (JAX + Tunix): JAX‑based training on Google TPUs via Tunix.
| GPU Type | GPUs | Agent Groups | Group Size | Total Agents | Default Model | Task |
|---|---|---|---|---|---|---|
| A100 | 1 | 8 | 16 | 128 | Qwen/Qwen2.5-0.5B-Instruct | Sokoban |
| L40 | 1 | 4 | 8 | 32 | Qwen/Qwen2.5-0.5B-Instruct | Sokoban |
| A100 | 8 | 8 | 16 | 128 | Qwen/Qwen2.5-7B-Instruct | Sokoban |
| H200 | 4 | 8 | 16 | 128 | Qwen/Qwen2.5-7B-Instruct | Sokoban |
| A100 | 8 | 8 | 16 | 128 | Qwen/Qwen2.5-7B-Instruct | Tetris |
| TPU Type | Chips | Mesh (fsdp, tp) | Agent Groups | Group Size | Total Agents | Default Model | Task |
|---|---|---|---|---|---|---|---|
| TPU v4 | 4 | (2, 2) | 8 | 16 | 128 | Qwen/Qwen2.5-0.5B-Instruct | Sokoban |
| TPU v5p | 8 | (2, 4) | 8 | 16 | 128 | Qwen/Qwen2.5-7B-Instruct | Sokoban |
Note: The framework automatically scales based on available hardware. Adjust parameters in the training scripts for best performance on your setup.
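In both tables, Agent Groups × Group Size gives Total Agents (for example, 8 × 16 = 128 on a single A100). A minimal sketch of scaling the rollout down for a smaller card, reusing the illustrative key names from the config sketch above (assumptions, not the exact schema):

```yaml
# Hypothetical scale-down for a single L40 (key names are assumptions):
rollout:
  agent_group_num: 4    # 4 groups x 8 agents per group = 32 total agents
  agent_group_size: 8
```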
## Supported Games and Agents

- Sokoban: Puzzle‑solving requiring spatial reasoning (agent: `sokobanAgent`)
- Tetris: Decision‑making and planning (agent: `tetrisAgent`)
- GSM8K: Grade‑school math reasoning (agent: `gsm8kAgent`)
- Blocksworld: Logical planning and manipulation (agent: `blocksworldAgent`)
- WebShop: E‑commerce navigation and decision‑making (agent: `webshopAgent`)
- BIRD (SQL): SQL query generation and database reasoning (agent: `birdAgent`)
- AMC 2023: Competition math problems from AMC 2023 (agent: `amc23Agent`)
- AIME 2024: Competition math problems from AIME 2024 (agent: `aime24Agent`)
- AIME 2025: Competition math problems from AIME 2025 (agent: `aime25Agent`)
- Minerva Math: Advanced math reasoning dataset (agent: `minervamathAgent`)
- Math500: Math word‑problem benchmark (agent: `math500Agent`)
## Documentation

- Tutorial
- System Design Overview - Architecture and design principles
- Development Guide - Contributing and development workflow
## Acknowledgments

We gratefully acknowledge Tunix, a JAX‑native LLM post‑training library whose TPU support and JAX‑first techniques enabled scalable multi‑turn PPO training on TPU. Our work is also powered by VERL, and we draw valuable insights from RAGEN that informed how we train multi‑turn PPO in our experiments.
## Citation

If you find this repository helpful, please kindly cite:

```bibtex
@article{hu2025lmgame,
  title={lmgame-Bench: How Good are LLMs at Playing Games?},
  author={Hu, Lanxiang and Huo, Mingjia and Zhang, Yuxuan and Yu, Haoyang and Xing, Eric P and Stoica, Ion and Rosing, Tajana and Jin, Haojian and Zhang, Hao},
  journal={arXiv preprint arXiv:2505.15146},
  year={2025}
}
```
## License

This project is licensed under the MIT License - see the LICENSE file for details.