GRL: Game Reinforcement Learning for Post‑training LLMs

Game Reinforcement Learning (GRL) for post‑training large language models


GitHub | Website | Blog | arXiv | X (Twitter) | Discord

GRL (Game Reinforcement Learning) is an open‑source framework that post‑trains LLMs via multi‑turn reinforcement learning on games, yielding general gains across diverse benchmarks.

News

[2025/09/29] 🚀 Tunix integration: PPO multi‑turn training now runs on TPU via JAX, with a Sokoban PPO training example. For more details, see Tunix, a JAX‑native LLM post‑training library with TPU support.

[2025/08/27] 📢 We release GRL to reproduce the paper’s results and to demonstrate general gains across benchmarks by post‑training LLMs via reinforcement learning. Read the blog post here.

📖 Table of Contents

  • Installation
  • Training Examples
  • Hardware Configuration
  • Supported Games and Agents
  • Documentation
  • Acknowledgments
  • Citation
  • License

Installation

# clone the repo
git clone --recurse-submodules https://github.com/lmgame-org/GRL.git
cd GRL

# create a conda environment
conda create --name grl python=3.12
conda activate grl

# Submodule installation (pick ONE backend)
# EITHER Tunix (TPU/JAX):
#   bash scripts/install_submodules.sh --tunix
# OR VERL (GPU/PyTorch):
#   bash scripts/install_submodules.sh --verl
# Optional WebShop tooling only:
#   bash scripts/install_submodules.sh --webshop

# install GRL
pip install -e .

# export environment variables
export WANDB_API_KEY=your_wandb_api_key
export WANDB_ENTITY=your_wandb_entity
export HF_TOKEN=your_huggingface_token
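
Optionally, sanity-check the installation before moving on. This is an illustrative snippet: the grl module name is an assumption based on the repo name, so adjust the import if the package exposes a different one.

# verify the editable install and the exported variables (module name assumed)
python -c "import grl; print('GRL import OK')"
env | grep -E '^(WANDB_API_KEY|WANDB_ENTITY|HF_TOKEN)='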

Submodule Installation

Tunix Installation (only)

Use this if you plan to run TPU/JAX with Tunix only:

bash scripts/install_submodules.sh --tunix

VERL Installation (only)

Use this if you plan to run GPU/PyTorch with VERL only. The installation script will also install the pinned GPU dependencies (Torch and FlashAttention on Linux + CUDA) for you:

bash scripts/install_submodules.sh --verl

WebShop Installation (only)

Install WebShop tooling and prerequisites only:

bash scripts/install_submodules.sh --webshop

Optional: Install Datasets

If you want to reproduce the paper's results and validate performance on BIRD (SQL) or the full WebShop dataset, use the dataset loader script:

  • Load all supported datasets:
    bash scripts/load_dataset.sh --all
  • Load individual datasets:
    bash scripts/load_dataset.sh --bird
    bash scripts/load_dataset.sh --webshop

Notes:

  • The loader relies on scripts/load_dataset.py under the hood.
  • Set HF_TOKEN if any dataset requires authenticated downloads (see environment exports above).
  • If no dataset is selected, the script prints usage and exits.
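
A typical authenticated session might look like this (illustrative; the token value is a placeholder):

# authenticate first, then fetch the datasets individually
export HF_TOKEN=your_huggingface_token
bash scripts/load_dataset.sh --bird
bash scripts/load_dataset.sh --webshop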

Training Examples

Tunix Quick Test

Quickly run an end‑to‑end multi‑turn PPO rollout + training loop with Tunix (Qwen2.5‑0.5B‑Instruct on Sokoban). This uses minimal defaults and logs metrics to W&B.

Run the quick test (defaults to Qwen2.5-0.5B-Instruct; runs on 4 TPU v4 chips with a (2,2) mesh):

bash tunix_quick_training_example.sh
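
Before launching, you can optionally confirm that JAX sees the TPU chips (this assumes the Tunix backend from the installation step):

# list visible accelerator devices; expect 4 TPU devices on a v4 slice
python -c "import jax; print(jax.devices())"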

Adjust training hyperparameters (tunix_base.yaml)

Edit configs/tunix_base.yaml to tune training without touching code; a quick way to inspect these sections from the shell is shown after the notes below. Key sections:

  • rollout: agent grouping, validation set, filtering, reward normalization
  • ppo: PPO knobs (epochs, minibatch size, gamma, lambda, entropy, clip ratios, KL method)
  • training: optimizer (lr, betas, weight_decay), grad_accum, eval cadence, checkpointing
  • rollout_runtime: generation length and sampling (temperature, top_p/top_k)
  • model.repo_id: base model to download

Notes:

  • Set training.max_steps or training.eval_every_n_steps to positive integers to force values; use -1 to let the script compute defaults.
  • The script composes tunix_base.yaml with configs/agents.yaml via defaults and prints the merged configuration at startup.
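
For a quick orientation, you can list the config's top-level sections from the shell. This is a sketch that assumes the keys above are unindented top-level YAML entries, which may differ in the actual file:

# print the top-level section headers of the base config
grep -E '^(rollout|ppo|training|rollout_runtime|model):' configs/tunix_base.yaml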

Reproduce Training Results (VERL)

Uses VERL (PyTorch) on GPU.

Train on 6×6 (1‑box) Sokoban and evaluate transferability to Tetris, Blocksworld, and GSM8K.

bash verl_quick_training_example.sh
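
Before launching, optionally confirm that PyTorch sees your GPUs (this assumes the VERL backend installed the pinned Torch build):

# expect True followed by the number of visible NVIDIA GPUs
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"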

General gains in LLM ability from game RL training (paper-reported results):

[Table 4: Model performance on diverse tasks]

Examples of observed validation success rate curves:

[Figure: validation success rate curves]

Note: RL training results may fluctuate relative to reported results, but the overall trend and gains remain consistent.

Sokoban Agent Training:

bash examples/sokoban_ppo/qwen_7b.sh

Tetris Agent Training:

bash examples/tetris_ppo/qwen_7b.sh

Note: BirdAgent may wait on SQLite file readiness or locks; heavy SQL can stall rollouts and prolong validation.

Hardware Configuration

GRL supports both GPU and TPU training backends:

  • GPU (Torch + VERL): PyTorch-based training on NVIDIA GPUs via VERL.
  • TPU (JAX + Tunix): JAX-based training on Google TPU via Tunix.

GPU Configurations (Torch + VERL)

| GPU Type | GPUs | Agent Groups | Group Size | Total Agents | Default Model | Task |
| --- | --- | --- | --- | --- | --- | --- |
| A100 | 1 | 8 | 16 | 128 | Qwen/Qwen2.5-0.5B-Instruct | Sokoban |
| L40 | 1 | 4 | 8 | 32 | Qwen/Qwen2.5-0.5B-Instruct | Sokoban |
| A100 | 8 | 8 | 16 | 128 | Qwen/Qwen2.5-7B-Instruct | Sokoban |
| H200 | 4 | 8 | 16 | 128 | Qwen/Qwen2.5-7B-Instruct | Sokoban |
| A100 | 8 | 8 | 16 | 128 | Qwen/Qwen2.5-7B-Instruct | Tetris |

TPU Configurations (JAX + Tunix)

| TPU Type | Chips | Mesh (fsdp, tp) | Agent Groups | Group Size | Total Agents | Default Model | Task |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TPU v4 | 4 | (2,2) | 8 | 16 | 128 | Qwen/Qwen2.5-0.5B-Instruct | Sokoban |
| TPU v5p | 8 | (2,4) | 8 | 16 | 128 | Qwen/Qwen2.5-7B-Instruct | Sokoban |

Note: Total Agents = Agent Groups × Group Size. The framework scales automatically to the available hardware; adjust the parameters in the training scripts for best performance on your setup.

Supported Games and Agents

  • Sokoban: Puzzle‑solving requiring spatial reasoning (agent: sokobanAgent)
  • Tetris: Decision‑making and planning (agent: tetrisAgent)
  • GSM8K: Grade‑school math reasoning (agent: gsm8kAgent)
  • Blocksworld: Logical planning and manipulation (agent: blocksworldAgent)
  • WebShop: E‑commerce navigation and decision‑making (agent: webshopAgent)
  • BIRD (SQL): SQL query generation and database reasoning (agent: birdAgent)
  • AMC 2023: Competition math problems from AMC 2023 (agent: amc23Agent)
  • AIME 2024: Competition math problems from AIME 2024 (agent: aime24Agent)
  • AIME 2025: Competition math problems from AIME 2025 (agent: aime25Agent)
  • Minerva Math: Advanced math reasoning dataset (agent: minervamathAgent)
  • Math500: 500-problem subset of the MATH benchmark (agent: math500Agent)
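
To cross-check which agents exist in your checkout, you can grep the agent config. This is illustrative and assumes identifiers such as sokobanAgent appear literally in configs/agents.yaml:

# enumerate agent identifiers mentioned in the agents config
grep -oE '[A-Za-z0-9_]+Agent' configs/agents.yaml | sort -u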

Documentation

Acknowledgments

We gratefully acknowledge Tunix, a JAX-native LLM post-training library whose TPU support and JAX-first design enabled scalable multi-turn PPO training on TPU.

Our work is also powered by VERL, and we drew valuable insights from RAGEN that informed how we train multi-turn PPO in our experiments.

Citation

If you find this repository helpful, please cite:

@article{hu2025lmgame,
  title={lmgame-Bench: How Good are LLMs at Playing Games?},
  author={Hu, Lanxiang and Huo, Mingjia and Zhang, Yuxuan and Yu, Haoyang and Xing, Eric P and Stoica, Ion and Rosing, Tajana and Jin, Haojian and Zhang, Hao},
  journal={arXiv preprint arXiv:2505.15146},
  year={2025}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
