
🌍 GEM: A Gym for Agentic LLMs

Paper · Notion Blog · Hugging Face Collection · 🌐 Axon-RL Documentation · PyPI

Overview

We're entering the era of experience, where large language models (LLMs) learn not just from static datasets, but from interactive experience gathered in complex, expressive environments.

As a step toward this, we introduce GEM (a General Experience Maker for LLMs), an open-source environment suite designed for training agentic LLMs via online reinforcement learning.

Like OpenAI Gym for traditional RL, GEM provides a standardized API and a growing collection of diverse environments. It is training-framework-agnostic and integrates seamlessly with six popular RL training frameworks, including Oat and Tinker, offering:

  • 🧩 Clean, composable environment APIs
  • ⚙️ Async vectorized execution for high-throughput simulation (see the sketch after this list)
  • 🔧 Tool integration & custom wrappers
  • 🧠 Multi-environment training
  • 🎈 Ready-to-use benchmark environments and algorithms
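
As a rough illustration of what high-throughput rollout collection looks like, the sketch below runs several independent environment instances in parallel threads. It only uses the single-environment calls documented in the Interface section; GEM's built-in async/vectorized executor is a separate feature and may expose a different entry point.

import concurrent.futures

import gem

def rollout(env_id: str, max_steps: int = 32) -> float:
    # Collect one episode with a random policy and return its total reward.
    env = gem.make(env_id)
    observation, info = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = env.sample_random_action()  # swap in an LLM policy here
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward

# Eight episodes collected concurrently (illustrative; tune workers to taste).
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    returns = list(pool.map(rollout, ["game:GuessTheNumber-v0"] * 8))
print(returns)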


Installation

pip install -U gem-llm

Or install from source for the latest version:

git clone https://github.com/axon-rl/gem.git
cd gem
pip install -e .

Please check Getting Started for more setup details.

🔥 You can jump into examples to quickly start your agentic RL training with GEM & your favorite training framework.

Interface

GEM's interface closely follows the OpenAI Gym API. Here's an example using the game:GuessTheNumber-v0 environment:

import gem

# List all supported environments
gem.print_envs()

# Initialize the environment
env = gem.make("game:GuessTheNumber-v0")

# Reset the environment to generate the first observation
observation, info = env.reset()

# Start the agent-environment loop
while True:
    action = env.sample_random_action() # insert policy here, e.g.,
    # (pseudocode) action = llm.generate(observation)

    # apply action and receive next observation, reward
    # and whether the episode has ended
    next_observation, reward, terminated, truncated, info = env.step(action)
    print("OBS", observation)
    print("ACT", action)

    # update the policy (online) here
    # e.g., policy = learn(policy, observation, action, reward, info)

    observation = next_observation
    # Exit when the episode terminates
    if terminated or truncated:
        break
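
To make the pseudocode above concrete, the random action can be replaced by any text-in/text-out callable. The `generate` function below is a placeholder for your inference stack (vLLM, SGLang, an OpenAI-compatible client, ...), not part of GEM's API:

# Hypothetical LLM policy: any function mapping the text observation
# to a text action can drive the same loop.
def generate(prompt: str) -> str:
    # Replace this stub with a call to your inference backend and
    # return the raw completion text.
    return "..."

# In the loop above, swap the random action for the LLM policy:
# action = generate(observation)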

Features

  1. Environments consist of tasks and (optional) tools. Tool-calling is achieved via an environment wrapper, as demonstrated here; a minimal sketch also follows this list.
  2. GEM is training framework-agnostic, and we demonstrate its integration with six popular RL training frameworks.
  3. We provide implementations and benchmarking results for different algorithms across a diverse set of environments.
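
To make item 1 concrete, here is a minimal sketch of how tool-calling can be layered on as a wrapper around any environment. The class below is written from scratch for illustration; GEM ships its own tool wrappers, so the real class names, parsing rules, and sandboxing differ (see the linked example):

import re
import subprocess

import gem

class PythonToolWrapper:
    """Illustrative only: if the action contains a ```python ...``` block,
    execute it and feed the output back as the next observation; otherwise
    forward the action to the wrapped environment."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        match = re.search(r"```python\n(.*?)```", action, re.DOTALL)
        if match is None:
            return self.env.step(action)  # a normal (non-tool) turn
        result = subprocess.run(
            ["python", "-c", match.group(1)],
            capture_output=True, text=True, timeout=10,
        )
        # Tool turn: return the execution output, no reward, episode continues.
        return result.stdout + result.stderr, 0.0, False, False, {}

env = PythonToolWrapper(gem.make("math:Math12K"))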

Supported Tasks

| Category | Example Environments | Description |
| --- | --- | --- |
| Games | game:GuessTheNumber-v0-hard, game:Sudoku-v0-easy | Classic language games |
| Math | math:Math12K, math:DeepScaleR40K | Mathematical reasoning |
| Code | code:CodeContest, code:Taco8k | Competitive coding |
| QA | qa:NaturalQuestions, qa:HotpotQA | Knowledge-intensive question answering |
| ReasoningGym | rg:arc_1d, rg:letter_counting | Diverse synthetic reasoning tasks |
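
Every environment ID in the table above is created the same way, so switching tasks is a one-line change:

import gem

# Environments from different categories share the same reset/step interface.
math_env = gem.make("math:Math12K")
qa_env = gem.make("qa:HotpotQA")
rg_env = gem.make("rg:letter_counting")

observation, info = math_env.reset()
print(observation)  # the first observation, e.g. the task prompt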

Supported Tools

| Tool | Description |
| --- | --- |
| Python | Python code executor that parses code blocks, executes them, and returns outputs |
| Search | Calls a search engine to retrieve documents for any query |
| MCP | Calls the general MCP API to train tool-use agents |

Supported Frameworks

| Framework | Description |
| --- | --- |
| Oat | vLLM + DeepSpeed, modular, no Ray |
| Tinker | SDK provided by Thinking Machines, frees you from infra issues |
| Verl | Supports diverse backends, models, and algorithms |
| RL2 | SGLang + FSDP, no Ray, easy to hack |
| ROLL | Supports diverse backends, models, and algorithms |
| OpenRLHF | Supports diverse backends, models, and algorithms |

Examples of training agents on GEM environments with all of the above frameworks can be found here!

Supported Algorithms

| Algorithm | Description |
| --- | --- |
| REINFORCE | A general policy gradient algorithm that can be applied to single- and multi-turn environments |
| GRPO | Mainly for bandits (single-turn), using group advantage normalization |
| PPO | Learns a turn-level critic to compute generalized advantage estimation (GAE) |
| REINFORCE + ReBN | REINFORCE with return batch normalization, as introduced in our paper |

Please check out our paper for a more detailed description of each algorithm and empirical results showing their tradeoffs.
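
As a rough illustration of what return batch normalization (ReBN) adds on top of REINFORCE, the snippet below standardizes Monte Carlo returns across a batch before using them as advantages. This is a minimal sketch of the general idea, not the paper's exact implementation; see the paper for the precise formulation.

import torch

def rebn_advantages(returns: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Standardize returns across the batch so advantages have zero mean
    # and unit variance, which stabilizes the REINFORCE gradient.
    return (returns - returns.mean()) / (returns.std() + eps)

# REINFORCE loss with batch-normalized returns (sketch):
# loss = -(rebn_advantages(returns) * log_probs).mean()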

Contributing

We welcome all forms of contribution, from adding new environments to integrating additional training frameworks. We're planning to write a community-driven technical report, and major contributors will be recognized with authorship. Join our Discord to discuss more!

Acknowledgement

Citation

If you find our work useful for your research, please consider citing:

  • GEM paper (please prioritize citing the paper unless you believe the blog is a better fit):

    @article{liu2025gem,
      title={GEM: A Gym for Agentic LLMs},
      author={Liu, Zichen and Sims, Anya and Duan, Keyu and Chen, Changyu and Yu, Simon and Zhou, Xiangxin and Xu, Haotian and Xiong, Shaopan and Liu, Bo and Tan, Chenmien and others},
      journal={arXiv preprint arXiv:2510.01051},
      year={2025}
    }
  • GEM blog:

    @misc{liu2025gemblog,
      title={GEM: A Gym for Generalist LLMs},
      author={Liu, Zichen and Sims, Anya and Duan, Keyu and Chen, Changyu and Yang, Diyi and Lee, Wee Sun and Lin, Min},
      year={2025},
      howpublished={\url{https://axon-rl.notion.site/gem}},
      note={Notion Blog},
    }
