My implementations of RL algorithms such as GRPO and GSPO, with minimal code.
- Supported models
  - Qwen2/Qwen2.5/Qwen3 language models
  - Qwen2.5 vision-language models
- Supported algorithms
  - GRPO
  - Dr-GRPO
  - GSPO
  - KL-Conv
  - StableReinforce
- Supported tricks (combined in the loss sketch after this list)
  - Clip-Higher from DAPO
  - token-level policy loss
  - dual clip
  - KL term removal
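The tricks above all plug into the policy loss. Below is a minimal sketch, assuming PyTorch tensors of token log-probs; the function names, signatures, and clip constants are illustrative, not this repo's actual API.

```python
import torch

def group_advantages(rewards: torch.Tensor, group_size: int, eps: float = 1e-6):
    # GRPO: normalize rewards within each group of responses to the same prompt.
    # (Dr-GRPO drops the std division; vanilla GRPO form shown here.)
    r = rewards.view(-1, group_size)
    adv = (r - r.mean(dim=1, keepdim=True)) / (r.std(dim=1, keepdim=True) + eps)
    return adv.view(-1)

def policy_loss(
    logprobs,            # (B, T) token log-probs under the current policy
    old_logprobs,        # (B, T) token log-probs under the rollout policy
    advantages,          # (B,)  group-normalized advantages
    mask,                # (B, T) 1 for response tokens, 0 for padding
    clip_eps_low=0.2,    # lower clip bound
    clip_eps_high=0.28,  # Clip-Higher (DAPO): decoupled, larger upper bound
    dual_clip_c=3.0,     # dual-clip constant, applied when advantage < 0
):
    ratio = torch.exp(logprobs - old_logprobs)                    # (B, T)
    adv = advantages.unsqueeze(-1)                                # broadcast over T
    surrogate = torch.min(
        ratio * adv,
        torch.clamp(ratio, 1 - clip_eps_low, 1 + clip_eps_high) * adv,
    )
    # Dual clip: for negative advantages, bound how negative the surrogate can get,
    # so an exploding ratio cannot dominate the gradient.
    surrogate = torch.where(adv < 0, torch.max(surrogate, dual_clip_c * adv), surrogate)
    # Token-level policy loss (DAPO): average over all valid tokens in the batch,
    # not per-sequence means. No KL(pi || pi_ref) penalty is added ("KL term removal").
    return -(surrogate * mask).sum() / mask.sum()
```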
Install dependencies, then launch RL training on math:

```bash
pip install -r requirements.txt
bash scripts/run_agent_math.sh
```

Training data: https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k
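Rewards for math data like DAPO-Math-17k are typically rule-based: extract the final boxed answer and compare it to the reference. The repo's actual reward function isn't shown here, so the sketch below, including both helper names, is hypothetical.

```python
import re

def extract_boxed(text: str):
    # Hypothetical helper: return the content of the last \boxed{...} in the text.
    # (This simple regex does not handle nested braces.)
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def math_reward(response: str, ground_truth: str) -> float:
    # Binary rule-based reward: 1.0 iff the boxed answer matches the reference.
    pred = extract_boxed(response)
    if pred is None:
        return 0.0
    return 1.0 if pred.replace(" ", "") == ground_truth.strip().replace(" ", "") else 0.0
```

Results on math benchmarks (SFT baseline vs. after RL):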
| Model | MATH500 | AIME24 | AIME25 | HMMT Feb.25 | BeyondAIME | AVG |
|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct + sft | 0.619/0.884 | 0.105/0.5 | 0.065/0.4 | 0.047/0.367 | 0.028/0.25 | 0.173/0.48 |
| Qwen2.5-7B-Instruct + rl | 0.718/0.91 | 0.261/0.6 | 0.227/0.533 | 0.125/0.367 | 0.068/0.33 | 0.28/0.548 |
| Qwen3-4B-Instruct + sft | 0.777/0.916 | 0.268/0.7 | 0.202/0.333 | 0.130/0.5 | 0.109/0.44 | 0.297/0.578 |
| Qwen3-4B-Instruct + rl | 0.843/0.928 | 0.442/0.8 | 0.385/0.733 | 0.239/0.6 | 0.209/0.52 | 0.424/0.716 |

Knights-and-Knaves logic puzzles (data from Logic-RL):

```bash
bash scripts/run_logic.sh
```

Training data: https://github.com/Unakar/Logic-RL/tree/main/data/kk/instruct
| Size | Algorithm | Precision | LR | KL Coef. | Group Size | Steps | Test Score (before -> after) |
|---|---|---|---|---|---|---|---|
| 3B | GRPO | AMP | 1e-6 | 0 | 8 | 1600 | 0.12 -> 0.54 |
| 7B | GRPO | AMP | 1e-6 | 0 | 8 | 1350 | 0.23 -> 0.89 |
Accuracy by puzzle size (N ppl = puzzles involving N people):

| Model | 2 ppl | 3 ppl | 4 ppl | 5 ppl | 6 ppl | 7 ppl | 8 ppl |
|---|---|---|---|---|---|---|---|
| Qwen2.5-3B-Instruct | 0.37 | 0.13 | 0.17 | 0.12 | 0.04 | 0.02 | 0.02 |
| Qwen2.5-3B-Instruct-GRPO | 0.76 | 0.70 | 0.68 | 0.50 | 0.47 | 0.33 | 0.33 |
| Qwen2.5-7B-Instruct | 0.56 | 0.35 | 0.23 | 0.25 | 0.14 | 0.09 | 0.02 |
| Qwen2.5-7B-Instruct-GRPO | 0.97 | 0.96 | 0.96 | 0.94 | 0.88 | 0.79 | 0.72 |
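
For the Knights-and-Knaves runs, a Logic-RL-style reward checks both output format and the predicted role of every person. This is a sketch under that assumption; the tag names and score values are illustrative.

```python
import re

def kk_reward(response: str, gt_roles: dict) -> float:
    # Sketch of a Logic-RL-style reward: the model is assumed to reason inside
    # <think>...</think> and give its final answer inside <answer>...</answer>.
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m is None:
        return -1.0  # format violation
    answer = m.group(1)
    # gt_roles maps each person to "knight" or "knave", e.g. {"Zoey": "knight"}.
    for person, role in gt_roles.items():
        if not re.search(rf"{re.escape(person)}\s+is\s+a\s+{role}", answer, re.IGNORECASE):
            return 0.0  # well-formed answer, wrong assignment
    return 1.0
```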

Geometry3K (vision-language):

```bash
bash scripts/run_geometry3k.sh
```

Training data: https://huggingface.co/datasets/hiyouga/geometry3k
| Size | Algorithm | Precision | LR | KL Coef. | Group Size | Steps | Test Score (before -> after) |
|---|---|---|---|---|---|---|---|
| 3B | GRPO | AMP | 1e-6 | 0 | 8 | 700 | 0.24 -> 0.43 |
| 3B | GSPO | AMP | 1e-6 | 0 | 8 | 750 | 0.24 -> 0.43 |
| 3B | StableReinforce | AMP | 1e-6 | 0 | 12 | 1200 | 0.25 -> 0.44 |
| 3B | KL-Conv | AMP | 1e-6 | 0 | 12 | 900 | 0.23 -> 0.45 |
| 7B | GRPO | AMP | 1e-6 | 0 | 8 | 800 | 0.38 -> 0.50 |
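
The key difference between the GRPO and GSPO rows is where the importance ratio lives: GSPO clips one length-normalized sequence-level ratio per response instead of per-token ratios. A minimal sketch follows; the epsilon defaults are illustrative (the GSPO paper uses far smaller clip ranges than PPO-style 0.2).

```python
import torch

def gspo_loss(logprobs, old_logprobs, advantages, mask, eps_low=3e-4, eps_high=4e-4):
    # GSPO: one sequence-level importance ratio per response,
    #   s_i = exp( mean_t [ log pi(y_t) - log pi_old(y_t) ] ),
    # clipped as a whole rather than token by token.
    lengths = mask.sum(dim=-1).clamp(min=1)
    seq_ratio = torch.exp(((logprobs - old_logprobs) * mask).sum(dim=-1) / lengths)
    clipped = torch.clamp(seq_ratio, 1 - eps_low, 1 + eps_high)
    return -torch.min(seq_ratio * advantages, clipped * advantages).mean()
```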
- Roadmap
  - train on mathtrain with Dr-GRPO/GSPO/KL-Conv/StableReinforce algorithms
  - support dynamic sampling from DAPO (see the sketch after this list)
  - support PPO/REINFORCE++/RLOO
  - support vision-language models
  - support Retrieval-Augmented Reasoning
  - support agent training
  - support code eval
  - support vLLM inference parallelism (dp2, tp4)
  - add Ulysses sequence parallelism
  - support padding-free training
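Dynamic sampling from DAPO (second roadmap item) filters out prompt groups whose rewards are all identical, since their group-normalized advantages are zero and carry no gradient; DAPO keeps sampling until the batch is refilled with informative groups. A sketch of the filtering step, with illustrative names:

```python
import torch

def informative_groups(rewards: torch.Tensor, group_size: int) -> torch.Tensor:
    # DAPO dynamic-sampling sketch: keep only prompts whose response group has
    # non-identical rewards (neither all-correct nor all-wrong); zero-variance
    # groups yield zero group-normalized advantage and no learning signal.
    r = rewards.view(-1, group_size)
    return r.std(dim=1) > 0  # (num_prompts,) boolean keep-mask
```

In use, the boolean mask would select which prompt groups enter the update, and dropped prompts would be replaced by freshly sampled ones until the train batch is full.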