Official codebase for the ICLR 2026 Oral paper: Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling
DECS is a training and evaluation framework for reasoning models, with two core ideas:
- Decoupled Rewards to reduce inefficient overthinking behaviors during RL training.
- Curriculum Data Scheduling to improve stability and generalization by progressively controlling training data difficulty.
scripts/train/local/train_rl_chunk_local.sh: main DECS RL training script (chunk reward + decoupled configs)scripts/train/local/train_thinkprune_local.sh: thinkprune training scriptscripts/eval/local/sc_local.sh: standard self-consistency inference/evaluationscripts/eval/local/prolong_gen_local.sh: prolonged generation for long trajectoriesscripts/eval/local/test_local.sh: quick wrapper forsc_local.shscripts/eval/local/prolong_gen_test_local.sh: quick wrapper forprolong_gen_local.sh
First create a virtual environment:
conda create -n verl python=3.10
conda activate verlThen install vllm==0.8.5.post1 via
export VLLM_VERSION=0.8.5.post1
export CUDA_VERSION=124 # or other
export CPU_ARCH=$(uname -m) # x86_64 or aarch64
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu${CUDA_VERSION}-cp38-abi3-manylinux_2_35_${CPU_ARCH}.whl --extra-index-url https://download.pytorch.org/whl/cu${CUDA_VERSION}After that, install verl via
pip install -e .
We have organized the training data and evaluaion data in the data folder. For LiveCodeBench dataset, please refer to here for detailed configuration.
# Select GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Root directory for model checkpoints
export CHECKPOINT_ROOT=checkpointsDownload the NRP DETECTOR from https://huggingface.co/pixas/DECS_NRP_DETECTOR, put it at checkpoints directory and deploy it via
vllm serve --model checkpoints/DECS_NRP_DETECTOR --port 10041
After that, run the training script
bash scripts/train/local/train_rl_chunk_local.shCommon overrides:
MODEL_NAME=r1_distill_qwen1.5b \
DATA_NAME="deepscaler" \
ROLLOUT_N=16 \
CHUNK_JUDGE_URL=127.0.0.1:10041 \
bash scripts/train/local/train_rl_chunk_local.shTraining logs are written to:
logs/<data_tag>/<save_name>/train.log
Checkpoints are saved under (controlled by src_valid/config/ppo_trainer.yaml):
checkpoints/verl_math/<experiment_name>/global_step_*/actor
Assume your trained checkpoint is:
checkpoints/verl_math/<exp>/global_step_xxx/actor
MODEL_CKPT="checkpoints/verl_math/<exp>/global_step_xxx/actor"
HF_MODEL_PATH="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
bash scripts/eval/local/sc_local.sh \
"${MODEL_CKPT}" \
1 \
16 \
1 \
"--prompt_type instruct_default --max_new_tokens 16384 --hf_model_path ${HF_MODEL_PATH}"Parameter meanings:
- arg2 =
chunk_num - arg3 =
sc_size - arg4 =
tp_size
Default dataset in sc_local.sh is math. To evaluate multiple datasets:
export DATASETS="aime2024 aime2025 amc23 math"bash scripts/eval/local/prolong_gen_local.sh \
"${MODEL_CKPT}" \
16 \
32768 \
"--max_new_tokens 32768 --hf_model_path ${HF_MODEL_PATH}"Standard SC outputs:
results/<dataset>/<model_name>_sc<k>/cache.jsonlresults/<dataset>/<model_name>_sc<k>/result.json
Prolonged-generation outputs:
results/<dataset>/<model_name>_sc<k>_prolong<length>/cache.jsonlresults/<dataset>/<model_name>_sc<k>_prolong<length>/result.json
bash scripts/eval/local/test_local.sh <model_ckpt_path> <hf_model_path> [tp_size]bash scripts/eval/local/prolong_gen_test_local.sh <model_ckpt_path> <hf_model_path> [sc_size] [prolong_length]If you find our work useful, please cite our work as:
@inproceedings{jiang2026decs,
title = {Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling},
author = {Jiang, Shuyang and Tao, Xiaofeng and Zhang, Kui and Xiao, Yanghua},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026},
note = {Oral},
url = {https://arxiv.org/abs/2509.25827}
}