Anonymous code release for literature-grounded scientific insight prediction.
Insight anticipation asks a model to predict a downstream paper's core contribution from the summaries of its two foundational parent papers. This repository contains the training and reward code used for our reinforcement-learning experiments, packaged as a clean, submission-ready release on top of the verl training stack.
| Component | Purpose | Entry point |
|---|---|---|
| SFT | Bootstrap a policy on parent-summary to insight pairs | scripts/train_sft.sh |
| GRPO | Optimize the policy with similarity-based rewards | scripts/grpo/train_grpo.sh |
| Reward judge | Score generated insights against ground truth | verl/utils/reward_score/insight_similarity/compute_score.py |
| Core training stack | Distributed training and rollout infrastructure | verl/ |
The full reproduction path is four steps. Each step assumes you are in the repo root with the insight-anticipation conda env activated (see Step 1).
Step 1: Install environment
Step 2: Prepare SFT + RL data from Hugging Face
Step 3: Run SFT to get a bootstrapped policy
Step 4: Run GRPO from the SFT checkpoint
Requires Python 3.10 and CUDA 12.4 (the pinned torch==2.6.0 and flashinfer wheels are cu124-specific).
conda create -n insight-anticipation python=3.10 -y
conda activate insight-anticipation
pip install -r requirements.txt
pip install -e .All package versions are pinned in requirements.txt to match the env used for the paper. If flash_attn or xformers fail to build, install torch==2.6.0 first, then re-run pip install -r requirements.txt.
The public dataset lives at giants2026/GiantsBench-train. If it is gated, authenticate first with huggingface-cli login or export HF_TOKEN=....
SFT splits (query / completion columns):
python scripts/prepare_sft_data.py \
--dataset giants2026/GiantsBench-train \
--output-dir data/insight_anticipation_sftGRPO splits (verl RL schema: prompt / reward_model.ground_truth / extra_info):
python scripts/prepare_rl_data.py \
--dataset giants2026/GiantsBench-train \
--output-dir data/insight_anticipation_grpo \
--drop-empty-insightsprepare_rl_data.py wraps each query into a single-turn chat prompt and extracts the <insight>...</insight> block from completion as reward_model.ground_truth. Both scripts auto-carve a deterministic 3% test split (seed 42) when the source has no test split; override via --test-size / --seed if needed.
You should end up with:
data/
insight_anticipation_sft/
train.parquet
test.parquet
insight_anticipation_grpo/
train.parquet
test.parquet
See verl/utils/dataset/README.md for the base RL dataset contract.
4-GPU SFT launcher:
BASE_MODEL=Qwen/Qwen3-4B \
TRAIN_DATA_DIR=$PWD/data/insight_anticipation_sft \
EVAL_DATA_DIR=$PWD/data/insight_anticipation_sft \
GPU_IDS=0,1,2,3 \
EXPERIMENT_NAME=qwen3-4b-sft \
TRAINER_DEFAULT_LOCAL_DIR=$PWD/outputs/sft \
bash scripts/train_sft.shKey knobs:
BASE_MODEL— Hugging Face model ID or local checkpoint.GPU_IDS— Comma-separated GPU list.TRAINER_DEFAULT_LOCAL_DIR— Output directory for checkpoints and logs.TRAINER_LOGGERS— Hydra list string, e.g.['console']or['console','wandb'].
The resulting checkpoint path (under TRAINER_DEFAULT_LOCAL_DIR/EXPERIMENT_NAME) becomes the BASE_MODEL for Step 4.
Before launching, configure the reward judge (Gemini via API key or Vertex AI):
# Option A: Gemini API key
export GEMINI_API_KEY=...
# Option B: Vertex AI
export GOOGLE_CLOUD_PROJECT=your-project
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.jsonThen launch the 4-GPU GRPO run from the SFT checkpoint:
BASE_MODEL=$PWD/outputs/sft/qwen3-4b-sft \
TRAIN_DATA_DIR=$PWD/data/insight_anticipation_grpo \
EVAL_DATA_DIR=$PWD/data/insight_anticipation_grpo \
GPU_IDS=0,1,2,3 \
EXPERIMENT_NAME=qwen3-4b-grpo-similarity \
ROLLOUT_TP_SIZE=1 \
bash scripts/grpo/train_grpo.shExposed hyperparameters:
MAX_PROMPT_LENGTH,MAX_MODEL_LENROLLOUT_N,ROLLOUT_TP_SIZEACTOR_LR,TRAIN_BATCH_SIZE,TOTAL_TRAINING_STEPS
Optional reward controls:
INSIGHT_SIMILARITY_MODEL— defaults togemini-2.5-flash.INSIGHT_SIMILARITY_MAX_TOKENS— defaults to8192.INSIGHT_SIMILARITY_DEBUG_DIR— writes prompt/response traces for debugging.
This project builds on the verl reinforcement-learning stack. Upstream license and notices are preserved in this repository.