Skip to content

HKUDS/LightReasoner

Repository files navigation

lightreasoner-logo
πŸ’‘ LightReasoner: Can SMALL Language Models Teach LARGE Language Models Reasoning?

Welcome banner



Figure 1: LightReasoner delivers superior performance with remarkable token efficiency - achieving consistent improvements in zero-shot pass@1 accuracy while dramatically reducing computational overhead by 90% in total time, 80% in sampled problems, and 99% in tuned tokens compared to traditional SFT.

πŸ’‘ Key Insight:

This efficiency breakthrough shows that strategic token selection, rather than exhaustive training, most effectively unlocks the latent potential of LLM reasoning β€” proving that smarter, not blindly harder is the path to scalable AI improvement.


πŸŽ‰ News

  • [2025/10/14] πŸš€ New Release: LRsamples β€” Pre-collected LightReasoner training samples ready for immediate fine-tuning. This dataset enables direct model training without requiring the full sampling pipeline, streamlining reproduction efforts and accelerating downstream research workflows.
  • [2025/10/14] πŸš€ New Release: LightReasoner Enhanced Models now available on πŸ€— Hugging Face Hub. Ready-to-use models fine-tuned with our efficient reasoning enhancement approach for immediate deployment and experimentation.
  • [2025/10/12] πŸš€ New Release: Core implementation with Qwen2.5-Math and DeepSeek-R1 models.

⚑ TL;DR

✨ LightReasoner ✨ flips the script on AI training β€” small language models (SLMs) don’t just learn from large ones (LLMs); they can actually teach LLMs better and faster!

πŸ”₯ The Challenge:

Supervised Fine-Tuning (SFT) struggles with three core bottlenecks:

  • πŸ“Š Data-Intensive: Relies on human-labeled or rejection-sampled datasets.

  • βš–οΈ Uniform Learning: Trains all tokens equally, even though only a small portion truly matter.

  • πŸ”— Ground-Truth Dependency: Hinders adaptability to new domains and reasoning formats.

πŸ” Key Insight:

We allocate 90% of compute to what models already know, while under-investing in the critical 10% that truly drives breakthroughs.

πŸ“ˆ LightReasoner: Better and Faster

Tested across 7 benchmarks Γ— 5 models

πŸš€ Performance Gains

LightReasoner consistently boosts reasoning accuracy across multiple datasets:

  • πŸ“ˆ Qwen2.5-Math-1.5B: +28.1% on GSM8K, +25.1% on MATH, +7.2% on SVAMP, +11.7% on ASDIV

  • πŸ“ˆ DeepSeek-R1-Distill-Qwen-1.5B: +4.3% on GSM8K, +6.0% on MATH, +17.4% on OlympiadBench

  • πŸ“ˆ Qwen2.5-Math-7B: +10.4% on GSM8K, +6.0% on MATH, +9.3% on SVAMP, +7.9% on ASDIV

  • πŸ“ˆ Qwen2.5-Math-1.5B-Instruct: +1.9% on GSM8K, +2.6% on Minerva Math

  • 🌍 Strong generalization: Trained only on GSM8K, yet improves across 7 benchmarks

⚑ Efficiency Breakthrough

Taking Qwen2.5-Math-1.5B as an example, LightReasoner achieves dramatic efficiency gains compared with SFT:

  • ⏱️ 90% less total time: 4 hours β†’ 0.5 hours

  • 🧾 80% fewer sampled problems: 3,952 β†’ 1,000 problems

  • πŸ”’ 99% fewer tuned tokens: 1.77M β†’ 20K tokens

🌟 Key Features

  • 🎯 SLM–LLM Teaching:

    Counterintuitively uses smaller β€œamateur” models to identify critical reasoning moments where stronger β€œexpert” models should focus their learning.

  • ⚑ Extreme Token Efficiency:

    Achieves 99% fewer tuned tokens than SFT by selectively optimizing high-impact reasoning steps instead of training uniformly on full trajectories.

  • πŸ”„ Three-Stage Lightweight Framework:

    (1) Critical step selection via Expert-Amaeteur KLD detection

    (2) Contrastive supervision capturing expert-amateur behavioral differentials

    (3) Self-distillation for internalizing expert strengths

  • πŸ“ˆ KL-Guided Learning:

    Leverages behavioral divergence between expert and amateur predictions to pinpoint reasoning bottlenecks β€” all without requiring ground-truth labels.

  • 🧠 Expertise Over Scale:

    Demonstrates that domain expertise gaps, rather than model size, drive effective contrast β€” even same-sized models with different knowledge can generate powerful teaching signals.


🧩 LightReasoner Framework


Figure 2: Overview of the LightReasoner framework. (1) Sampling Stage: Expert and Amateur models generate distributions Ο€E and Ο€A. Informative step selection retains steps with DKL(Ο€E βˆ₯ Ο€A) > Ξ², and contrastive supervision constructs soft labels vC capturing the Expert's advantage through Expert–Amateur contrast. (2) Fine-tuning Stage: The Expert model is enhanced by minimizing the KL divergence between its output and vC.


πŸš€ Quick Start

LightReasoner is incredibly easy to use. We’ve designed it to be accessible β€” so anyone can try it out and experience its β€œcounterintuitive effectiveness” firsthand. No sweat β€” you’ll have it set up and running with your model of choice in just a few πŸͺ„ simple steps below!

πŸ“¦ Get Ready

git clone https://github.com/HKUDS/LightReasoner.git
cd LightReasoner

1️⃣ Install all dependencies:

pip install -r requirements.txt

2️⃣ Download the Expert and Amateur models of your choice. For example:

πŸ¦‰ Expert Model

huggingface-cli download Qwen/Qwen2.5-Math-1.5B --local-dir ./Qwen2.5-Math-1.5B

🐣 Amateur Model

huggingface-cli download Qwen/Qwen2.5-0.5B --local-dir ./Qwen2.5-0.5B

3️⃣ Prepare the training data:

python data_prep.py

⚠️ Caveat

LightReasoner relies on Expert–Amateur model pairing to generate supervision signals. Thus, the choice of this pair is crucial to the method’s success.

βš–οΈ Rule of Thumb:

The Expert should significantly outperform the Amateur, while the Amateur must remain competent enough to produce coherent reasoning. In practice, performance peaks at a balanced β€œsweet spot” rather than simply widening the capability gap.

In our experiments, the Experts include Qwen2.5-Math-1.5B, 7B, their Instruct counterparts, and DeepSeek-R1-Distill variants. The Amateur is fixed as Qwen2.5-0.5B, which offers strong contrast while maintaining sufficient reasoning ability to yield meaningful signals.

You’re encouraged to explore other model families (e.g., Llama), but keep this balance principle in mind when setting up your Expert–Amateur collaboration.

πŸ“‹ Note

  • We use GSM8K by default for its emphasis on step-by-step, broadly applicable logical reasoning rather than domain-specific notation. This ensures that the Amateur, despite lacking math-specific training, can still produce interpretable outputs suitable for contrastive supervision.

  • You’re absolutely free to try other datasets β€” LightReasoner is fully adaptable. However, depending on your dataset, you may need to adjust hyperparameters and the choice of Amateur model to ensure stable training and meaningful contrasts.

    • For instance, if you experiment with the MATH dataset β€” a collection of high-school competition problems that are significantly harder than GSM8K β€” it’s recommended to upgrade the Amateur model from a generic Qwen2.5 base model to the specialized Qwen2.5-Math variant. The base models were not math-pretrained and may struggle to produce coherent outputs on MATH, potentially destabilizing the expert–amateur contrast.

    • The balance principle still applies here β€” the Amateur should be adequately weaker than the Expert to produce a clear contrast, yet capable enough to maintain coherent reasoning.


🎯 Sampling

This step builds the LightReasoner supervision dataset for downstream fine-tuning. Steps with high Expert-Amateur KLD are retained. These selected steps are transformed into supervision examples that encode the Expert’s strengths through distributional contrast. For full details, please see our paper.

python LightR_sampling.py --max_questions 1000

πŸ“‹ Note

Before running the script, you should:

  • Update the config section with your own relative paths.

  • Adjust the maximum number of problems to control the size of your supervision dataset, tweak the sampling parameters to explore more optimal combinations, and tune the batch size based on your available compute resources.

    • To give you a rough picture, in practice, we find that sampling 1,000 problems from the GSM8K training set (with the filtering threshold Ξ² = 0.4) yields approximately 20,000 LightReasoner contrastive samples, which is already sufficient for LoRA fine-tuning to converge on the baseline models we tested.

⚑ Shortcut

To save you the trouble of running the sampling pipeline β€” which, even though much lighter and easier with LightReasoner, can still be daunting for those without ample compute power β€” we now provide ready-to-go LightReasoner samples that let you jump straight to the fine-tuning stage! πŸš€

You can find the following pre-collected LightReasoner sampling datasets in the zip file under LRsamples:

  • LR_Qwen7_gsm8k β€” for Qwen2.5-Math-7B

  • LR_ds1.5_gsm8k β€” for DeepSeek-R1-Distill-Qwen-1.5B

  • LR_Qwen1.5_gsm8k β€” for Qwen2.5-Math-1.5B

    • We provide two versions, one sampled with Torch 3.1 and another with Torch 3.8, as we found that the sampling results (i.e., the model’s generated outputs) can slightly vary across Torch versions.

    • The performance fluctuation is minimal β€” typically within 2–3%, with later Torch versions usually performing slightly better.

These datasets make it much easier to reproduce our results directly β€” no additional sampling required! ✨


βš™οΈ Fine-tuning

This step launches the full LightReasoner fine-tuning pipeline β€” combining dataset loading, LoRA configuration, and contrastive KLD training into a unified workflow.

πŸ’» Run Options

Foreground (simple run):

python LightR_finetuning.py

Background (recommended for long training):

nohup python LightR_finetuning.py > finetune.log 2>&1 &

Monitor progress:

tail -f finetune.log

⚠️ Caveat

The expert model used for fine-tuning must be identical to the one used during sampling β€” this alignment is essential for correct behavior.

πŸ“‹ Note

Before running the script, edit the config section to match your setup:

  • πŸ”Ή Replace <path_to_expert_model> with your base model path (e.g., "./Qwen2.5-Math-7B" or a local folder).

  • πŸ”Ή Replace <path_to_training_dataset> with your dataset JSONL file.

  • πŸ”Ή Replace <output_directory> with the directory where checkpoints and the final model will be saved.

  • πŸ”Ή Set torch_dtype according to your hardware (e.g., torch.bfloat16 for H100, torch.float16 for A100).


πŸ”— Model Merging

Use this step to merge the full model (base + LoRA) locally, so it behaves as a standalone model without any LoRA dependency.

python merge.py

πŸ“‹ Note

Before running the merge script, update the config section with your own paths:

  • πŸ”Ή base_model_path to your base model directory (e.g., ./Qwen2.5-Math-7B)

  • πŸ”Ή lora_ckpt_path to your LoRA checkpoint directory (e.g., ./ft_qw7_gsm8k/checkpoint-1000)

  • πŸ”Ή merged_model_path to where you want the merged model to be saved (e.g., ./ft-7B-merged)


πŸ“ˆ Evaluation

All evaluations are performed using the official Qwen2.5-Math toolkit.

Please refer to the evaluation folder for detailed usage and setup instructions.


πŸ“Š Main Results

Model GSM8K MATH SVAMP ASDiv Minerva Math Olympiad Bench MMLU STEM AVG.
Qwen2.5-Math-1.5B
Baseline 42.5 34.2 68.8 68.1 9.9 23.7 49.8 42.4
+ SFT 69.2 57.1 64.1 70.2 15.1 27.6 47.7 50.1
+ LightR 70.6 59.3 76.0 79.8 11.4 27.1 54.9 54.2
Qwen2.5-Math-1.5B-Instruct
Baseline 84.8 75.8 94.2 94.7 29.4 37.5 57.4 67.7
+ SFT 85.4 75.8 93.5 94.7 31.6 37.5 56.2 67.8
+ LightR 86.7 75.5 93.0 94.1 32.0 37.8 55.2 67.8
DeepSeek-R1-Distill-Qwen-1.5B
Baseline 75.2 54.2 79.9 84.9 16.2 19.1 22.3 50.3
+ SFT 78.2 60.3 81.5 87.4 18.4 21.2 26.2 53.3
+ LightR 79.5 60.2 83.5 87.5 18.0 36.5 26.2 55.9
Qwen2.5-Math-7B
Baseline 57.5 51.8 67.9 72.7 14.0 16.0 69.8 50.0
+ SFT 64.4 63.3 76.2 76.6 12.1 20.5 68.5 54.5
+ LightR 67.9 57.8 77.2 80.6 12.1 16.9 70.5 54.7
Qwen2.5-Math-7B-Instruct
Baseline 95.2 83.2 93.9 95.3 33.8 41.5 69.3 73.2
+ SFT 95.4 83.1 94.1 95.2 38.2 40.7 68.2 73.6
+ LightR 95.8 83.6 93.1 95.2 34.2 39.0 67.8 72.7
  • Trained solely on GSM8K, LightReasoner generalizes effectively for 5 baseline models, achieving consistent gains across 7 benchmarks.

  • +28.1% on GSM8K, +25.1% on MATH, +7.2% on SVAMP, +11.7% on ASDIV for Qwen2.5-Math-1.5B.

  • +4.3% on GSM8K, +6.0% on MATH, +17.4% on OlympiadBench for DeepSeek-R1-Distill-Qwen-1.5B.

  • +10.4% on GSM8K, +6.0% on MATH, +9.3% on SVAMP, +7.9% on ASDIV for Qwen2.5-Math-7B.

  • Efficiency vs. SFT: 90% less total time, 80% fewer sampled problems, 99% fewer tuned tokens.


⏱️ Efficiency Study

Method Total Time Sampled Problems Tuned Tokens Average Gain
Qwen2.5-Math-1.5B
+ SFT 4.0h 3952 1.77M +7.7%
+ LightReasoner 0.5h 1000 0.02M +11.8%
Qwen2.5-Math-7B
+ SFT 9.5h 6029 2.20M +4.5%
+ LightReasoner 0.75h 1000 0.02M +4.7%
DeepSeek-R1-Distill-Qwen-1.5B
+ SFT 3.6h 6023 5.95M +3.0%
+ LightReasoner 0.5h 1000 0.02M +5.6%
Qwen2.5-Math-1.5B-Instruct
+ SFT 3.4h 7153 2.08M +0.1%
+ LightReasoner 0.4h 1000 0.02M +0.1%
  • πŸ§‘β€πŸ« Supervised Fine-Tuning (SFT):

    • Implemented with rejection sampling, where models are fine-tuned on demonstrations of correct reasoning trajectories.

    • For a fair comparison, SFT adopts the same experimental configuration as LightReasoner, performing LoRA-based fine-tuning exclusively on the GSM8K training set.

    • 🎯 Key Difference:

      • LightReasoner trains on selective next-token predictions, whereas SFT optimizes over full reasoning trajectories β€” an inherent difference dictated by their respective training paradigms.

      • Thus, each LightReasoner training instance corresponds to a single next-token prediction, whereas each SFT example corresponds to a full reasoning trajectory comprising a consecutive series of next-token predictions.

  • πŸ“ˆ Efficiency Evaluation:

    • ⏱️ Time Budget β€” Sampling time plus fine-tuning time, measured on a single NVIDIA H200 GPU without inference accelerators (e.g., vLLM).

    • πŸ“˜ Training Instances β€” Number of distinct GSM8K training set problems used to generate the supervision dataset.

    • πŸ”’ Tuned Tokens β€” Computational overhead measured at the token level.


Figure 3: LightReasoner matches or surpasses SFT performance with remarkable resource efficiency β€” achieving competitive accuracy while cutting training time by 90%, reducing sampled problems by 80%, and requiring 99% fewer tuned tokens.

πŸ’‘ Key Insight:

This marks a fundamental shift in how models are trained β€” targeting critical reasoning steps outperforms brute-force learning, making high-quality AI training achievable even with limited computational resources.


🧠 Expertise-Driven Contrast

Amateur Model Perf. Gap GSM8K MATH SVAMP ASDiv MMLU STEM AVG.
Expert: Qwen2.5-Math-1.5B
Qwen2.5-0.5B 38.2 70.6 59.3 76.0 79.8 54.9 68.1
Qwen2.5-1.5B 35.1 63.4 57.1 69.7 75.7 54.8 64.1
Qwen2.5-Math-1.5B / / / / / / /
Qwen2.5-Math-1.5B-Ins -42.3 41.4 35.5 67.5 66.4 55.0 53.2
Expert Only (Baseline) / 42.5 34.2 68.8 68.1 49.8 52.7
Expert: Qwen2.5-Math-7B
Qwen2.5-0.5B 53.2 67.9 57.8 77.2 80.6 70.5 70.8
Qwen2.5-1.5B 50.1 69.0 56.0 77.6 78.9 69.5 70.2
Qwen2.5-Math-1.5B 15.0 56.9 50.2 63.5 63.4 70.7 60.9
Qwen2.5-Math-1.5B-Ins -27.3 59.4 49.0 68.3 69.6 70.3 63.3
Expert Only (Baseline) / 57.5 51.8 67.9 72.7 69.8 63.9
  • Domain Expertise over Scale: The success of Expert–Amateur collaboration is driven most effectively by domain-specific knowledge rather than model size (e.g., Qwen2.5-Math-1.5B vs. Qwen2.5-1.5B), freeing LightReasoner from rigid scaling constraints.

  • Dependence on Expertise Gap: Performance gains are closely correlated with the size of the expertise gap β€” as the Amateur approaches the Expert’s capability, contrastive signals weaken and improvements diminish.


πŸ” More Insights

Sampling Stage Fine-tuning Stage

πŸ‘ˆ Figure 4(a): Expert–Amateur Pairing Effects β€” Each point represents a fixed Expert model paired with an Amateur model. The performance gains achieved by LightReasoner diminish as the expertise gap narrows.

πŸ‘‰ Figure 4(b): Impact of Ablation β€” Removing key components from LightReasoner consistently reduces performance, revealing their critical contributions.


πŸ† Comparison with Competing Methods

Attribute Time SFT LightR
Full trajectories ⬆️ βœ… ❌
All-token tuning ⬆️ βœ… ❌
Prefix termination ⬇️ ❌ βœ…
Selective tokens ⬇️ ❌ βœ…
Verification-free ⬇️ ❌ βœ…
Attribute Utility CD LightR
Contrast usage / Inference Training
Size-based contrast ⬇️ βœ… ❌
Expertise contrast ⬆️ ❌ βœ…
Persistent benefits ⬆️ ❌ βœ…
Standalone inference ⬆️ ❌ βœ…
  • πŸ‘ˆ Left: Efficiency contrasts at a glance. ⬆️ and ⬇️ indicate whether each aspect helps or hurts the overall efficiency of the method.

  • πŸ‘‰ Right: Key differences between traditional Contrastive Decoding (CD) methods and LightReasoner. ⬆️ and ⬇️ indicate whether each aspect helps or hurts the practicality of the method.


β˜•οΈ Citation

If you find this work useful, please consider citing our paper:

@article{wang2025lightreasoner,
  title={LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?},
  author={Wang, Jingyuan and Chen, Yankai and Li, Zhonghang and Huang, Chao},
  journal={arXiv preprint arXiv:2510.07962},
  year={2025}
}

Thank you for your interest in our work!


πŸ“œ License

This project is released under the MIT License.


❀️ Thanks for visiting ✨ LightReasoner ✨

Views

About

"LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?"

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages