Jingyuan Wang · Yankai Chen · Zhonghang Li · Chao Huang
Figure 1: LightReasoner delivers superior performance with remarkable token efficiency, achieving consistent improvements in zero-shot pass@1 accuracy while dramatically reducing computational overhead: 90% less total time, 80% fewer sampled problems, and 99% fewer tuned tokens compared to traditional SFT.
💡 Key Insight:
This efficiency breakthrough shows that strategic token selection, rather than exhaustive training, most effectively unlocks the latent potential of LLM reasoning, proving that training smarter, not blindly harder, is the path to scalable AI improvement.
- [2025/10/14] 🎉 New Release: LRsamples, pre-collected LightReasoner training samples ready for immediate fine-tuning. This dataset enables direct model training without requiring the full sampling pipeline, streamlining reproduction efforts and accelerating downstream research workflows.
- [2025/10/14] 🎉 New Release: LightReasoner Enhanced Models, now available on the 🤗 Hugging Face Hub. Ready-to-use models fine-tuned with our efficient reasoning enhancement approach for immediate deployment and experimentation.
- [2025/10/12] 🎉 New Release: Core implementation with Qwen2.5-Math and DeepSeek-R1 models.
✨ LightReasoner ✨ flips the script on AI training: small language models (SLMs) don't just learn from large ones (LLMs); they can actually teach LLMs better and faster!
🔥 The Challenge:
Supervised Fine-Tuning (SFT) struggles with three core bottlenecks:
- 📊 Data-Intensive: Relies on human-labeled or rejection-sampled datasets.
- ⚖️ Uniform Learning: Trains all tokens equally, even though only a small portion truly matter.
- 🎯 Ground-Truth Dependency: Hinders adaptability to new domains and reasoning formats.
💡 Key Insight:
We allocate 90% of compute to what models already know, while under-investing in the critical 10% that truly drives breakthroughs.
Tested across 7 benchmarks × 5 models
🚀 Performance Gains
LightReasoner consistently boosts reasoning accuracy across multiple datasets:
- 📈 Qwen2.5-Math-1.5B: +28.1% on GSM8K, +25.1% on MATH, +7.2% on SVAMP, +11.7% on ASDIV
- 📈 DeepSeek-R1-Distill-Qwen-1.5B: +4.3% on GSM8K, +6.0% on MATH, +17.4% on OlympiadBench
- 📈 Qwen2.5-Math-7B: +10.4% on GSM8K, +6.0% on MATH, +9.3% on SVAMP, +7.9% on ASDIV
- 📈 Qwen2.5-Math-1.5B-Instruct: +1.9% on GSM8K, +2.6% on Minerva Math
- 🌍 Strong generalization: Trained only on GSM8K, yet improves across 7 benchmarks
⚡ Efficiency Breakthrough
Taking Qwen2.5-Math-1.5B as an example, LightReasoner achieves dramatic efficiency gains compared with SFT:
- ⏱️ 90% less total time: 4 hours → 0.5 hours
- 🧾 80% fewer sampled problems: 3,952 → 1,000 problems
- 🔢 99% fewer tuned tokens: 1.77M → 20K tokens
🌟 Key Features
- 🎯 SLM→LLM Teaching: Counterintuitively uses smaller "amateur" models to identify critical reasoning moments where stronger "expert" models should focus their learning.
- ⚡ Extreme Token Efficiency: Achieves 99% fewer tuned tokens than SFT by selectively optimizing high-impact reasoning steps instead of training uniformly on full trajectories.
- 🔄 Three-Stage Lightweight Framework: (1) critical step selection via Expert-Amateur KL-divergence detection, (2) contrastive supervision capturing Expert-Amateur behavioral differentials, and (3) self-distillation for internalizing expert strengths.
- 📐 KL-Guided Learning: Leverages behavioral divergence between expert and amateur predictions to pinpoint reasoning bottlenecks, all without requiring ground-truth labels.
- 🧠 Expertise Over Scale: Demonstrates that domain expertise gaps, rather than model size, drive effective contrast; even same-sized models with different knowledge can generate powerful teaching signals.
Figure 2: Overview of the LightReasoner framework. (1) Sampling Stage: Expert and Amateur models generate distributions π_E and π_A. Informative step selection retains steps with D_KL(π_E ∥ π_A) > β, and contrastive supervision constructs soft labels v_C capturing the Expert's advantage through Expert-Amateur contrast. (2) Fine-Tuning Stage: The Expert model is enhanced by minimizing the KL divergence between its output and v_C.
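For readers who want the two stages in code, below is a minimal PyTorch sketch of the sampling-stage math in Figure 2. The helper name and the exact form of v_C are our illustrative assumptions (see the paper for the precise construction); only the filtering rule D_KL(π_E ∥ π_A) > β is taken directly from the figure.

```python
import torch
import torch.nn.functional as F

def select_and_contrast(expert_logits, amateur_logits, beta=0.4, alpha=1.0):
    # expert_logits, amateur_logits: [num_steps, vocab_size] next-token
    # logits from the Expert and Amateur on the same partial solutions.
    log_pi_e = F.log_softmax(expert_logits, dim=-1)
    log_pi_a = F.log_softmax(amateur_logits, dim=-1)
    pi_e = log_pi_e.exp()

    # Step selection: per-step D_KL(pi_E || pi_A); keep informative steps.
    kl = (pi_e * (log_pi_e - log_pi_a)).sum(dim=-1)
    keep = kl > beta

    # Contrastive soft labels v_C. One plausible form: a renormalized
    # Expert-over-Amateur log-ratio (the paper's exact definition may differ).
    v_c = F.softmax(log_pi_e[keep] - alpha * log_pi_a[keep], dim=-1)
    return keep, v_c

# Fine-tuning stage: minimize KL(v_C || pi_theta) at the selected positions,
# e.g. F.kl_div(model_log_probs, v_c, reduction="batchmean").
```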
LightReasoner is incredibly easy to use. We've designed it to be accessible, so anyone can try it out and experience its "counterintuitive effectiveness" firsthand. No sweat: you'll have it set up and running with your model of choice in just a few 💪 simple steps below!
```bash
git clone https://github.com/HKUDS/LightReasoner.git
cd LightReasoner
```

1️⃣ Install all dependencies:

```bash
pip install -r requirements.txt
```

2️⃣ Download the Expert and Amateur models of your choice. For example:

🦁 Expert Model

```bash
huggingface-cli download Qwen/Qwen2.5-Math-1.5B --local-dir ./Qwen2.5-Math-1.5B
```

🐣 Amateur Model

```bash
huggingface-cli download Qwen/Qwen2.5-0.5B --local-dir ./Qwen2.5-0.5B
```

3️⃣ Prepare the training data:

```bash
python data_prep.py
```

LightReasoner relies on Expert-Amateur model pairing to generate supervision signals, so the choice of this pair is crucial to the method's success.
⚖️ Rule of Thumb:
The Expert should significantly outperform the Amateur, while the Amateur must remain competent enough to produce coherent reasoning. In practice, performance peaks at a balanced "sweet spot" rather than simply widening the capability gap.
In our experiments, the Experts include Qwen2.5-Math-1.5B, 7B, their Instruct counterparts, and DeepSeek-R1-Distill variants. The Amateur is fixed as Qwen2.5-0.5B, which offers strong contrast while maintaining sufficient reasoning ability to yield meaningful signals.
You're encouraged to explore other model families (e.g., Llama), but keep this balance principle in mind when setting up your Expert-Amateur collaboration.
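Before committing to a pair, it can help to eyeball how strongly the two models actually disagree. The snippet below is a quick sanity check we suggest, not part of the released scripts; the prompt and paths are placeholders, and it assumes both models share a tokenizer (true within the Qwen2.5 family).

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

expert_path, amateur_path = "./Qwen2.5-Math-1.5B", "./Qwen2.5-0.5B"
tok = AutoTokenizer.from_pretrained(expert_path)
expert = AutoModelForCausalLM.from_pretrained(expert_path, torch_dtype=torch.bfloat16)
amateur = AutoModelForCausalLM.from_pretrained(amateur_path, torch_dtype=torch.bfloat16)

prompt = "A box holds 12 pencils. How many pencils are in 7 boxes? Let's think step by step."
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    log_pe = F.log_softmax(expert(ids).logits[0, -1].float(), dim=-1)
    log_pa = F.log_softmax(amateur(ids).logits[0, -1].float(), dim=-1)

# D_KL(pi_E || pi_A) at the next-token position: near-zero values on typical
# steps suggest the pair is too similar to contrast; huge values everywhere
# may mean the Amateur is too weak to stay coherent.
print((log_pe.exp() * (log_pe - log_pa)).sum().item())
```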
- We use GSM8K by default for its emphasis on step-by-step, broadly applicable logical reasoning rather than domain-specific notation. This ensures that the Amateur, despite lacking math-specific training, can still produce interpretable outputs suitable for contrastive supervision.
- You're absolutely free to try other datasets; LightReasoner is fully adaptable. However, depending on your dataset, you may need to adjust hyperparameters and the choice of Amateur model to ensure stable training and meaningful contrasts.
- For instance, if you experiment with the MATH dataset (a collection of high-school competition problems that are significantly harder than GSM8K), it is recommended to upgrade the Amateur from a generic Qwen2.5 base model to the specialized Qwen2.5-Math variant. The base models were not math-pretrained and may struggle to produce coherent outputs on MATH, potentially destabilizing the Expert-Amateur contrast.
- The balance principle still applies here: the Amateur should be adequately weaker than the Expert to produce a clear contrast, yet capable enough to maintain coherent reasoning.
This step builds the LightReasoner supervision dataset for downstream fine-tuning. Steps with high Expert-Amateur KL divergence are retained and transformed into supervision examples that encode the Expert's strengths through distributional contrast. For full details, please see our paper.

```bash
python LightR_sampling.py --max_questions 1000
```

Before running the script, you should:
- Update the config section with your own relative paths (a hypothetical sketch follows this list).
- Adjust the maximum number of problems to control the size of your supervision dataset, tweak the sampling parameters to explore more optimal combinations, and tune the batch size based on your available compute resources.
- To give you a rough picture: in practice, we find that sampling 1,000 problems from the GSM8K training set (with the filtering threshold β = 0.4) yields approximately 20,000 LightReasoner contrastive samples, which is already sufficient for LoRA fine-tuning to converge on the baseline models we tested.
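For orientation, the config section you edit might look like the sketch below. Every variable name here is a hypothetical stand-in; mirror whatever `LightR_sampling.py` actually defines.

```python
# Illustrative values only; the real names live in LightR_sampling.py's config section.
EXPERT_MODEL_PATH  = "./Qwen2.5-Math-1.5B"   # Expert used for sampling
AMATEUR_MODEL_PATH = "./Qwen2.5-0.5B"        # Amateur providing the contrast
OUTPUT_PATH        = "./LR_samples.jsonl"    # where supervision samples land
BETA               = 0.4                     # KL filtering threshold used in our GSM8K runs
MAX_QUESTIONS      = 1000                    # ~20K contrastive samples on GSM8K
BATCH_SIZE         = 8                       # scale to your GPU memory
```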
To save you the trouble of running the sampling pipeline (which, even though much lighter and easier with LightReasoner, can still be daunting for those without ample compute power), we now provide ready-to-go LightReasoner samples that let you jump straight to the fine-tuning stage! 🎉
You can find the following pre-collected LightReasoner sampling datasets in the zip file under `LRsamples`:
- `LR_Qwen7_gsm8k`: for Qwen2.5-Math-7B
- `LR_ds1.5_gsm8k`: for DeepSeek-R1-Distill-Qwen-1.5B
- `LR_Qwen1.5_gsm8k`: for Qwen2.5-Math-1.5B
- We provide two versions, one sampled with Torch 3.1 and another with Torch 3.8, as we found that the sampling results (i.e., the model's generated outputs) can vary slightly across Torch versions.
- The performance fluctuation is minimal, typically within 2-3%, with later Torch versions usually performing slightly better.
- These datasets make it much easier to reproduce our results directly, with no additional sampling required (a quick loading example follows this list)! ✨
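Once unzipped, the files can be inspected or loaded like any JSONL dataset. The path below assumes an `LRsamples/` folder and is only an example:

```python
from datasets import load_dataset

ds = load_dataset("json", data_files="LRsamples/LR_Qwen1.5_gsm8k.jsonl", split="train")
print(len(ds))       # roughly 20K contrastive samples for this model
print(ds[0].keys())  # inspect the schema before launching fine-tuning
```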
This step launches the full LightReasoner fine-tuning pipeline, combining dataset loading, LoRA configuration, and contrastive KL-divergence training into a unified workflow.
Foreground (simple run):

```bash
python LightR_finetuning.py
```

Background (recommended for long training):

```bash
nohup python LightR_finetuning.py > finetune.log 2>&1 &
```

Monitor progress:

```bash
tail -f finetune.log
```

The Expert model used for fine-tuning must be identical to the one used during sampling; this alignment is essential for correct behavior.
Before running the script, edit the config section to match your setup:
- 🔹 Replace `<path_to_expert_model>` with your base model path (e.g., `"./Qwen2.5-Math-7B"` or a local folder).
- 🔹 Replace `<path_to_training_dataset>` with your dataset JSONL file.
- 🔹 Replace `<output_directory>` with the directory where checkpoints and the final model will be saved.
- 🔹 Set `torch_dtype` according to your hardware (e.g., `torch.bfloat16` for H100, `torch.float16` for A100). A config sketch follows this list.
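Concretely, the edited config might look like this (hypothetical variable names; match them to the script's actual config section):

```python
import torch

model_path   = "./Qwen2.5-Math-7B"        # <path_to_expert_model>; must match sampling
dataset_path = "./LR_Qwen7_gsm8k.jsonl"   # <path_to_training_dataset>
output_dir   = "./ft_qw7_gsm8k"           # <output_directory> for checkpoints
torch_dtype  = torch.bfloat16             # bfloat16 on H100; torch.float16 on A100
```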
Use this step to merge the full model (base + LoRA) locally, so it behaves as a standalone model without any LoRA dependency.

```bash
python merge.py
```

Before running the merge script, update the config section with your own paths:
- 🔹 `base_model_path`: your base model directory (e.g., `./Qwen2.5-Math-7B`)
- 🔹 `lora_ckpt_path`: your LoRA checkpoint directory (e.g., `./ft_qw7_gsm8k/checkpoint-1000`)
- 🔹 `merged_model_path`: where you want the merged model to be saved (e.g., `./ft-7B-merged`)
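For reference, merging a LoRA adapter into its base model follows a standard PEFT pattern; the sketch below shows the flow `merge.py` presumably implements, reusing the example paths above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_path   = "./Qwen2.5-Math-7B"
lora_ckpt_path    = "./ft_qw7_gsm8k/checkpoint-1000"
merged_model_path = "./ft-7B-merged"

# Load the base model, attach the LoRA adapter, fold the adapter weights
# into the base weights, and save a standalone checkpoint.
base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, lora_ckpt_path)
merged = model.merge_and_unload()
merged.save_pretrained(merged_model_path)

# Ship the tokenizer alongside so the merged folder is self-contained.
AutoTokenizer.from_pretrained(base_model_path).save_pretrained(merged_model_path)
```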
All evaluations are performed using the official Qwen2.5-Math toolkit.
Please refer to the evaluation folder for detailed usage and setup instructions.
| Model | GSM8K | MATH | SVAMP | ASDiv | Minerva Math | Olympiad Bench | MMLU STEM | AVG. |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-1.5B | ||||||||
| Baseline | 42.5 | 34.2 | 68.8 | 68.1 | 9.9 | 23.7 | 49.8 | 42.4 |
| + SFT | 69.2 | 57.1 | 64.1 | 70.2 | 15.1 | 27.6 | 47.7 | 50.1 |
| + LightR | 70.6 | 59.3 | 76.0 | 79.8 | 11.4 | 27.1 | 54.9 | 54.2 |
| Qwen2.5-Math-1.5B-Instruct | ||||||||
| Baseline | 84.8 | 75.8 | 94.2 | 94.7 | 29.4 | 37.5 | 57.4 | 67.7 |
| + SFT | 85.4 | 75.8 | 93.5 | 94.7 | 31.6 | 37.5 | 56.2 | 67.8 |
| + LightR | 86.7 | 75.5 | 93.0 | 94.1 | 32.0 | 37.8 | 55.2 | 67.8 |
| DeepSeek-R1-Distill-Qwen-1.5B | ||||||||
| Baseline | 75.2 | 54.2 | 79.9 | 84.9 | 16.2 | 19.1 | 22.3 | 50.3 |
| + SFT | 78.2 | 60.3 | 81.5 | 87.4 | 18.4 | 21.2 | 26.2 | 53.3 |
| + LightR | 79.5 | 60.2 | 83.5 | 87.5 | 18.0 | 36.5 | 26.2 | 55.9 |
| Qwen2.5-Math-7B | ||||||||
| Baseline | 57.5 | 51.8 | 67.9 | 72.7 | 14.0 | 16.0 | 69.8 | 50.0 |
| + SFT | 64.4 | 63.3 | 76.2 | 76.6 | 12.1 | 20.5 | 68.5 | 54.5 |
| + LightR | 67.9 | 57.8 | 77.2 | 80.6 | 12.1 | 16.9 | 70.5 | 54.7 |
| Qwen2.5-Math-7B-Instruct | ||||||||
| Baseline | 95.2 | 83.2 | 93.9 | 95.3 | 33.8 | 41.5 | 69.3 | 73.2 |
| + SFT | 95.4 | 83.1 | 94.1 | 95.2 | 38.2 | 40.7 | 68.2 | 73.6 |
| + LightR | 95.8 | 83.6 | 93.1 | 95.2 | 34.2 | 39.0 | 67.8 | 72.7 |
- Trained solely on GSM8K, LightReasoner generalizes effectively across 5 baseline models, achieving consistent gains on 7 benchmarks.
- +28.1% on GSM8K, +25.1% on MATH, +7.2% on SVAMP, +11.7% on ASDIV for Qwen2.5-Math-1.5B.
- +4.3% on GSM8K, +6.0% on MATH, +17.4% on OlympiadBench for DeepSeek-R1-Distill-Qwen-1.5B.
- +10.4% on GSM8K, +6.0% on MATH, +9.3% on SVAMP, +7.9% on ASDIV for Qwen2.5-Math-7B.
- Efficiency vs. SFT: 90% less total time, 80% fewer sampled problems, 99% fewer tuned tokens.
| Method | Total Time | Sampled Problems | Tuned Tokens | Average Gain |
|---|---|---|---|---|
| Qwen2.5-Math-1.5B | ||||
| + SFT | 4.0h | 3952 | 1.77M | +7.7% |
| + LightReasoner | 0.5h | 1000 | 0.02M | +11.8% |
| Qwen2.5-Math-7B | ||||
| + SFT | 9.5h | 6029 | 2.20M | +4.5% |
| + LightReasoner | 0.75h | 1000 | 0.02M | +4.7% |
| DeepSeek-R1-Distill-Qwen-1.5B | ||||
| + SFT | 3.6h | 6023 | 5.95M | +3.0% |
| + LightReasoner | 0.5h | 1000 | 0.02M | +5.6% |
| Qwen2.5-Math-1.5B-Instruct | ||||
| + SFT | 3.4h | 7153 | 2.08M | +0.1% |
| + LightReasoner | 0.4h | 1000 | 0.02M | +0.1% |
- 🧑‍🏫 Supervised Fine-Tuning (SFT):
  - Implemented with rejection sampling, where models are fine-tuned on demonstrations of correct reasoning trajectories.
  - For a fair comparison, SFT adopts the same experimental configuration as LightReasoner, performing LoRA-based fine-tuning exclusively on the GSM8K training set.
- 🎯 Key Difference:
  - LightReasoner trains on selective next-token predictions, whereas SFT optimizes over full reasoning trajectories; this is an inherent difference dictated by their respective training paradigms.
  - Thus, each LightReasoner training instance corresponds to a single next-token prediction, whereas each SFT example corresponds to a full reasoning trajectory comprising a consecutive series of next-token predictions (see the toy sketch after this list).
- 📊 Efficiency Evaluation:
  - ⏱️ Time Budget: sampling time plus fine-tuning time, measured on a single NVIDIA H200 GPU without inference accelerators (e.g., vLLM).
  - 📚 Training Instances: number of distinct GSM8K training-set problems used to generate the supervision dataset.
  - 🔢 Tuned Tokens: computational overhead measured at the token level.
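To make the granularity difference above concrete, here is a toy sketch with made-up tensors, purely for illustration (nothing here comes from the repo's code):

```python
import torch
import torch.nn.functional as F

V = 8                                    # tiny vocabulary for the sketch
logits_step = torch.randn(V)             # model logits at ONE selected step
v_c = F.softmax(torch.randn(V), dim=-1)  # a LightReasoner soft label v_C

# LightReasoner: one training instance = KL loss at a single position.
loss_lr = F.kl_div(F.log_softmax(logits_step, dim=-1), v_c, reduction="sum")

# SFT: one training example = cross-entropy at EVERY position of a trajectory.
T = 300                                  # tokens in one reasoning trajectory
logits_traj = torch.randn(T, V)
targets = torch.randint(0, V, (T,))
loss_sft = F.cross_entropy(logits_traj, targets)

print(loss_lr.item(), loss_sft.item())
```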
Figure 3: LightReasoner matches or surpasses SFT performance with remarkable resource efficiency, achieving competitive accuracy while cutting training time by 90%, reducing sampled problems by 80%, and requiring 99% fewer tuned tokens.
💡 Key Insight:
This marks a fundamental shift in how models are trained: targeting critical reasoning steps outperforms brute-force learning, making high-quality AI training achievable even with limited computational resources.
| Amateur Model | Perf. Gap | GSM8K | MATH | SVAMP | ASDiv | MMLU STEM | AVG. |
|---|---|---|---|---|---|---|---|
| Expert: Qwen2.5-Math-1.5B | |||||||
| Qwen2.5-0.5B | 38.2 | 70.6 | 59.3 | 76.0 | 79.8 | 54.9 | 68.1 |
| Qwen2.5-1.5B | 35.1 | 63.4 | 57.1 | 69.7 | 75.7 | 54.8 | 64.1 |
| Qwen2.5-Math-1.5B | / | / | / | / | / | / | / |
| Qwen2.5-Math-1.5B-Ins | -42.3 | 41.4 | 35.5 | 67.5 | 66.4 | 55.0 | 53.2 |
| Expert Only (Baseline) | / | 42.5 | 34.2 | 68.8 | 68.1 | 49.8 | 52.7 |
| Expert: Qwen2.5-Math-7B | |||||||
| Qwen2.5-0.5B | 53.2 | 67.9 | 57.8 | 77.2 | 80.6 | 70.5 | 70.8 |
| Qwen2.5-1.5B | 50.1 | 69.0 | 56.0 | 77.6 | 78.9 | 69.5 | 70.2 |
| Qwen2.5-Math-1.5B | 15.0 | 56.9 | 50.2 | 63.5 | 63.4 | 70.7 | 60.9 |
| Qwen2.5-Math-1.5B-Ins | -27.3 | 59.4 | 49.0 | 68.3 | 69.6 | 70.3 | 63.3 |
| Expert Only (Baseline) | / | 57.5 | 51.8 | 67.9 | 72.7 | 69.8 | 63.9 |
- Domain Expertise over Scale: The success of Expert-Amateur collaboration is driven most effectively by domain-specific knowledge rather than model size (e.g., Qwen2.5-Math-1.5B vs. Qwen2.5-1.5B), freeing LightReasoner from rigid scaling constraints.
- Dependence on Expertise Gap: Performance gains are closely correlated with the size of the expertise gap; as the Amateur approaches the Expert's capability, contrastive signals weaken and improvements diminish.
📊 Figure 4(a): Expert-Amateur Pairing Effects. Each point represents a fixed Expert model paired with an Amateur model. The performance gains achieved by LightReasoner diminish as the expertise gap narrows.
📊 Figure 4(b): Impact of Ablation. Removing key components from LightReasoner consistently reduces performance, revealing their critical contributions.
- 📊 Left: Efficiency contrasts at a glance. ⬆️ and ⬇️ indicate whether each aspect helps or hurts the overall efficiency of the method.
- 📊 Right: Key differences between traditional Contrastive Decoding (CD) methods and LightReasoner. ⬆️ and ⬇️ indicate whether each aspect helps or hurts the practicality of the method.
If you find this work useful, please consider citing our paper:
```bibtex
@article{wang2025lightreasoner,
  title={LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?},
  author={Wang, Jingyuan and Chen, Yankai and Li, Zhonghang and Huang, Chao},
  journal={arXiv preprint arXiv:2510.07962},
  year={2025}
}
```

Thank you for your interest in our work!
This project is released under the MIT License.