A curated collection of research papers on complex reasoning with language models.
- [OlymMATH]: Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models. Paper, Code
TLDR: OlymMATH is a rigorously curated, bilingual Olympiad-level math benchmark comprising 200 printed-source problems—100 “easy” AIME-level and 100 “hard” beyond-AIME—equally sampled across algebra, geometry, number theory, and combinatorics, all reformulated into pure text and restricted to real-number or interval answers to enable automated, rule-based verification.
Keywords: 2025, arXiv, Mathematical Reasoning, #Problems=200, Numerical Answer, Olympiad-Level, Multi-lingual
- [Tina]: Tina: Tiny Reasoning Models via LoRA. Paper, Code
TLDR: LoRA excels at quickly learning the structural/format requirements of multi-step reasoning (e.g., step-by-step chains) while largely preserving the base model’s world knowledge—enabling “less compute, more performance”.
Keywords: 2025, arXiv, LoRA, Reinforcement Learning, DeepSeek-R1-Distill-Qwen-1.5B, Mathematical Reasoning, Scientific Reasoning, GRPO
- [SRPO]: SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM. Paper, Code
TLDR: SRPO uses a two-stage training paradigm: the first stage trains on mathematical data to develop reasoning skills, and the second adds coding data to build proficiency in procedural thinking. A key innovation is History Resampling (HR), which improves training efficiency by filtering out “easy” samples so that each batch retains meaningful gradients (see the sketch below this entry).
Keywords: 2025, arXiv, Qwen-2.5-32B-Base, Mathematical Reasoning, Code Generation, Reinforcement Learning, GRPO
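A minimal sketch of history-resampling-style filtering, assuming GRPO-style rollouts with binary rewards; the data layout and function names are illustrative, not the paper’s implementation:

```python
from typing import Dict, List

def resample_prompts(rollout_rewards: Dict[str, List[float]]) -> List[str]:
    """Keep only prompts that still carry learning signal.

    In GRPO-style training, a prompt whose rollouts all received reward 1.0
    yields zero within-group advantage and therefore no gradient, so the
    all-correct ("easy") prompts are dropped for the next epoch.
    """
    kept = []
    for prompt_id, rewards in rollout_rewards.items():
        if rewards and all(r >= 1.0 for r in rewards):  # solved every time -> skip
            continue
        kept.append(prompt_id)
    return kept

# "p1" was always solved, "p2" only sometimes, "p3" never.
history = {"p1": [1.0, 1.0, 1.0], "p2": [1.0, 0.0, 0.0], "p3": [0.0, 0.0, 0.0]}
print(resample_prompts(history))  # ['p2', 'p3']
```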
- [DeepScaleR]: DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL. Blog, Code
TLDR: Start RL at the base model’s context window, then increase to 16K tokens after 1,040 steps, and finally to 24K tokens at 1,520 steps to stabilize training under longer contexts (see the schedule sketch below this entry). Combining high-quality SFT distillation with RL scaling is what truly unlocks the reasoning potential of LLMs.
Keywords: 2025, Deepseek-R1-Distilled-Qwen-1.5B, Mathematical Reasoning, Reinforcement Learning, GRPO
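A minimal sketch of the staged context-length schedule described above; the 8,192-token starting window is an assumed placeholder, since the entry only says training starts at the base model’s context window:

```python
def max_response_tokens(step: int, initial_window: int = 8192) -> int:
    """Maximum generation length allowed at a given RL step."""
    if step < 1040:
        return initial_window  # stage 1: short-context RL
    if step < 1520:
        return 16384           # stage 2: 16K tokens after step 1,040
    return 24576               # stage 3: 24K tokens from step 1,520 onward

for s in (0, 1040, 1520):
    print(s, max_response_tokens(s))  # 8192, 16384, 24576
```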
- [RLVR]: Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains. Paper, Code
TLDR: Expert-written reference answers improve the accuracy and reliability of evaluating model-generated responses across diverse domains. A soft reward function based on the generative reward model’s token probabilities adds granularity beyond binary judgments (see the sketch below this entry). Together these extend RLVR to complex, unstructured domains such as medicine and economics, using a compact (7B-scale) generative reward model.
Keywords: 2025, arXiv, Complex Reasoning, Reinforcement Learning, GRPO, Qwen2.5-7B-Instruct, Reward
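A minimal sketch of a probability-based soft reward versus a binary one, assuming the generative reward model emits a single judgment token (e.g., “Yes” vs. “No”) whose log-probabilities are available; names and the exact normalization are illustrative:

```python
import math

def soft_reward(logprob_yes: float, logprob_no: float) -> float:
    """Soft reward: normalize the two judgment-token probabilities into [0, 1]."""
    p_yes, p_no = math.exp(logprob_yes), math.exp(logprob_no)
    return p_yes / (p_yes + p_no)

def hard_reward(logprob_yes: float, logprob_no: float) -> float:
    """Binary counterpart: 1 if "Yes" outweighs "No", else 0."""
    return 1.0 if logprob_yes > logprob_no else 0.0

print(round(soft_reward(-0.2, -1.8), 3))  # 0.832: confident but not certain
print(hard_reward(-0.2, -1.8))            # 1.0: the binary judgment loses that nuance
```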
- [Heimdall]: Heimdall: test-time scaling on the generative verification. Paper
TLDR: Heimdall reframes solution verification as an RL-trained chain-of-thought task—enabling both a standalone verifier and a tight solver–verifier loop that matches state-of-the-art math contest performance—by combining forward checking (step-by-step validation) with backward checking (testing conclusions against known constraints).
Keywords: 2025, arXiv, Complex Reasoning, Reinforcement Learning, DAPO, Verifier
- [Open-RS]: Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't. Paper, Code
TLDR: Small language models can efficiently improve their reasoning skills through reinforcement learning with minimal data and compute. Training is stabilized by mixing easy and hard problems, and output length is managed with a cosine reward (see the sketch below this entry), though complex tasks may require longer context limits.
Keywords: 2025, arXiv, Reinforcement Learning, DeepSeek-R1-Distill-Qwen-1.5B, Mathematical Reasoning, GRPO
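A minimal sketch of a cosine length reward in the spirit described above; the constants and the correct/incorrect asymmetry are illustrative assumptions, not the paper’s exact recipe:

```python
import math

def cosine_reward(correct: bool, gen_len: int, max_len: int,
                  r_correct=(2.0, 1.0),    # (short, long) reward when correct
                  r_wrong=(-10.0, 0.0)):   # (short, long) reward when wrong
    """Cosine interpolation of the reward between its short- and long-response values.

    Short correct answers earn the most; short wrong answers are penalized the
    hardest, which discourages rambling when the model already has the answer
    and encourages longer thinking when it does not.
    """
    short_r, long_r = r_correct if correct else r_wrong
    t = min(gen_len, max_len) / max_len  # normalized length in [0, 1]
    return long_r + 0.5 * (short_r - long_r) * (1.0 + math.cos(t * math.pi))

print(round(cosine_reward(True, 200, 4096), 2))   # ~1.99: short and correct
print(round(cosine_reward(True, 4096, 4096), 2))  # 1.0: correct but at the length limit
print(round(cosine_reward(False, 200, 4096), 2))  # ~-9.94: short and wrong
```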
- [Limit-of-RLVR]: Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? Paper, Code
TLDR: The study questions whether Reinforcement Learning with Verifiable Rewards (RLVR) enhances LLM reasoning beyond base models. It finds RLVR boosts initial performance but restricts exploration, limiting its ability to surpass base models at larger sampling sizes. Distillation from stronger models proves more effective in expanding reasoning boundaries. Thus, RLVR falls short in significantly advancing LLM reasoning capabilities compared to alternative training approaches.
Keywords: 2025, arXiv, Reinforcement Learning, Qwen2.5-7B, Mathematical Reasoning, GRPO, Code Generation, Visual Reasoning
- [Reflection]: Rethinking Reflection in Pre-Training. Paper, Code
TLDR: The ability of LLMs to reflect and self-correct emerges during pre-training, not only during reinforcement learning or fine-tuning.
Keywords: 2025, arXiv, Reinforcement Learning, Mathematical Reasoning, Code Generation, Logical Reasoning
- [SimpleRL-Zoo]: SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild. Paper, Code
TLDR: Fine-tuning on chain-of-thought examples gives models a quick boost but later holds them back, constraining exploration and preventing deeper reasoning from emerging. Responses also grow needlessly long, so length is not a reliable sign of insight. Enforcing rigid output formats (e.g., boxed answers) further suppresses exploration, lowers the performance ceiling, and encourages overthinking. Finally, the training data's difficulty must match what the base model can handle, or zero-RL training fails.
Keywords: 2025, arXiv, Reinforcement Learning, Mathematical Reasoning, GRPO, Qwen2.5 models
- [Qwen3]: Qwen3: Think Deeper, Act Faster. Blog, Code
TLDR: To develop a hybrid model adept at both detailed reasoning and quick responses, the team executed a four-stage training pipeline: (1) they initiated training via a long chain-of-thought (CoT) cold-start phase, fine-tuning the model with diverse reasoning data across math, coding, and logic; (2) next, they conducted reasoning-based reinforcement learning (RL) to refine the model's exploration and problem-solving abilities; (3) subsequently, they fused rapid-response capabilities into the model by fine-tuning it with combined CoT and standard instruction-tuning data; and (4) finally, general RL was applied across a wide range of real-world tasks to enhance the model’s overall performance, ensuring reliability and mitigating undesirable behaviors.
Keywords: 2025, Reinforcement Learning, Qwen3, Thinking Mode
- pass@k: Given a problem, we sample k outputs from the model. The pass@k value for this question is 1 if at least one of the k samples passes verification; otherwise, it is 0.
- perplexity: Given a model $m$, a problem $x$, and a response $\mathbf{Y}=(y_1,\ldots,y_T)$, the perplexity is defined as the exponentiated average negative log-likelihood of the sequence:
$\mathrm{PPL}_m(\mathbf{Y}|x)=\exp\left(-\frac{1}{T}\sum_{t=1}^T\log P(y_t|x,y_1,\ldots,y_{t-1})\right)$
which reflects the model’s ability to predict the given response $\mathbf{Y}$ conditioned on the prompt $x$. Lower perplexity indicates that the model has a higher likelihood of generating this response.
- Incorrect to Correct Rate (ICR): The rate at which the model successfully corrects an initially incorrect answer into a correct final answer.
- Correct to Incorrect Rate (CIR): The rate at which the model incorrectly alters an initially correct answer into an incorrect final answer.
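A minimal Python sketch of the metrics defined above, assuming per-problem lists of verification flags, per-token log-probabilities, and (initially correct, finally correct) answer pairs; the data layout and names are illustrative:

```python
import math
from typing import List, Tuple

def pass_at_k(sample_passes: List[bool]) -> int:
    """pass@k for one problem: 1 if any of the k sampled outputs passes verification."""
    return 1 if any(sample_passes) else 0

def perplexity(token_logprobs: List[float]) -> float:
    """PPL_m(Y|x): exponentiated average negative log-likelihood of the response tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def icr_cir(transitions: List[Tuple[bool, bool]]) -> Tuple[float, float]:
    """ICR / CIR over (initially_correct, finally_correct) pairs."""
    init_wrong = [final for first, final in transitions if not first]
    init_right = [final for first, final in transitions if first]
    icr = sum(init_wrong) / max(len(init_wrong), 1)                 # incorrect -> correct
    cir = sum(not f for f in init_right) / max(len(init_right), 1)  # correct -> incorrect
    return icr, cir

print(pass_at_k([False, False, True, False]))     # 1: at least one sample verified
print(round(perplexity([-0.1, -0.3, -0.05]), 3))  # 1.162: lower means the response is more likely
print(icr_cir([(False, True), (True, True), (True, False), (False, False)]))  # (0.5, 0.5)
```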