Quasar 3.0: Golden Formula in Reasoning Models
Eyad Gomaa (SILX AI), Edvard Castell (Kagari Systems)
Partner: Lambda Cloud
Abstract
Quasar 3.0 introduces a new training pipeline, TTM Training, that lets models recognize important tokens and assign them higher attention. This research aims to discover the golden formula for solving problems efficiently. Beginning with long-context CoT reasoning models, we apply Reinforcement Learning (RL) with GRPO to enable the model to dynamically find the minimal reasoning length required to reach an optimal solution. By avoiding unnecessary complexity and preventing overthinking, Quasar 3.0 enhances problem-solving efficiency while maintaining high accuracy. This breakthrough paves the way for more intelligent, concise, and effective reasoning in LLMs.
1 Introduction
Recent advancements in reasoning models, such as the DeepSeek R1 series, have significantly boosted intelligence but
at the cost of high token usage. Studies show that the more tokens a model generates, the more likely it is to hallucinate
or overthink simple problems, leading to inefficiencies and increased expenses.
We solved this problem by discovering the golden formula for problem-solving: reducing reasoning length while preserving accuracy. This breakthrough minimizes costs and introduces a new training pipeline, TTM Training. TTM enables models to distinguish important tokens from less relevant ones, assigning higher temperature (and therefore attention) to critical information, optimizing reasoning efficiency, and ensuring more intelligent decision-making.
We were able to:
• Create a new training pipeline that provides a free +10 boost on all benchmarks, leading to better generalization
and more room for improvement, achieved at a cost of just $20 in GPU hours using a single H100 GPU.
• Develop an RL formula for cheaper, faster reasoning models by optimizing reasoning length and reducing
unnecessary token usage.
• Mitigate overthinking and hallucination in reasoning models: too few input tokens cause hallucinations as the model fills in gaps, while too many lead to overthinking and excessive token generation. Our approach balances input length, ensuring efficient and accurate reasoning.
Being built by just two people made open-sourcing everything even more challenging. However, we welcome contributions from the community to help refine and enhance TTM Training.
If you’re passionate about optimizing reasoning efficiency in LLMs, join us and contribute to the project here:
https://github.com/SILX-LABS/TTM. Let’s push the boundaries of AI together!
2 Benchmarks
[Bar chart: benchmark scores on AIME 2024, Math 500, GPQA Diamond, and LiveCodeBench for Quasar 3.0, DeepSeek-R1-Distill-Qwen-7B, Quasar 3.0 (TTM + Qwen-7B), and Qwen-7B; y-axis: Score.]
Note: This is a distilled model from the 400B-parameter Quasar 3.0. Datasets and the model for the larger version will be available soon.
3 Conclusion
Training with TTM improves model accuracy by 5-10% over any baseline by allowing the model to identify and prioritize important tokens during prediction. This improves problem-solving efficiency while maintaining precision.
In particular, the TTM stage was trained in just 3 hours on a single H100 GPU, for a total training cost of $26 (approximately $9.75 per model). This demonstrates its efficiency, making it a cost-effective yet highly impactful improvement to reasoning models.
Token Temperature Mechanism (TTM) as a Training Framework
In the paper Guidance is All You Need [1], the Token Temperature Mechanism (TTM) is introduced; it helps models identify hot tokens (important tokens) and cold tokens (less relevant tokens).
We extend this idea by developing TTM as a training framework, optimizing how models assign attention based on
token importance.
3.1 How TTM Works
The process begins with an input sequence passing through the Temperature Layer, where the model determines
which tokens are critical for reasoning. Instead of treating all tokens equally, the model assigns an importance score
to each token based on multiple factors, including local patterns captured through convolutions, token frequency, and
positional significance in the sequence.
This importance score directly influences the token’s assigned temperature, where higher temperatures correspond to
greater relevance in the reasoning process. By dynamically adjusting attention to focus on essential tokens, the model
avoids unnecessary complexity, optimizes reasoning steps, and improves overall efficiency.
Prompt: "How many r's in strawberry?"

Token Temperature Mechanism assignment:
how (Medium), many (Cold), r's (Hot), in (Cold), strawberry (Hot)
3.2 TTM Algorithm
Let x_i be the i-th token in a sequence of length n. The importance score S(x_i) is computed as:

S(x_i) = \alpha \cdot f(x_i) + \beta \cdot p(x_i) + \gamma \cdot C(x_i)    (1)

where:
• f(x_i) is the frequency of token x_i in the dataset.
• p(x_i) represents the positional encoding function.
• C(x_i) captures local patterns using convolution-based features.
• α, β, γ are scaling hyperparameters.

The temperature assignment for each token is then:

T(x_i) = \frac{S(x_i)}{\sum_{j=1}^{n} S(x_j)}    (2)

where T(x_i) represents the relative importance (temperature) of token x_i, ensuring that higher-scoring tokens receive greater attention during prediction.
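To make the mechanism concrete, the following Python sketch evaluates Equations (1) and (2) on a toy sequence. The feature functions standing in for f, p, and C, and the values of α, β, γ, are illustrative assumptions rather than the released implementation.

# Sketch of Equations (1)-(2): per-token importance scores normalized into temperatures.
from collections import Counter

def importance_scores(tokens, alpha=1.0, beta=0.5, gamma=0.5):
    # S(x_i) = alpha*f(x_i) + beta*p(x_i) + gamma*C(x_i); f, p, C are placeholder stand-ins here.
    n = len(tokens)
    counts = Counter(tokens)
    max_len = max(len(t) for t in tokens)
    scores = []
    for i, tok in enumerate(tokens):
        f = counts[tok] / n                 # frequency of the token in this (toy) corpus
        p = (i + 1) / n                     # simple positional term standing in for p(x_i)
        c = len(tok) / max_len              # crude stand-in for convolutional local-pattern features C(x_i)
        scores.append(alpha * f + beta * p + gamma * c)
    return scores

def temperatures(scores):
    # T(x_i) = S(x_i) / sum_j S(x_j)  (Equation 2)
    total = sum(scores)
    return [s / total for s in scores]

tokens = ["how", "many", "r's", "in", "strawberry"]
print(temperatures(importance_scores(tokens)))

Because the temperatures sum to one, they can be read as a relative attention budget over the sequence.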
Effect of TTM Layer on Token Temperature
After the TTM (Token Temperature Mechanism) layer, the model’s temperature for tokens changes, optimizing the
focus on important words.
Token Before Training After Training
How 0.341 0.554
many 0.408 0.328
r 0.412 0.642
’s 0.600 0.900
in 0.311 0.250
strawberry 0.759 1.125
? 1.100 1.500
Table 1: Token Temperature Analysis Before and After Training
4 Overview of the Training Process
The training process for our model is fundamentally different from traditional supervised learning approaches. Instead
of mapping inputs to predefined outputs, the model is trained to process input tokens by dynamically adjusting their
token temperature values. This mechanism enhances the model’s ability to understand the importance of each token
based on its context, frequency, and position.
4.1 Dataset Structure and Preparation
Our dataset consists of 1k samples (HF Dataset). This dataset has no solution column or predefined output labels; the purpose is not to predict answers but to refine how the model interprets and prioritizes tokens.
Each sample in the dataset is tokenized, and token characteristics such as frequency, position, and contextual dependen-
cies are extracted. These characteristics influence the temperature value assigned to each token, which in turn modulates
the attention mechanism during training.
To begin, the dataset undergoes preprocessing:
1. Character Frequency Calculation: The occurrence of each character within the token is counted to derive a
frequency score.
2. Positional Indexing: The position of the token within the sequence is noted, as earlier and later tokens may
hold different contextual significance.
The computed token attributes serve as inputs to the Token Temperature Mechanism (TTM), which assigns a dynamic
temperature to each token.
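As an illustration of this preprocessing, the sketch below tokenizes a sample with a simple whitespace split (an assumption; the actual tokenizer is not specified here) and records the character-count and positional attributes that feed the TTM.

# Sketch of the preprocessing stage: character-frequency and positional attributes per token.
from collections import Counter

def preprocess(sample: str):
    tokens = sample.split()                              # assumption: whitespace tokenization for illustration
    attributes = []
    for idx, tok in enumerate(tokens):
        char_counts = Counter(tok)                       # step 1: count each character's occurrences in the token
        attributes.append({
            "token": tok,
            "char_occurrences": sum(char_counts.values()),
            "token_length": len(tok),
            "position": idx,                             # step 2: positional index within the sequence
            "total_tokens": len(tokens),
        })
    return attributes

print(preprocess("How many r's in strawberry?"))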
4.2 Computation of Token Temperature
Once token characteristics are extracted, we compute their temperatures. This process follows three key steps:
1. Calculate Frequency Score: Tokens appearing frequently in different contexts are considered less informative.
Their importance is inversely proportional to their frequency. The frequency score is computed as:
freq_score = \frac{\sum \text{character occurrences in token}}{\text{token length} + \epsilon}    (3)
where ϵ is a small constant to prevent division by zero.
2. Determine Positional Weight: A token’s position in a sequence influences its contextual importance. A
positional score is assigned using:
pos_score = \frac{\text{token index} + 1}{\text{total tokens}}    (4)
3. Compute Token Temperature: The final temperature of each token is determined by combining frequency
and positional information:
T_{token} = \frac{1}{freq_score + 1} + pos_score    (5)
This formulation ensures that rare but contextually significant tokens receive higher temperatures, while common or
functionally redundant tokens receive lower temperatures.
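The three steps above map directly to code. The sketch below implements Equations (3)-(5) on the attributes produced in Section 4.1; it is a minimal illustration, not the released training code.

# Sketch of Equations (3)-(5): frequency score, positional score, and raw token temperature.
EPS = 1e-8  # the small constant epsilon that prevents division by zero

def freq_score(char_occurrences: int, token_length: int) -> float:
    # Equation (3): sum of character occurrences / (token length + eps)
    return char_occurrences / (token_length + EPS)

def pos_score(token_index: int, total_tokens: int) -> float:
    # Equation (4): (token index + 1) / total tokens
    return (token_index + 1) / total_tokens

def token_temperature(char_occurrences: int, token_length: int,
                      token_index: int, total_tokens: int) -> float:
    # Equation (5): 1 / (freq_score + 1) + pos_score
    return 1.0 / (freq_score(char_occurrences, token_length) + 1.0) + \
           pos_score(token_index, total_tokens)

# Example: the token "strawberry" (10 characters) at position 4 of a 5-token prompt
print(token_temperature(char_occurrences=10, token_length=10, token_index=4, total_tokens=5))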
4.3 Normalization of Token Temperatures
After computing raw token temperatures, we normalize them to ensure consistency across sequences. The maximum
token temperature in a given sequence is identified:
T_{max} = \max(T_{token})    (6)
Each token’s temperature is then normalized relative to this maximum value:
T_{normalized} = \frac{T_{token}}{T_{max}}    (7)
This ensures that token temperatures remain within a controlled range and do not introduce instability into the model.
4.4 Integration into Attention Weights
Once token temperatures are normalized, they are incorporated into the model’s attention mechanism. Attention scores,
which determine how much focus is given to each token, are modified using:
A' = A \cdot T_{normalized}    (8)
This adjustment enables the model to dynamically shift its focus based on the learned importance of each token.
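A minimal sketch of Equations (6)-(8) follows, assuming a vector of raw temperatures from Section 4.2 and a toy attention matrix; the exact axis along which the real model applies the modulation is an assumption here.

# Sketch of Equations (6)-(8): max-normalize temperatures, then scale attention weights.
import numpy as np

def normalize_temperatures(raw_temps: np.ndarray) -> np.ndarray:
    t_max = raw_temps.max()            # Equation (6): T_max = max(T_token)
    return raw_temps / t_max           # Equation (7): T_normalized = T_token / T_max

def modulate_attention(attn: np.ndarray, raw_temps: np.ndarray) -> np.ndarray:
    # Equation (8): A' = A * T_normalized, broadcast over the key/token axis (assumed)
    return attn * normalize_temperatures(raw_temps)

attn = np.full((5, 5), 0.2)                       # toy uniform attention over 5 query x 5 key tokens
raw_temps = np.array([1.2, 0.8, 1.6, 0.7, 1.9])   # raw T_token values
print(modulate_attention(attn, raw_temps))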
4.5 Optimization Through Temperature-Weighted Loss
A key aspect of training is optimizing the model’s behavior through a loss function that takes token temperature into
account. This involves two components:
1. Language Modeling Loss: The standard loss function for training language models is computed as:
L_{LM} = -\sum P_{true} \log P_{pred}    (9)
2. Temperature-Weighted Loss: To reinforce the role of token temperatures, we introduce a temperature-based
loss component:
L_{TTM} = L_{LM} \cdot \mathrm{mean}(T_{normalized})    (10)
The final loss function balances both components:
L = (1 - \alpha) L_{LM} + \alpha L_{TTM}    (11)
where α controls the relative importance of the temperature-based adjustment.
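The sketch below combines Equations (9)-(11), using a standard cross-entropy as L_LM; the tensor shapes and the value of α are illustrative assumptions.

# Sketch of Equations (9)-(11): language-modeling loss blended with a temperature-weighted term.
import torch
import torch.nn.functional as F

def ttm_loss(logits, targets, t_normalized, alpha=0.3):
    # Equation (9): cross-entropy language-modeling loss
    lm_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    # Equation (10): scale by the mean normalized token temperature
    ttm_term = lm_loss * t_normalized.mean()
    # Equation (11): convex combination controlled by alpha
    return (1 - alpha) * lm_loss + alpha * ttm_term

logits = torch.randn(2, 8, 32000)             # (batch, sequence, vocab) - toy shapes
targets = torch.randint(0, 32000, (2, 8))
t_norm = torch.rand(2, 8)                     # normalized token temperatures
print(ttm_loss(logits, targets, t_norm))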
5 Conclusion
TTM dynamically modulates token-level attention based on contextual importance. By adjusting token weights on the fly, Quasar 3.0 ensures optimal information retention while filtering out noise, enhancing reasoning depth and reducing unnecessary computation.
Golden Formula in RL Training
We aimed to discover the best formula for reinforcement learning (RL) training in reasoning models.
While some researchers suggest that models should “say more” and “think more,” we propose the opposite: let them
think less, but focus on the best kind of thinking.
Through experimentation, we found that many reasoning models such as DeepSeek-R1 or OpenAI o3 generate a
significant number of unnecessary thinking tokens. These tokens do not contribute meaningfully to the reasoning
process and can be eliminated without hurting the model’s ability to think effectively.
This insight opens the door to more efficient reasoning models by training them to focus only on essential thoughts.
Here is why and how:
Why?
Reasoning models often engage in extended internal dialogue, which may lead to overthinking. This overthinking can
cause the model to go off-topic, mix languages, or produce other inconsistencies. As a result, reasoning models show a
higher rate of hallucinations compared to base models that lack reasoning tokens.
To solve this issue:
We reduce the number of tokens used for reasoning without sacrificing accuracy. But how?
After conducting extensive research, we discovered that in DeepSeek models, overthinking and off-topic reasoning
often correlate with specific tokens that appear frequently. These include:
"wait" A token that indicates hesitation or pause.
"alternatively" A token suggesting an alternative option.
"hmm" A token often used to denote thinking or uncertainty.
In a DeepSeek-R1 task, the token "wait" appeared over 80 times, and "alternatively" over 30 times in a single
task. The prompt was:
Alice and Bob play the following game. A stack of n tokens lies before them. The players take turns
with Alice going first. On each turn, the player removes either 1 token or 4 tokens from the stack.
Whoever removes the last token wins. Find the number of positive integers n less than or equal to
2024 for which there exists a strategy for Bob that guarantees that Bob will win the game regardless
of Alice’s play.
[Graph 1: Input Tokens vs Reasoning Drift + Hallucinations. X-axis: input tokens (200-1,000); y-axis: frequency / hallucination rate (%). Series: "wait" frequency, "alternatively" frequency, hallucination rate.]
As input tokens increase, reasoning-related tokens such as “wait” and “alternatively” become more frequent. This rise
strongly correlates with increased hallucination rates in reasoning models.
The issue is not with these specific tokens themselves, but with what they trigger: the model begins generating unnecessary reasoning paths. While exploration is valuable, creating too many redundant or faulty paths leads to errors that could have been avoided.
That's why we introduce our Quasar GRPO algorithm, a method to optimize both the length and accuracy of reasoning in models.
Dr. GRPO [2]
Using the Dr. GRPO algorithm, we are able to scale our RL training while incorporating our Path Quality reward, creating the Golden Formula.
L_{Dr.GRPO}(\theta) = \frac{1}{G} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}^{dr}_{i,t},\; \mathrm{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}, 1-\epsilon, 1+\epsilon \right) \hat{A}^{dr}_{i,t} \right)    (12)
Advantage Function with Reasoning Efficiency Bias
\hat{A}^{dr}_{i,t} = \underbrace{R(q, o_i) - \mu_R}_{\text{Standardized Advantage}} + \underbrace{\lambda \cdot \rho(o_i)}_{\text{Reasoning Efficiency Bias}}    (13)
\mu_R = \frac{1}{G} \sum_{j=1}^{G} R(q, o_j)    (14)
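As a sketch of how Equations (12)-(14) fit together, the code below computes the efficiency-biased advantage for a group of G sampled outputs and the clipped surrogate objective as written; the per-token log-probabilities, the reward R, and the efficiency term ρ(o) (e.g., a negative length penalty) are assumed inputs, not part of the released pipeline.

# Sketch of Equations (12)-(14): clipped surrogate objective with an efficiency-biased advantage.
import torch

def biased_advantages(rewards, rho, lam=0.1):
    # Equation (14): group mean reward; Equation (13): standardized advantage + lambda * rho(o_i)
    mu_r = rewards.mean()
    return (rewards - mu_r) + lam * rho

def dr_grpo_objective(logp_new, logp_old, rewards, rho, eps=0.2):
    # Equation (12), computed as written; maximize it (or minimize its negative) during training.
    # logp_new / logp_old: lists of per-token log-prob tensors, one tensor per sampled output o_i.
    adv = biased_advantages(rewards, rho)
    per_output = []
    for i in range(len(logp_new)):
        ratio = torch.exp(logp_new[i] - logp_old[i])            # pi_theta / pi_theta_old, per token
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
        per_output.append(torch.minimum(ratio * adv[i], clipped * adv[i]).sum())
    return torch.stack(per_output).mean()                       # 1/G average over the G outputs

# Toy example with G = 2 sampled outputs of different lengths
logp_new = [torch.randn(6), torch.randn(9)]
logp_old = [t + 0.05 * torch.randn_like(t) for t in logp_new]
rewards = torch.tensor([1.0, 0.0])            # R(q, o_i)
rho = torch.tensor([0.2, -0.3])               # rho(o_i): placeholder efficiency scores
print(dr_grpo_objective(logp_new, logp_old, rewards, rho))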
6 Token Penalty and Reasoning Path Quality
In our optimization strategy, we incorporate a token-aware reward function that implicitly discourages redundant or
low-value generation patterns commonly observed in reasoning models. Rather than hard-coding a list of specific
tokens to avoid, the reward system dynamically penalizes linguistic patterns that historically correlate with off-topic
reasoning, hallucinations, or unnecessary verbosity.
We incentivize shorter, more efficient reasoning paths by rewarding outputs that maintain semantic accuracy while
minimizing token bloat. This encourages the model not only to reach the correct answer but to do so with the most
optimal reasoning route.
To further enhance quality, our scoring mechanism prefers reasoning trajectories that are not just valid but also minimal, elegant, and computationally efficient, leading to better generalization and reduced inference cost.
This approach helps the model stay within a cognitive budget (e.g., 8k–16k tokens) and avoids “reward hacking” by
maintaining a balance between brevity and correctness. Paths that are correct but overly verbose receive lower rewards
than those which are both correct and concise.
Reasoning Path Quality Reward Function: We define the path-based reward term R_{path} as follows:

R_{path}(o) = \mathrm{clip}\left( \alpha \cdot (n_{gold} - n_{used}) + \delta,\, 0,\, 1 \right) + \beta \cdot \mathbb{1}_{\text{flexible-optimal}}(o)    (15)

Where:
• n_{used} is the number of reasoning tokens actually used in the output.
• n_{gold} is a reference minimal path length, derived from training data.
• α and δ control the strength and base of the length-based reward.
• β boosts solutions that follow paths marked "flexible-optimal," i.e., correct, elegant, and generalizable.
• \mathbb{1}_{\text{flexible-optimal}}(o) is an indicator function scoring high-quality reasoning paths selected during validation.
This reward formulation encourages the model to not only be correct, but to generalize through the most streamlined
reasoning paths under a token and compute budget.
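A minimal sketch of Equation (15) follows; the hyperparameter values and the externally supplied flexible-optimal flag are assumptions for illustration.

# Sketch of Equation (15): length-aware path reward plus a flexible-optimal bonus.
def path_reward(n_used: int, n_gold: int, flexible_optimal: bool,
                alpha: float = 0.001, delta: float = 0.5, beta: float = 0.25) -> float:
    # clip(alpha * (n_gold - n_used) + delta, 0, 1): shorter-than-reference paths earn more reward
    length_term = max(0.0, min(1.0, alpha * (n_gold - n_used) + delta))
    # beta * indicator(flexible-optimal): bonus for paths judged correct, elegant, and generalizable
    return length_term + (beta if flexible_optimal else 0.0)

# Example: an output 400 tokens longer than the reference path, not marked flexible-optimal
print(path_reward(n_used=2400, n_gold=2000, flexible_optimal=False))   # -> 0.1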
These tokens are not inherently bad, but their uncontrolled repetition leads to hallucinations, off-topic answers, and inflated reasoning graphs.
Conclusion
Dr. GRPO combined with the Quasar 3.0 length modifications balances reward fidelity and computational discipline. It avoids reward hacking, discourages synthetic verbosity, and promotes reasoning quality over length. It is the backbone of the Quasar 3.0 architecture, enforcing both correctness and elegance in autoregressive decision-making.
7 Conclusion
Quasar 3.0 introduces a structured, multi-stage scaling approach for LLMs that improves reasoning, efficiency, and
adaptability. By integrating SFT, RL, and TTM in a systematic framework, Quasar 3.0 sets a new standard for scalable
AI training.
Resources
For model weights and datasets, keep an eye on our Hugging Face profile: Hugging Face Profile.
Acknowledgments
We would like to express our sincere gratitude to our training partner, Lambda Cloud, for providing us with high-end
GPUs and the support we need. We couldn’t thank them more for making this work possible.
We would also like to thank Edvard Castell from Kagari Systems for his ideas and support on this project; we cannot imagine it without his help. We appreciate this support and look forward to future work and potential collaborations.
References
[1] Guidance is All You Need. Available: https://arxiv.org/abs/2412.06822
[2] Dr. GRPO: Generalized Reward Policy Optimization. Available: https://arxiv.org/pdf/2503.20783