Quasar 3.0: Golden Formula in Reasoning Models
Eyad Gomaa (SILX AI), Edvard Castell (Kagari Systems)
Partner: Lambda Cloud
Abstract
Quasar 3.0 introduces a new training pipeline, TTM Training, that lets models recognize important tokens and assign them higher attention. This research aims to discover the golden formula for solving problems efficiently. Beginning with long-context CoT reasoning models, we apply Reinforcement Learning (RL) with GRPO to enable the model to dynamically find the minimal reasoning length required to reach an optimal solution. By avoiding unnecessary complexity and preventing overthinking, Quasar 3.0 enhances problem-solving efficiency while maintaining high accuracy. This breakthrough paves the way for more intelligent, concise, and effective reasoning in LLMs.
1 Introduction
Recent advancements in reasoning models, such as the DeepSeek R1 series, have significantly boosted intelligence but
at the cost of high token usage. Studies show that the more tokens a model generates, the more likely it is to hallucinate
or overthink simple problems, leading to inefficiencies and increased expenses.
We solved this problem by discovering the golden formula for problem-solving: reducing reasoning length while preserving accuracy. This breakthrough minimizes costs and introduces a new training pipeline, TTM Training. TTM enables models to distinguish important tokens from less relevant ones, assigning higher temperature (and therefore attention) to critical information, optimizing reasoning efficiency, and ensuring more intelligent decision-making.
We were able to:
• Create a new training pipeline that provides a free +10 boost on all benchmarks, leading to better generalization
and more room for improvement, achieved at a cost of just $20 in GPU hours using a single H100 GPU.
• Develop an RL formula for cheaper, faster reasoning models by optimizing reasoning length and reducing
unnecessary token usage.
• Mitigate overthinking and hallucination in reasoning models: too few input tokens cause hallucinations as the model fills in gaps, while too many lead to overthinking and excessive token generation. Our approach balances input length, ensuring efficient and accurate reasoning.
Being built by just two people made open-sourcing everything even more challenging. However, we welcome contributions from the community to help refine and enhance TTM Training.
If you’re passionate about optimizing reasoning efficiency in LLMs, join us and contribute to the project here:
https://github.com/SILX-LABS/TTM. Let’s push the boundaries of AI together!
2 Benchmarks
[Bar chart: benchmark scores on AIME 2024, Math 500, GPQA Diamond, and LiveCodeBench for Quasar 3.0, DeepSeek-R1-Distill-Qwen-7B, Quasar 3.0 (TTM + Qwen-7B), and Qwen-7B; y-axis: Score.]
Note: This is a distilled model from the 400B-parameter Quasar 3.0. Datasets and the model for the larger version will be available soon.
3 Conclusion
Training with TTM improves model accuracy by 5-10% over any baseline by allowing the model to identify and prioritize important tokens during prediction. This improves problem-solving efficiency while maintaining precision.
In particular, the TTM stage was trained in just 3 hours on a single H100 GPU, for a total training cost of $26 (approximately $9.75 per model). This demonstrates its efficiency, making it a cost-effective yet highly impactful improvement to reasoning models.
Token Temperature Mechanism (TTM) as a Training Framework
In the paper Guidance is All You Need [1], the Token Temperature Mechanism (TTM) is introduced; it helps models identify hot tokens (important tokens) and cold tokens (less relevant tokens).
We extend this idea by developing TTM as a training framework, optimizing how models assign attention based on
token importance.
3.1 How TTM Works
The process begins with an input sequence passing through the Temperature Layer, where the model determines
which tokens are critical for reasoning. Instead of treating all tokens equally, the model assigns an importance score
to each token based on multiple factors, including local patterns captured through convolutions, token frequency, and
positional significance in the sequence.
This importance score directly influences the token’s assigned temperature, where higher temperatures correspond to
greater relevance in the reasoning process. By dynamically adjusting attention to focus on essential tokens, the model
avoids unnecessary complexity, optimizes reasoning steps, and improves overall efficiency.
Prompt: "How many r's in strawberry?"

Token Temperature Mechanism assignment:
how (Medium), many (Cold), r's (Hot), in (Cold), strawberry (Hot)
3.2 TTM Algorithm
Let x_i be the i-th token in a sequence of length n. The importance score S(x_i) is computed as:

S(x_i) = \alpha \cdot f(x_i) + \beta \cdot p(x_i) + \gamma \cdot C(x_i)    (1)

where:
• f(x_i) is the frequency of token x_i in the dataset.
• p(x_i) represents the positional encoding function.
• C(x_i) captures local patterns using convolution-based features.
• α, β, γ are scaling hyperparameters.

The temperature assignment for each token is then:

T(x_i) = \frac{S(x_i)}{\sum_{j=1}^{n} S(x_j)}    (2)

where T(x_i) represents the relative importance (temperature) of token x_i, ensuring that higher-scoring tokens receive greater attention during prediction.
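To make the mechanism concrete, the following Python sketch evaluates Equations (1) and (2) on a toy sequence. The feature functions standing in for f, p, and C, and the values of α, β, γ, are illustrative assumptions rather than the released implementation.

# Sketch of Equations (1)-(2): per-token importance scores normalized into temperatures.
from collections import Counter

def importance_scores(tokens, alpha=1.0, beta=0.5, gamma=0.5):
    # S(x_i) = alpha*f(x_i) + beta*p(x_i) + gamma*C(x_i); f, p, C are placeholder stand-ins here.
    n = len(tokens)
    counts = Counter(tokens)
    max_len = max(len(t) for t in tokens)
    scores = []
    for i, tok in enumerate(tokens):
        f = counts[tok] / n                 # frequency of the token in this (toy) corpus
        p = (i + 1) / n                     # simple positional term standing in for p(x_i)
        c = len(tok) / max_len              # crude stand-in for convolutional local-pattern features C(x_i)
        scores.append(alpha * f + beta * p + gamma * c)
    return scores

def temperatures(scores):
    # T(x_i) = S(x_i) / sum_j S(x_j)  (Equation 2)
    total = sum(scores)
    return [s / total for s in scores]

tokens = ["how", "many", "r's", "in", "strawberry"]
print(temperatures(importance_scores(tokens)))

Because the temperatures sum to one, they can be read as a relative attention budget over the sequence.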
Effect of TTM Layer on Token Temperature
After the TTM (Token Temperature Mechanism) layer, the model’s temperature for tokens changes, optimizing the
focus on important words.
Token Before Training After Training
How 0.341 0.554
many 0.408 0.328
r 0.412 0.642
’s 0.600 0.900
in 0.311 0.250
strawberry 0.759 1.125
? 1.100 1.500
Table 1: Token Temperature Analysis Before and After Training
4 Overview of the Training Process
The training process for our model is fundamentally different from traditional supervised learning approaches. Instead
of mapping inputs to predefined outputs, the model is trained to process input tokens by dynamically adjusting their
token temperature values. This mechanism enhances the model’s ability to understand the importance of each token
based on its context, frequency, and position.
4.1 Dataset Structure and Preparation
Our dataset consists of 1k samples (HF Dataset). This dataset has no solution column or predefined output labels; the purpose is not to predict answers but to refine how the model interprets and prioritizes tokens.
Each sample in the dataset is tokenized, and token characteristics such as frequency, position, and contextual dependen-
cies are extracted. These characteristics influence the temperature value assigned to each token, which in turn modulates
the attention mechanism during training.
To begin, the dataset undergoes preprocessing:
1. Character Frequency Calculation: The occurrence of each character within the token is counted to derive a
frequency score.
2. Positional Indexing: The position of the token within the sequence is noted, as earlier and later tokens may
hold different contextual significance.
The computed token attributes serve as inputs to the Token Temperature Mechanism (TTM), which assigns a dynamic
temperature to each token.
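As an illustration of this preprocessing, the sketch below tokenizes a sample with a simple whitespace split (an assumption; the actual tokenizer is not specified here) and records the character-count and positional attributes that feed the TTM.

# Sketch of the preprocessing stage: character-frequency and positional attributes per token.
from collections import Counter

def preprocess(sample: str):
    tokens = sample.split()                              # assumption: whitespace tokenization for illustration
    attributes = []
    for idx, tok in enumerate(tokens):
        char_counts = Counter(tok)                       # step 1: count each character's occurrences in the token
        attributes.append({
            "token": tok,
            "char_occurrences": sum(char_counts.values()),
            "token_length": len(tok),
            "position": idx,                             # step 2: positional index within the sequence
            "total_tokens": len(tokens),
        })
    return attributes

print(preprocess("How many r's in strawberry?"))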
4.2 Computation of Token Temperature
Once token characteristics are extracted, we compute their temperatures. This process follows three key steps:
1. Calculate Frequency Score: Tokens appearing frequently in different contexts are considered less informative.
Their importance is inversely proportional to their frequency. The frequency score is computed as:
freq_score = \frac{\sum \text{character occurrences in token}}{\text{token length} + \epsilon}    (3)
where ϵ is a small constant to prevent division by zero.
2. Determine Positional Weight: A token’s position in a sequence influences its contextual importance. A
positional score is assigned using:
pos_score = \frac{\text{token index} + 1}{\text{total tokens}}    (4)
3. Compute Token Temperature: The final temperature of each token is determined by combining frequency
and positional information:
T_{token} = \frac{1}{freq_score + 1} + pos_score    (5)
This formulation ensures that rare but contextually significant tokens receive higher temperatures, while common or
functionally redundant tokens receive lower temperatures.
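The three steps above map directly to code. The sketch below implements Equations (3)-(5) on the attributes produced in Section 4.1; it is a minimal illustration, not the released training code.

# Sketch of Equations (3)-(5): frequency score, positional score, and raw token temperature.
EPS = 1e-8  # the small constant epsilon that prevents division by zero

def freq_score(char_occurrences: int, token_length: int) -> float:
    # Equation (3): sum of character occurrences / (token length + eps)
    return char_occurrences / (token_length + EPS)

def pos_score(token_index: int, total_tokens: int) -> float:
    # Equation (4): (token index + 1) / total tokens
    return (token_index + 1) / total_tokens

def token_temperature(char_occurrences: int, token_length: int,
                      token_index: int, total_tokens: int) -> float:
    # Equation (5): 1 / (freq_score + 1) + pos_score
    return 1.0 / (freq_score(char_occurrences, token_length) + 1.0) + \
           pos_score(token_index, total_tokens)

# Example: the token "strawberry" (10 characters) at position 4 of a 5-token prompt
print(token_temperature(char_occurrences=10, token_length=10, token_index=4, total_tokens=5))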
4.3 Normalization of Token Temperatures
After computing raw token temperatures, we normalize them to ensure consistency across sequences. The maximum
token temperature in a given sequence is identified:
T_{max} = \max(T_{token})    (6)
Each token’s temperature is then normalized relative to this maximum value:
T_{normalized} = \frac{T_{token}}{T_{max}}    (7)
This ensures that token temperatures remain within a controlled range and do not introduce instability into the model.
4.4 Integration into Attention Weights
Once token temperatures are normalized, they are incorporated into the model’s attention mechanism. Attention scores,
which determine how much focus is given to each token, are modified using:
A' = A \cdot T_{normalized}    (8)
This adjustment enables the model to dynamically shift its focus based on the learned importance of each token.
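A minimal sketch of Equations (6)-(8) follows, assuming a vector of raw temperatures from Section 4.2 and a toy attention matrix; the exact axis along which the real model applies the modulation is an assumption here.

# Sketch of Equations (6)-(8): max-normalize temperatures, then scale attention weights.
import numpy as np

def normalize_temperatures(raw_temps: np.ndarray) -> np.ndarray:
    t_max = raw_temps.max()            # Equation (6): T_max = max(T_token)
    return raw_temps / t_max           # Equation (7): T_normalized = T_token / T_max

def modulate_attention(attn: np.ndarray, raw_temps: np.ndarray) -> np.ndarray:
    # Equation (8): A' = A * T_normalized, broadcast over the key/token axis (assumed)
    return attn * normalize_temperatures(raw_temps)

attn = np.full((5, 5), 0.2)                       # toy uniform attention over 5 query x 5 key tokens
raw_temps = np.array([1.2, 0.8, 1.6, 0.7, 1.9])   # raw T_token values
print(modulate_attention(attn, raw_temps))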
4.5 Optimization Through Temperature-Weighted Loss
A key aspect of training is optimizing the model’s behavior through a loss function that takes token temperature into
account. This involves two components:
1. Language Modeling Loss: The standard loss function for training language models is computed as:
L_{LM} = -\sum P_{true} \log P_{pred}    (9)
2. Temperature-Weighted Loss: To reinforce the role of token temperatures, we introduce a temperature-based
loss component:
L_{TTM} = L_{LM} \cdot \mathrm{mean}(T_{normalized})    (10)
The final loss function balances both components:
L = (1 - \alpha) L_{LM} + \alpha L_{TTM}    (11)
where α controls the relative importance of the temperature-based adjustment.
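The sketch below combines Equations (9)-(11), using a standard cross-entropy as L_LM; the tensor shapes and the value of α are illustrative assumptions.

# Sketch of Equations (9)-(11): language-modeling loss blended with a temperature-weighted term.
import torch
import torch.nn.functional as F

def ttm_loss(logits, targets, t_normalized, alpha=0.3):
    # Equation (9): cross-entropy language-modeling loss
    lm_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    # Equation (10): scale by the mean normalized token temperature
    ttm_term = lm_loss * t_normalized.mean()
    # Equation (11): convex combination controlled by alpha
    return (1 - alpha) * lm_loss + alpha * ttm_term

logits = torch.randn(2, 8, 32000)             # (batch, sequence, vocab) - toy shapes
targets = torch.randint(0, 32000, (2, 8))
t_norm = torch.rand(2, 8)                     # normalized token temperatures
print(ttm_loss(logits, targets, t_norm))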
5 Conclusion
TTM dynamically modulates token-level attention based on contextual importance. By adjusting token weights on the fly, Quasar 3.0 ensures optimal information retention while filtering out noise, enhancing reasoning depth and reducing unnecessary computation.
Golden Formula in RL Training
We aimed to discover the best formula for reinforcement learning (RL) training in reasoning models.
While some researchers suggest that models should “say more” and “think more,” we propose the opposite: let them
think less, but focus on the best kind of thinking.
Through experimentation, we found that many reasoning models such as DeepSeek-R1 or OpenAI o3 generate a
significant number of unnecessary thinking tokens. These tokens do not contribute meaningfully to the reasoning
process and can be eliminated without hurting the model’s ability to think effectively.
This insight opens the door to more efficient reasoning models by training them to focus only on essential thoughts.
Here is why and how:
Why?
Reasoning models often engage in extended internal dialogue, which may lead to overthinking. This overthinking can
cause the model to go off-topic, mix languages, or produce other inconsistencies. As a result, reasoning models show a
higher rate of hallucinations compared to base models that lack reasoning tokens.
To solve this issue:
We reduce the number of tokens used for reasoning without sacrificing accuracy. But how?
After conducting extensive research, we discovered that in DeepSeek models, overthinking and off-topic reasoning
often correlate with specific tokens that appear frequently. These include:
"wait" A token that indicates hesitation or pause.
"alternatively" A token suggesting an alternative option.
"hmm" A token often used to denote thinking or uncertainty.
In a DeepSeek-R1 task, the token "wait" appeared over 80 times, and "alternatively" over 30 times in a single
task. The prompt was:
Alice and Bob play the following game. A stack of n tokens lies before them. The players take turns
with Alice going first. On each turn, the player removes either 1 token or 4 tokens from the stack.
Whoever removes the last token wins. Find the number of positive integers n less than or equal to
2024 for which there exists a strategy for Bob that guarantees that Bob will win the game regardless
of Alice’s play.
[Graph 1: Input Tokens vs Reasoning Drift + Hallucinations. X-axis: input tokens (200-1,000); y-axis: frequency / hallucination rate (%). Series: "wait" frequency, "alternatively" frequency, hallucination rate.]
As input tokens increase, reasoning-related tokens such as “wait” and “alternatively” become more frequent. This rise
strongly correlates with increased hallucination rates in reasoning models.
The issue is not with these specific tokens themselves, but with what they trigger: the model begins generating unnecessary reasoning paths. While exploration is valuable, creating too many redundant or faulty paths leads to errors that could have been avoided.
That's why we introduce our Quasar GRPO algorithm, a method to optimize both the length and accuracy of reasoning in models.
Dr. GRPO [2]
Using the Dr. GRPO algorithm, we are able to scale our RL training while incorporating our Path Quality reward, creating the Golden Formula.
L_{Dr.GRPO}(\theta) = \frac{1}{G} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}^{dr}_{i,t},\; \mathrm{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}, 1-\epsilon, 1+\epsilon \right) \hat{A}^{dr}_{i,t} \right)    (12)
Advantage Function with Reasoning Efficiency Bias
\hat{A}^{dr}_{i,t} = \underbrace{R(q, o_i) - \mu_R}_{\text{Standardized Advantage}} + \underbrace{\lambda \cdot \rho(o_i)}_{\text{Reasoning Efficiency Bias}}    (13)
\mu_R = \frac{1}{G} \sum_{j=1}^{G} R(q, o_j)    (14)
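As a sketch of how Equations (12)-(14) fit together, the code below computes the efficiency-biased advantage for a group of G sampled outputs and the clipped surrogate objective as written; the per-token log-probabilities, the reward R, and the efficiency term ρ(o) (e.g., a negative length penalty) are assumed inputs, not part of the released pipeline.

# Sketch of Equations (12)-(14): clipped surrogate objective with an efficiency-biased advantage.
import torch

def biased_advantages(rewards, rho, lam=0.1):
    # Equation (14): group mean reward; Equation (13): standardized advantage + lambda * rho(o_i)
    mu_r = rewards.mean()
    return (rewards - mu_r) + lam * rho

def dr_grpo_objective(logp_new, logp_old, rewards, rho, eps=0.2):
    # Equation (12), computed as written; maximize it (or minimize its negative) during training.
    # logp_new / logp_old: lists of per-token log-prob tensors, one tensor per sampled output o_i.
    adv = biased_advantages(rewards, rho)
    per_output = []
    for i in range(len(logp_new)):
        ratio = torch.exp(logp_new[i] - logp_old[i])            # pi_theta / pi_theta_old, per token
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
        per_output.append(torch.minimum(ratio * adv[i], clipped * adv[i]).sum())
    return torch.stack(per_output).mean()                       # 1/G average over the G outputs

# Toy example with G = 2 sampled outputs of different lengths
logp_new = [torch.randn(6), torch.randn(9)]
logp_old = [t + 0.05 * torch.randn_like(t) for t in logp_new]
rewards = torch.tensor([1.0, 0.0])            # R(q, o_i)
rho = torch.tensor([0.2, -0.3])               # rho(o_i): placeholder efficiency scores
print(dr_grpo_objective(logp_new, logp_old, rewards, rho))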
6 Token Penalty and Reasoning Path Quality
In our optimization strategy, we incorporate a token-aware reward function that implicitly discourages redundant or
low-value generation patterns commonly observed in reasoning models. Rather than hard-coding a list of specific
tokens to avoid, the reward system dynamically penalizes linguistic patterns that historically correlate with off-topic
reasoning, hallucinations, or unnecessary verbosity.
We incentivize shorter, more efficient reasoning paths by rewarding outputs that maintain semantic accuracy while
minimizing token bloat. This encourages the model not only to reach the correct answer but to do so with the most
optimal reasoning route.
To further enhance quality, our scoring mechanism prefers reasoning trajectories that are not just valid but also minimal, elegant, and computationally efficient, leading to better generalization and reduced inference cost.
This approach helps the model stay within a cognitive budget (e.g., 8k–16k tokens) and avoids “reward hacking” by
maintaining a balance between brevity and correctness. Paths that are correct but overly verbose receive lower rewards
than those which are both correct and concise.
Reasoning Path Quality Reward Function: We define the path-based reward term R_{path} as follows:

R_{path}(o) = \mathrm{clip}\left( \alpha \cdot (n_{gold} - n_{used}) + \delta,\, 0,\, 1 \right) + \beta \cdot \mathbb{1}_{\text{flexible-optimal}}(o)    (15)

Where:
• n_{used} is the number of reasoning tokens actually used in the output.
• n_{gold} is a reference minimal path length, derived from training data.
• α and δ control the strength and base of the length-based reward.
• β boosts solutions that follow paths marked "flexible-optimal," i.e., correct, elegant, and generalizable.
• \mathbb{1}_{\text{flexible-optimal}}(o) is an indicator function scoring high-quality reasoning paths selected during validation.
This reward formulation encourages the model to not only be correct, but to generalize through the most streamlined
reasoning paths under a token and compute budget.
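A minimal sketch of Equation (15) follows; the hyperparameter values and the externally supplied flexible-optimal flag are assumptions for illustration.

# Sketch of Equation (15): length-aware path reward plus a flexible-optimal bonus.
def path_reward(n_used: int, n_gold: int, flexible_optimal: bool,
                alpha: float = 0.001, delta: float = 0.5, beta: float = 0.25) -> float:
    # clip(alpha * (n_gold - n_used) + delta, 0, 1): shorter-than-reference paths earn more reward
    length_term = max(0.0, min(1.0, alpha * (n_gold - n_used) + delta))
    # beta * indicator(flexible-optimal): bonus for paths judged correct, elegant, and generalizable
    return length_term + (beta if flexible_optimal else 0.0)

# Example: an output 400 tokens longer than the reference path, not marked flexible-optimal
print(path_reward(n_used=2400, n_gold=2000, flexible_optimal=False))   # -> 0.1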
These tokens are not inherently bad, but their uncontrolled repetition leads to hallucinations, off-topic answers, and inflated reasoning graphs.
Conclusion
Dr. GRPO combined with the Quasar 3.0 length modifications balances reward fidelity and computational discipline. It avoids reward hacking, discourages synthetic verbosity, and promotes reasoning quality over length. It is the backbone of the Quasar 3.0 architecture, enforcing both correctness and elegance in autoregressive decision-making.
7 Conclusion
Quasar 3.0 introduces a structured, multi-stage scaling approach for LLMs that improves reasoning, efficiency, and
adaptability. By integrating SFT, RL, and TTM in a systematic framework, Quasar 3.0 sets a new standard for scalable
AI training.
Resources
For model weights and datasets, keep an eye on our Hugging Face profile: Hugging Face Profile.
Acknowledgments
We would like to express our sincere gratitude to our training partner, Lambda Cloud, for providing us with high-end
GPUs and the support we need. We couldn’t thank them more for making this work possible.
We would also like to thank Edvard Castell from Kagari Systems for his ideas and support on this project; we cannot imagine it without his help. We appreciate this support and look forward to future work and potential collaborations.
References
[1] Guidance is All You Need. Available: https://arxiv.org/abs/2412.06822
[2] Dr. GRPO: Generalized Reward Policy Optimization. Available: https://arxiv.org/pdf/2503.20783