
Conversation


@ichbinhandsome ichbinhandsome commented Dec 14, 2025

As discussed in #15902, Eagle3 represents the current SOTA in speculative decoding and is widely adopted across the industry. Integrating Eagle3 into llama.cpp strengthens its competitiveness among leading inference frameworks: with Eagle3 speculative decoding, inference performance improves significantly, achieving a 2–3× speedup.
This enhancement is the result of close collaboration between the NVIDIA and GGML teams, showcasing a strong technical partnership.

The following provides a brief overview of this PR:

EAGLE3 is an encoder-decoder based speculative decoding method:

  • Extracts features from the target model at specific layers
  • Uses a feature fusion layer to compress the target features
  • Generates draft tokens with a single-layer decoder
  • Maps the draft vocabulary to the target vocabulary via the d2t tensor (see the sketch after this list)
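
To make the d2t step concrete, here is a minimal standalone sketch (not the PR's code; names and sizes are illustrative): the draft head produces logits over a reduced draft vocabulary, and d2t stores a per-index offset that maps each draft token id onto its target-vocabulary id, with all unmapped target logits set to -inf.

// Minimal sketch of the draft-to-target (d2t) logit mapping; illustrative only.
#include <cstdint>
#include <limits>
#include <vector>

std::vector<float> map_draft_to_target(const std::vector<float>   & draft_logits, // [n_draft_vocab]
                                       const std::vector<int64_t> & d2t,          // [n_draft_vocab] offsets
                                       int64_t                      n_target_vocab) {
    // unmapped target tokens get -inf so they can never be proposed as drafts
    std::vector<float> target_logits(n_target_vocab, -std::numeric_limits<float>::infinity());
    for (size_t j = 0; j < draft_logits.size(); ++j) {
        const int64_t target_id = (int64_t) j + d2t[j]; // d2t[j] is the offset from draft id j to its target id
        target_logits[target_id] = draft_logits[j];
    }
    return target_logits;
}

The actual implementation loads the d2t tensor from the GGUF once and applies the same offset mapping to the last token's draft logits (see the llama-context.cpp snippet quoted later in the review).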

Key changes:

  • Add LLM_ARCH_EAGLE3 architecture
  • Add EAGLE3 encoder/decoder graph (src/models/eagle3.cpp)
  • Add feature extraction from target model layers
  • Add g_embeddings handling for decoder input
  • Add GGML_TENSOR_FLAG_SYNC for GPU synchronization
  • Add --eagle3 flag for speculative-simple example
  • Add EAGLE3 model conversion in convert_hf_to_gguf.py

EAGLE3 Architecture Overview:

┌─────────────────────────────────────────────────────────────────┐
│                    EAGLE3 Overview                              │
└─────────────────────────────────────────────────────────────────┘

  Target Model          EAGLE3 Encoder         EAGLE3 Decoder
  (LLaMA 8B)              (FC Layer)           (1-layer Transformer)
       │                      │                       │
       │                      │                       │
       ▼                      ▼                       ▼
┌─────────────┐        ┌─────────────┐        ┌─────────────────┐
│  Generate   │        │  Compress   │        │  Generate Draft │
│  Features   │───────►│  Features   │───────►│  Tokens Fast    │
│  [12288]    │        │  [4096]     │        │  [k tokens]     │
└─────────────┘        └─────────────┘        └────────┬────────┘
                                                       │
                                                       ▼
                                              ┌─────────────────┐
                                              │  Verify Drafts  │
                                              │  with Target    │
                                              └─────────────────┘

How to run EAGLE3 in llama.cpp

Requirements

This PR currently supports only two EAGLE3 models: the EAGLE3 draft models for LLaMA3.1-Instruct-8B and LLaMA3.3-Instruct-70B (see the evaluation below).

Step 1: Convert Models to GGUF Format

  • Convert Target Model
TARGET_MODEL_HF="${MODELS_DIR}/Meta-Llama-3.1-8B-Instruct"
TARGET_MODEL_GGUF="${MODELS_DIR}/Meta-Llama-3.1-8B-Instruct_bf16.gguf"

python convert_hf_to_gguf.py \
    "${TARGET_MODEL_HF}" \
    --outtype bf16 \
    --outfile "${TARGET_MODEL_GGUF}"
  • Convert EAGLE3 Draft Model
TARGET_MODEL_HF="${MODELS_DIR}/Meta-Llama-3.1-8B-Instruct"
EAGLE3_MODEL_HF="${MODELS_DIR}/EAGLE3-LLaMA3.1-Instruct-8B"
EAGLE3_MODEL_GGUF="${MODELS_DIR}/EAGLE3-LLaMA3.1-Instruct-8B_fp16.gguf"

python convert_hf_to_gguf.py \
    "${EAGLE3_MODEL_HF}" \
    --outtype f16 \
    --target-model-dir "${TARGET_MODEL_HF}" \
    --outfile "${EAGLE3_MODEL_GGUF}"

Step 2: Compile llama.cpp

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

Step 3: Run EAGLE3 Speculative Decoding

for prompt in \
    "Write a quicksort algorithm in Python. Write code only." \
    "Explain the Pythagorean theorem" \
    "Plan a 1 day trip to DC"; do
  echo "=== Prompt: $prompt ==="
    ./build/bin/llama-speculative-simple \
      -m "${TARGET_MODEL_GGUF}" \
      -md "${EAGLE3_MODEL_GGUF}" \
      --eagle3 -p "$prompt" -n 256 --draft 8 \
      --temp 0 --top-k 1 --seed 42 -ngl 99 -ngld 99 
done

Performance Evaluation (RTX A6000 48GB)

Note: Using the chat_template for each model version can improve acceptance rates. Always apply the model’s corresponding chat_template when constructing prompts.
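
For reference, here is a minimal sketch (not part of the PR) of what a Llama 3.1-style chat-formatted prompt looks like; the markers below are the standard Llama 3.1 Instruct special tokens, and the same idea applies to other target models with their own templates.

// Sketch: wrapping a raw user prompt in the Llama 3.1 Instruct chat template.
// The BOS token (<|begin_of_text|>) is normally added by llama.cpp during
// tokenization, so it is left out of the string here.
#include <cstdio>
#include <string>

static std::string llama31_wrap(const std::string & user_prompt) {
    return "<|start_header_id|>user<|end_header_id|>\n\n"
         + user_prompt
         + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n";
}

int main() {
    std::printf("%s", llama31_wrap("Explain the Pythagorean theorem").c_str());
    return 0;
}

The resulting string can then be passed to llama-speculative-simple via -p.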

  • LLaMA3.1-Instruct-8B with BF16, its Eagle3 with FP16

    | Prompt | Baseline (llama-cli) | EAGLE3 (draft_size=8) | Accept Rate | Speedup |
    |--------|----------------------|-----------------------|-------------|---------|
    | Write a quicksort algorithm in Python. Write code only. | 44.5 t/s | 146.2 t/s | 80.6% | 3.28x |
    | Explain the Pythagorean theorem | 44.5 t/s | 127.1 t/s | 77.4% | 2.85x |
    | Plan a 1 day trip to DC | 44.5 t/s | 113.8 t/s | 80.9% | 2.55x |

  • LLaMA3.1-Instruct-8B with Q4_K_M, its Eagle3 with Q4_K_M

    | Prompt | Baseline (llama-cli) | EAGLE3 (draft_size=8) | Accept Rate | Speedup |
    |--------|----------------------|-----------------------|-------------|---------|
    | Write a quicksort algorithm in Python. Write code only. | 121.5 t/s | 274.4 t/s | 92.5% | 2.26x |
    | Explain the Pythagorean theorem | 121.4 t/s | 238.9 t/s | 79.4% | 1.97x |
    | Plan a 1 day trip to DC | 121.4 t/s | 196.5 t/s | 77.2% | 1.62x |

  • LLaMA3.3-Instruct-70B with Q4_K_M, its Eagle3 with Q4_K_M

    | Prompt | Baseline (llama-cli) | EAGLE3 (draft_size=8) | Accept Rate | Speedup |
    |--------|----------------------|-----------------------|-------------|---------|
    | Write a quicksort algorithm in Python. Write code only. | 15.6 t/s | 33.4 t/s | 73.6% | 2.14x |
    | Explain the Pythagorean theorem | 15.6 t/s | 37.6 t/s | 82.0% | 2.41x |
    | Plan a 1 day trip to DC | 15.6 t/s | 28.8 t/s | 69.3% | 1.85x |

Details of GGML backend modifications (Fixed, no longer needed)

In the Eagle3 decoder, two parallel inputs are processed:

input_embeds ──→ RMS_NORM ──┐
                            ├──→ CONCAT ──→ Transformer Decoder
g_embeddings ──→ RMS_NORM ──┘

When both RMS_NORM operations run in the same GPU split, a lack of synchronization causes buffer contention and race conditions (CPU execution is fine as it auto‑syncs between subgraphs).

Solution:
Use ggml_set_sync() to add a synchronization point after the first RMS_NORM, forcing the scheduler to create a split boundary and synchronize before continuing.

input_embeds ──→ RMS_NORM ──→ [SYNC] ──┐
                                       ├──→ CONCAT ──→ Transformer Decoder
g_embeddings ─────────────→ RMS_NORM ──┘
         (split 1)            |         (split 2)
                           barrier

This ensures correct execution and can be applied to any parallel path that needs synchronization, not just Eagle3.
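
For readers less familiar with ggml graph construction, below is a small standalone CPU sketch of the same parallel normalize-then-concat shape using the public ggml API. This is not the PR's eagle3.cpp; ggml_set_sync is the API added by this PR and is omitted here, since the CPU path executes ops in order and needs no explicit barrier. In recent ggml versions the CPU compute entry point is declared in ggml-cpu.h.

// Standalone sketch: two parallel RMS_NORM paths feeding a CONCAT, computed on CPU.
#include "ggml.h"
#include "ggml-cpu.h"   // ggml_graph_compute_with_ctx

int main() {
    ggml_init_params params = { /*mem_size*/ 16*1024*1024, /*mem_buffer*/ nullptr, /*no_alloc*/ false };
    ggml_context * ctx = ggml_init(params);

    const int64_t n_embd = 8, n_tokens = 4;

    // stand-ins for input_embeds and g_embeddings
    ggml_tensor * input_embeds = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_tokens);
    ggml_tensor * g_embeddings = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_tokens);
    for (int64_t i = 0; i < ggml_nelements(input_embeds); ++i) {
        ((float *) input_embeds->data)[i] = 1.0f;
        ((float *) g_embeddings->data)[i] = 2.0f;
    }

    // each input is normalized on its own path, then the results are concatenated
    ggml_tensor * a   = ggml_rms_norm(ctx, input_embeds, 1e-5f);
    ggml_tensor * b   = ggml_rms_norm(ctx, g_embeddings, 1e-5f);
    ggml_tensor * cur = ggml_concat(ctx, a, b, 0); // [2*n_embd, n_tokens] decoder input

    ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, cur);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads*/ 1); // CPU executes ops sequentially

    ggml_free(ctx);
    return 0;
}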

Example results

  • Prompt: "Write a quicksort algorithm in Python. Write code only."
    (output screenshot)
  • Prompt: "Explain the Pythagorean theorem"
    (output screenshot)
  • Prompt: "Plan a 1 day trip to DC"
    (output screenshot)

Future Steps

  • Support more Eagle3 models
  • Currently, Eagle3 is integrated only in llama-speculative-simple; support may need to be extended to other APIs where possible
  • Support context-dependent tree sampling (tree attention) as described in the Eagle3 paper to improve accept rate
  • Support batch processing (batch size > 1) with Eagle3 speculative decoding

Comment on lines 65 to 68

// Force a sync point between the two parallel RMS_NORM paths
// This prevents buffer reuse issues on GPU (EAGLE3 GPU fix)
ggml_set_sync(input_embeds_normed);

Member

It is very strange that you need to do this explicitly.

The ggml_concat operator (like every other ggml op) tracks the input tensors on which it depends. So it should not be possible to get a buffer reuse when the data in the buffer is still pending a computation.

I think this sync should not be necessary and if removing it causes some data corruption, the cause is something else which we should investigate in detail.

Can you confirm that removing this call still causes problems?

Author
@ichbinhandsome ichbinhandsome Dec 15, 2025

I just revalidated this: without calling ggml_set_sync, the buffer data gets overwritten, causing the acceptance rate to drop to nearly 3-4%. This issue only occurs on the GPU side; when running the draft model on the CPU, the acceptance rate remains stable and ggml_set_sync is not required.

The result buffers of the two RMS_NORM operations appear to conflict, with one being overwritten by invalid (garbage) values. ggml_set_sync is used to enforce synchronization between the two RMS_NORM operations on the GPU side.

Author
@ichbinhandsome ichbinhandsome Dec 15, 2025

I also tried using ggml_set_output on the two RMS_NORM results to avoid the buffer overwriting. However, once I set it, the buffer for the concatenated result got overwritten. I then tried setting that as well, but the subsequent Q, K, and V attention result buffers were still being overwritten. It seems there is an issue with buffer allocation in the scheduler when handling parallel inputs on the GPU, so I came up with this method to resolve the issue.


Member

Ok, I am able to reproduce the issue. Looking into this.


Member
@ggerganov ggerganov Dec 16, 2025

The problem is that here you are using the synchronous backend buffer call ggml_backend_tensor_get to get the output logits:

// EAGLE3: Map draft vocab to target vocab
if (model.arch == LLM_ARCH_EAGLE3 && model.d2t) {
    static thread_local std::vector<int64_t> eagle3_d2t_map;
    static thread_local std::vector<float>   eagle3_draft_logits;

    const int64_t  draft_vocab_size = t_logits->ne[0];
    const uint32_t last_idx         = n_outputs - 1;

    // Load d2t mapping once (on first call)
    if (eagle3_d2t_map.empty()) {
        eagle3_d2t_map.resize(model.d2t->ne[0]);
        ggml_backend_tensor_get(model.d2t, eagle3_d2t_map.data(), 0, eagle3_d2t_map.size() * sizeof(int64_t));
    }

    // Read only the last token's draft logits
    eagle3_draft_logits.resize(draft_vocab_size);
    const size_t last_offset = last_idx * draft_vocab_size * sizeof(float);
    ggml_backend_tensor_get(t_logits, eagle3_draft_logits.data(), last_offset, draft_vocab_size * sizeof(float));

    // Map only the last token's draft logits to target vocab
    float * last_logits_out = logits_out + last_idx * n_vocab;
    std::fill(last_logits_out, last_logits_out + n_vocab, -std::numeric_limits<float>::infinity());
    for (int64_t j = 0; j < draft_vocab_size; j++) {
        const int64_t target_id = j + eagle3_d2t_map[j];
        GGML_ASSERT(target_id >= 0 && target_id < n_vocab);
        last_logits_out[target_id] = eagle3_draft_logits[j];
    }
} else {

This is incorrect because the call will get queued in a different stream compared to where the computation runs, so effectively it will not wait for the computation to finish before extracting the result.

To fix this, use the backend async call like this for now:

diff --git a/src/llama-context.cpp b/src/llama-context.cpp
index ea6dfaea3..3506edd92 100644
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -1261,7 +1261,8 @@ int llama_context::decode(const llama_batch & batch_inp) {
                     // Read only the last token's draft logits
                     eagle3_draft_logits.resize(draft_vocab_size);
                     const size_t last_offset = last_idx * draft_vocab_size * sizeof(float);
-                    ggml_backend_tensor_get(t_logits, eagle3_draft_logits.data(), last_offset, draft_vocab_size * sizeof(float));
+                    ggml_backend_tensor_get_async(backend_res, t_logits, eagle3_draft_logits.data(), last_offset, draft_vocab_size * sizeof(float));
+                    synchronize();
                     
                     
                     // Map only the last token's draft logits to target vocab
diff --git a/src/models/eagle3.cpp b/src/models/eagle3.cpp
index 8987a0c58..43d7a331d 100644
--- a/src/models/eagle3.cpp
+++ b/src/models/eagle3.cpp
@@ -65,7 +65,7 @@ llm_build_eagle3_decode::llm_build_eagle3_decode(const llama_model & model, cons
 
         // Force a sync point between the two parallel RMS_NORM paths
         // This prevents buffer reuse issues on GPU (EAGLE3 GPU fix)
-        ggml_set_sync(input_embeds_normed);
+        //ggml_set_sync(input_embeds_normed);
 
         // Apply hidden_norm to g_embeddings
         ggml_tensor * g_embeddings_normed = build_norm(g_embeddings,

Please confirm that with this patch, you don't need the ggml_set_sync stuff.


Collaborator

Is it actually required to use get_async, or is there just a missing synchronize() after the async graph_compute call?


Member

It's not required - synchronize() before the tensor_get() should also work. It's just that I expect that this synchronization will eventually be moved up the stack, similar to how we don't synchronize when extracting the regular logits data below, and this would have to become tensor_get_async either way.


Author

Thanks @ggerganov for pointing this out! I just updated this PR to fix the bug and remove the ggml_set_sync API. Rebuilt and tested, everything works well.


Member

Great.

Btw, do you mind if I push to the branch directly? I want to do a cleanup pass over the implementation, and it would be easier for me to push directly instead of creating PRs against your branch.


Author
@ichbinhandsome ichbinhandsome Dec 16, 2025

Sure, please go ahead. If there’s anything I could help with, just let me know.

Collaborator

ngxson commented Dec 15, 2025

Judging by the description of this PR, I believe many models with multi-token prediction also use the same strategy of reusing hidden features from the main model.

It can be quite interesting to generalize this feature to support other models. I would expect some kind of sub-llama_context that allows both the main and draft models to share the same cgraph, avoiding the need to explicitly pass the intermediate embeddings through host memory.

@ggerganov
Member

It can be quite interesting to generalize this feature to support other models.

I will definitely be looking at refactoring the implementation to become more generic before merging it. The initial results in terms of performance are really great, but we'll need to work on cleaning up the code and reducing the special-casing in several places. I'll try to provide insights on how to do that in the coming days.

@ichbinhandsome
Author

It can be quite interesting to generalize this feature to support other models.

I will definitely be looking at refactoring the implementation to become more generic before merging it. The initial results in terms of performance are really great, but we'll need to work on cleaning up the code and reducing the special-casing in several places. I'll try to provide insights on how to do that in the coming days.

Thanks @ggerganov and @ngxson for your input. Definitely looking forward to hearing your feedback and improving this PR.
