Conversation

@hnyls2002 (Collaborator) commented Oct 9, 2025

Initial results

export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
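# Opt into the beta (v2) speculative decoding path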
export SGLANG_ENABLE_SPEC_V2=1
MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct
SPEC_MODEL=lmsys/sglang-EAGLE-LLaMA3-Instruct-8B
PORT=23333
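# $(seq -s ' ' 1 64) expands to "1 2 ... 64": capture CUDA graphs for decode batch sizes 1-64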
python -m sglang.launch_server \
    --dtype float16 \
    --model-path $MODEL \
    --attention-backend triton \
    --decode-log-interval 1 \
    --cuda-graph-bs $(seq -s ' ' 1 64) \
    --speculative-algorithm EAGLE \
    --speculative-draft-model $SPEC_MODEL \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 6 \
    --host 127.0.0.1 \
    --port $PORT

Benchmarked with python -m sglang.test.send_one:

Beta Eagle:

acc_length=3.01
speed=273.43 token/s

Main Eagle:

acc_length=2.99
speed=246.80 token/s

With nearly identical acceptance lengths (3.01 vs. 2.99), the beta path decodes roughly 10% faster (273.43 vs. 246.80 token/s). The current implementation keeps the beta code path completely separate from the main code path; more features and optimizations will come soon, along with a roadmap and a design doc.
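
For a manual spot check outside the test helper, the server's native /generate endpoint can be queried directly. A minimal sketch in Python (the endpoint and request schema are standard SGLang; the exact meta_info contents vary by version, so treat the printed stats as an assumption):

import requests

PORT = 23333  # must match --port above

resp = requests.post(
    f"http://127.0.0.1:{PORT}/generate",
    json={
        "text": "Explain speculative decoding in one sentence.",
        "sampling_params": {"temperature": 0.0, "max_new_tokens": 128},
    },
)
out = resp.json()
print(out["text"])
print(out.get("meta_info", {}))  # per-request stats; field names vary by version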

Summary by CodeRabbit

  • New Features

    • Introduces beta "EAGLE v2" speculative decoding with draft/verify overlap, relay of draft state, and a new worker path.
    • Adds export SGLANG_ENABLE_SPEC_V2=1 to opt into beta speculative behavior.
  • Performance

    • Improved throughput/latency via overlapped draft/verify, lazy buffer allocation, and CUDA-graph–aware execution and cache management (see the sketch after this list).
  • Tests

    • Adds end-to-end GSM8K-style evaluations and includes them in the per-commit test suite.
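
For intuition on the overlapped draft/verify scheduling, the following is a minimal PyTorch sketch of the pattern, not SGLang's actual worker code: draft_forward and verify_forward are hypothetical stand-ins for the EAGLE draft and target-model verification passes.

import torch

def draft_forward(x):
    # hypothetical stand-in for the EAGLE draft-model forward
    return torch.relu(x @ x)

def verify_forward(p):
    # hypothetical stand-in for the target-model verification forward
    return p.softmax(-1)

main_stream = torch.cuda.current_stream()
draft_stream = torch.cuda.Stream()

x = torch.randn(1024, 1024, device="cuda")
proposal = draft_forward(x)                    # prime the pipeline on the main stream
for _ in range(8):
    draft_stream.wait_stream(main_stream)      # drafting may read main-stream state
    with torch.cuda.stream(draft_stream):
        next_proposal = draft_forward(x)       # draft for step k+1 ...
    verify_forward(proposal)                   # ... overlaps with verify of step k
    main_stream.wait_stream(draft_stream)      # hand the new proposal to the verifier
    proposal = next_proposal
torch.cuda.synchronize()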

@hnyls2002 hnyls2002 merged commit 20a6c0a into main Oct 12, 2025
139 of 157 checks passed
@hnyls2002 hnyls2002 deleted the lsyin/v2-eagle-worker branch October 12, 2025 03:02
coderabbitai bot added a commit that referenced this pull request Oct 12, 2025
Docstring generation was requested by @JustinTong0323.

* #11398 (comment)

The following files were modified:

* `python/sglang/srt/layers/attention/base_attn_backend.py`
* `python/sglang/srt/layers/attention/triton_backend.py`
* `python/sglang/srt/layers/logits_processor.py`
* `python/sglang/srt/managers/overlap_utils.py`
* `python/sglang/srt/managers/schedule_batch.py`
* `python/sglang/srt/managers/scheduler.py`
* `python/sglang/srt/managers/scheduler_metrics_mixin.py`
* `python/sglang/srt/managers/scheduler_output_processor_mixin.py`
* `python/sglang/srt/managers/tp_worker.py`
* `python/sglang/srt/model_executor/cuda_graph_runner.py`
* `python/sglang/srt/model_executor/forward_batch_info.py`
* `python/sglang/srt/server_args.py`
* `python/sglang/srt/speculative/eagle_info.py`
* `python/sglang/srt/speculative/eagle_info_v2.py`
* `python/sglang/srt/speculative/eagle_worker_v2.py`
* `python/sglang/srt/speculative/spec_utils.py`
* `test/srt/test_eagle_infer_beta.py`
@Ximingwang-09 (Contributor)

Great work! I observed a significant performance improvement with the Triton backend, and I would like to know whether there are any plans to support more attention backends like fa3 and flashinfer?

@Vincent-zym

Could you explain why this plan_stream is used? Will there be any problems if forward_stream is used directly?

@sgl-project sgl-project deleted 18 comments from coderabbitai bot Oct 21, 2025
@hnyls2002 (Collaborator, Author)

@Vincent-zym

Could you explain why this plan_stream is used? Will there be any problems if forward_stream is used directly?

To gain extra acceleration from a dual-stream overlap: planning for the next batch can run alongside the current forward pass. It is currently disabled by #11724.
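
For illustration, a compact sketch of that dual-stream idea (hypothetical plan/forward stand-ins, not the actual SGLang code): metadata planning for batch k+1 runs on plan_stream while the forward pass for batch k runs on forward_stream; issuing both on forward_stream would serialize them.

import torch

plan_stream = torch.cuda.Stream()
forward_stream = torch.cuda.current_stream()

def plan(b):
    # hypothetical stand-in for attention-metadata planning
    return b.cumsum(0)

def forward(b, meta):
    # hypothetical stand-in for a model forward that consumes the metadata
    return (b @ b) + meta

batches = [torch.randn(512, 512, device="cuda") for _ in range(4)]
torch.cuda.synchronize()                       # inputs materialized before cross-stream reads
meta = plan(batches[0])
for i, b in enumerate(batches):
    with torch.cuda.stream(plan_stream):       # plan batch i+1 on the side stream ...
        next_meta = plan(batches[(i + 1) % len(batches)])
    forward(b, meta)                           # ... while batch i computes on forward_stream
    forward_stream.wait_stream(plan_stream)    # sync before next_meta is consumed
    meta = next_meta
torch.cuda.synchronize()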

@b8zhong (Collaborator) commented Oct 25, 2025

@Ximingwang-09, welcome to try #12128.

lpc0220 pushed a commit to lpc0220/sglang that referenced this pull request Oct 29, 2025
Co-authored-by: Lianmin Zheng <15100009+merrymercy@users.noreply.github.com>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>