Conversation

@hnyls2002 (Collaborator) commented Oct 9, 2025

Initial results

export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
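# Opt into the beta (v2) speculative decoding path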
export SGLANG_ENABLE_SPEC_V2=1
MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct
SPEC_MODEL=lmsys/sglang-EAGLE-LLaMA3-Instruct-8B
PORT=23333
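# $(seq -s ' ' 1 64) expands to "1 2 ... 64": capture CUDA graphs for decode batch sizes 1-64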
python -m sglang.launch_server \
    --dtype float16 \
    --model-path $MODEL \
    --attention-backend triton \
    --decode-log-interval 1 \
    --cuda-graph-bs $(seq -s ' ' 1 64) \
    --speculative-algorithm EAGLE \
    --speculative-draft-model $SPEC_MODEL \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 6 \
    --host 127.0.0.1 \
    --port $PORT

Benchmarked with python -m sglang.test.send_one:

Beta Eagle:

acc_length=3.01
speed=273.43 token/s

Main Eagle:

acc_length=2.99
speed=246.80 token/s

With nearly identical acceptance lengths (3.01 vs. 2.99), the beta path decodes roughly 10% faster (273.43 vs. 246.80 token/s). The current implementation keeps the beta code path completely separate from the main code path; more features and optimizations will come soon, along with a roadmap and a design doc.
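
For a manual spot check outside the test helper, the server's native /generate endpoint can be queried directly. A minimal sketch in Python (the endpoint and request schema are standard SGLang; the exact meta_info contents vary by version, so treat the printed stats as an assumption):

import requests

PORT = 23333  # must match --port above

resp = requests.post(
    f"http://127.0.0.1:{PORT}/generate",
    json={
        "text": "Explain speculative decoding in one sentence.",
        "sampling_params": {"temperature": 0.0, "max_new_tokens": 128},
    },
)
out = resp.json()
print(out["text"])
print(out.get("meta_info", {}))  # per-request stats; field names vary by version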

Summary by CodeRabbit

  • New Features

    • Introduces beta "EAGLE v2" speculative decoding with draft/verify overlap, relay of draft state, and a new worker path.
    • Adds export SGLANG_ENABLE_SPEC_V2=1 to opt into beta speculative behavior.
  • Performance

    • Improved throughput/latency via overlapped draft/verify, lazy buffer allocation, and CUDA-graph–aware execution and cache management (see the sketch after this list).
  • Tests

    • Adds end-to-end GSM8K-style evaluations and includes them in the per-commit test suite.
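
For intuition on the overlapped draft/verify scheduling, the following is a minimal PyTorch sketch of the pattern, not SGLang's actual worker code: draft_forward and verify_forward are hypothetical stand-ins for the EAGLE draft and target-model verification passes.

import torch

def draft_forward(x):
    # hypothetical stand-in for the EAGLE draft-model forward
    return torch.relu(x @ x)

def verify_forward(p):
    # hypothetical stand-in for the target-model verification forward
    return p.softmax(-1)

main_stream = torch.cuda.current_stream()
draft_stream = torch.cuda.Stream()

x = torch.randn(1024, 1024, device="cuda")
proposal = draft_forward(x)                    # prime the pipeline on the main stream
for _ in range(8):
    draft_stream.wait_stream(main_stream)      # drafting may read main-stream state
    with torch.cuda.stream(draft_stream):
        next_proposal = draft_forward(x)       # draft for step k+1 ...
    verify_forward(proposal)                   # ... overlaps with verify of step k
    main_stream.wait_stream(draft_stream)      # hand the new proposal to the verifier
    proposal = next_proposal
torch.cuda.synchronize()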

@hnyls2002 hnyls2002 merged commit 20a6c0a into main Oct 12, 2025
139 of 157 checks passed
@hnyls2002 hnyls2002 deleted the lsyin/v2-eagle-worker branch October 12, 2025 03:02
coderabbitai bot added a commit that referenced this pull request Oct 12, 2025
Docstring generation was requested by @JustinTong0323.

* #11398 (comment)

The following files were modified:

* `python/sglang/srt/layers/attention/base_attn_backend.py`
* `python/sglang/srt/layers/attention/triton_backend.py`
* `python/sglang/srt/layers/logits_processor.py`
* `python/sglang/srt/managers/overlap_utils.py`
* `python/sglang/srt/managers/schedule_batch.py`
* `python/sglang/srt/managers/scheduler.py`
* `python/sglang/srt/managers/scheduler_metrics_mixin.py`
* `python/sglang/srt/managers/scheduler_output_processor_mixin.py`
* `python/sglang/srt/managers/tp_worker.py`
* `python/sglang/srt/model_executor/cuda_graph_runner.py`
* `python/sglang/srt/model_executor/forward_batch_info.py`
* `python/sglang/srt/server_args.py`
* `python/sglang/srt/speculative/eagle_info.py`
* `python/sglang/srt/speculative/eagle_info_v2.py`
* `python/sglang/srt/speculative/eagle_worker_v2.py`
* `python/sglang/srt/speculative/spec_utils.py`
* `test/srt/test_eagle_infer_beta.py`
@Ximingwang-09 (Contributor)

Great work! I observed a significant performance improvement with the Triton backend, and I would like to know whether there are any plans to support more attention backends like fa3 and flashinfer?

@Vincent-zym

Could you explain why this plan_stream is used? Will there be any problems if forward_stream is used directly?

@sgl-project sgl-project deleted 18 comments from coderabbitai bot Oct 21, 2025
@hnyls2002 (Collaborator, Author)

@Vincent-zym

Could you explain why this plan_stream is used? Will there be any problems if forward_stream is used directly?

To gain extra acceleration from a dual-stream overlap: planning for the next batch can run alongside the current forward pass. It is currently disabled by #11724.
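
For illustration, a compact sketch of that dual-stream idea (hypothetical plan/forward stand-ins, not the actual SGLang code): metadata planning for batch k+1 runs on plan_stream while the forward pass for batch k runs on forward_stream; issuing both on forward_stream would serialize them.

import torch

plan_stream = torch.cuda.Stream()
forward_stream = torch.cuda.current_stream()

def plan(b):
    # hypothetical stand-in for attention-metadata planning
    return b.cumsum(0)

def forward(b, meta):
    # hypothetical stand-in for a model forward that consumes the metadata
    return (b @ b) + meta

batches = [torch.randn(512, 512, device="cuda") for _ in range(4)]
torch.cuda.synchronize()                       # inputs materialized before cross-stream reads
meta = plan(batches[0])
for i, b in enumerate(batches):
    with torch.cuda.stream(plan_stream):       # plan batch i+1 on the side stream ...
        next_meta = plan(batches[(i + 1) % len(batches)])
    forward(b, meta)                           # ... while batch i computes on forward_stream
    forward_stream.wait_stream(plan_stream)    # sync before next_meta is consumed
    meta = next_meta
torch.cuda.synchronize()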

@b8zhong (Collaborator) commented Oct 25, 2025

@Ximingwang-09, welcome to try #12128.

lpc0220 pushed a commit to lpc0220/sglang that referenced this pull request Oct 29, 2025
Co-authored-by: Lianmin Zheng <15100009+merrymercy@users.noreply.github.com>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>