
LLM Inference Engine Benchmarks

This collection of open-source LLM inference engine benchmarks provides fair, reproducible one-line commands to compare different inference engines on identical hardware, across different infrastructures -- your own clouds or Kubernetes clusters.

We use SkyPilot YAML to ensure consistent and reproducible infrastructure deployment across benchmarks.
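
For context, a SkyPilot task YAML pins the hardware, environment variables, and run commands in a single file. Below is a minimal illustrative sketch -- the field names (resources, envs, run) are standard SkyPilot task fields, but the values are assumptions, not the repo's actual benchmark.yaml:

# Minimal illustrative SkyPilot task (values are assumptions, not the
# repo's actual benchmark.yaml)
resources:
  cloud: nebius
  accelerators: H200:8   # pin the same GPUs for every engine

envs:
  MODEL: deepseek-r1
  ENGINE: vllm

run: |
  echo "Benchmarking $MODEL with $ENGINE"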

Background

As different LLM inference engines start to post their performance numbers [1, 2], the differing results can be confusing: they may stem from different configurations or different hardware setups.

This repo aims to provide a centralized place for these benchmarks, so they can run on the same hardware with the optimal configurations (e.g., TP, DP) set by the official teams of the respective inference engines.

Disclaimer: This repo was created for learning and is not affiliated with any of the inference engine teams.

Installation

pip install -U "skypilot[nebius]"

Set up cloud credentials. See the SkyPilot docs.
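
After installing, you can verify that SkyPilot picks up your credentials with sky check (a standard SkyPilot command):

# Verify which clouds SkyPilot can access with the current credentials
sky check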

Version

The versions of the inference engines are as follows:

  • vLLM: 0.8.4
  • SGLang: 0.4.5.post1/0.4.5.post3
  • TRT-LLM: NOT SUPPORTED YET

Benchmark from vLLM

The vLLM team created a benchmark comparing vLLM against SGLang and TRT-LLM.

To run the benchmarks:

Note

The vLLM team runs the benchmarks on Nebius H200 machines, so we use --cloud nebius below.

Run the benchmarks

cd ./vllm

# Run the benchmarks for vLLM
sky launch --cloud nebius -c benchmark -d benchmark.yaml \
  --env HF_TOKEN \
  --env MODEL=deepseek-r1 \
  --env ENGINE=vllm

# Run the benchmarks for SGLang
# Note: the first run of SGLang yields roughly half the throughput, likely
# due to JIT code generation. In benchmark.yaml, we discard the first run
# and run the sweeps again.
sky launch --cloud nebius -c benchmark -d benchmark.yaml \
  --env HF_TOKEN \
  --env MODEL=deepseek-r1 \
  --env ENGINE=sgl


# This is not supported yet
# sky launch --cloud nebius -c benchmark benchmark.yaml \
#   --env HF_TOKEN \
#   --env MODEL=deepseek-r1 \
#   --env ENGINE=trt

Automatically stop the cluster after the benchmarks are done:

sky autostop benchmark
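
Once you are done, you can inspect and tear down the cluster with the standard SkyPilot commands:

sky status          # show the state of the benchmark cluster
sky down benchmark  # terminate the cluster immediately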

Note

If you would like to run the benchmarks on different infrastructure, change --cloud to another cloud, or to your Kubernetes cluster with --cloud k8s.

You can also change the model to one of the following: deepseek-r1, qwq-32b, llama-8b, llama-3b, qwen-1.5b.
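
For example, combining both changes to benchmark Llama-8B on a Kubernetes cluster (mirroring the commands above):

sky launch --cloud k8s -c benchmark -d benchmark.yaml \
  --env HF_TOKEN \
  --env MODEL=llama-8b \
  --env ENGINE=vllm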

Benchmark Results for DeepSeek-R1

  • CPU: Intel(R) Xeon(R) Platinum 8468
  • GPU: 8x NVIDIA H200

Output token throughput (tok/s)

Input Tokens | Output Tokens | vLLM v0.8.4 | SGLang v0.4.5.post1
-------------|---------------|-------------|--------------------
1000         | 2000          | 1136.92     | 1041.14
5000         | 1000          | 857.13      | 821.40
10000        | 500           | 441.53      | 389.84
30000        | 100           | 37.07       | 33.94
sharegpt     | sharegpt      | 1330.60     | 981.47

Logs

Benchmark from SGLang

The SGLang team created a benchmark for SGLang with random input and output lengths; this benchmark uses the same configurations as the original.
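
For reference, SGLang's random input/output benchmark is typically driven by its bench_serving module. A sketch of such an invocation is below; the exact flags and values used in benchmark.yaml may differ:

# Illustrative invocation of SGLang's serving benchmark (flags and values
# are a sketch, not necessarily those used in benchmark.yaml)
python3 -m sglang.bench_serving --backend sglang \
  --dataset-name random \
  --random-input-len 1000 --random-output-len 2000 \
  --num-prompts 50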

Run the benchmarks

cd ./sgl

# Run the benchmarks for SGLang
sky launch --cloud nebius -c benchmark benchmark.yaml \
  --env HF_TOKEN \
  --env ENGINE=sgl

# Run the benchmarks for vLLM
sky launch --cloud nebius -c benchmark benchmark.yaml \
  --env HF_TOKEN \
  --env ENGINE=vllm

Benchmark Results for DeepSeek-R1

  • CPU: Intel(R) Xeon(R) Platinum 8468
  • GPU: 8x NVIDIA H200

Output token throughput (tok/s)

Input Tokens | Output Tokens | vLLM v0.8.4 (2025-04-14) | SGLang v0.4.5.post3 (2025-04-21)
-------------|---------------|--------------------------|---------------------------------
1000         | 2000          | 1042.17                  | 1329.14
5000         | 1000          | 794.54                   | 951.64
10000        | 500           | 436.08                   | 479.69
30000        | 100           | 37.76                    | 47.38

Logs

Logs are dumped with sky logs benchmark > vllm/logs/$ENGINE-deepseek-r1.log.

Output token throughput (tok/s): Using 200 prompts (vs 50 prompts in the official benchmark)

Input Tokens | Output Tokens | vLLM v0.8.4 (2025-04-14) | SGLang v0.4.5.post3 (2025-04-21)
-------------|---------------|--------------------------|---------------------------------
1000         | 2000          | 2498.90                  | 3276.05
5000         | 1000          | 930.93                   | 1322.31
10000        | 500           | 341.70                   | 501.95
30000        | 100           | 38.44                    | 47.68
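
The only change in this run is the prompt count; in a bench_serving-style invocation like the sketch above, that corresponds to raising --num-prompts from 50 to 200 (illustrative -- the actual knob lives in benchmark.yaml):

# Same sweep as above, with 200 prompts instead of 50 (illustrative)
python3 -m sglang.bench_serving --backend sglang \
  --dataset-name random \
  --random-input-len 1000 --random-output-len 2000 \
  --num-prompts 200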

Logs

Contribution

Contributions from the community are welcome -- for example, tuning the versions and configurations of the different inference engines -- to make the benchmarks more accurate and fair.

Final Thoughts

Interestingly, vLLM's and SGLang's official benchmarks produce diverging results, even on the same hardware and with the same flags.

Although both benchmark scripts try to simulate real inference scenarios, the throughput numbers are very sensitive to the benchmark setup -- simply changing the number of prompts from 50 to 200 can flip the conclusion about which engine performs better.

A better benchmark is needed to provide more insight into the performance of inference engines; in the meantime, this repo offers a platform for the community to run benchmarks in a fair and reproducible way -- same settings, same hardware, etc.
