This collection of open-source LLM inference engine benchmarks provides fair and reproducible one-line commands to compare different inference engines on identical hardware across different infrastructures -- your own clouds or Kubernetes clusters.
We use SkyPilot YAML to ensure consistent and reproducible infrastructure deployment across benchmarks.
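For reference, a SkyPilot task YAML pins the hardware, environment variables, and commands to run, so every launch provisions the same setup. The sketch below is illustrative only; the actual benchmark.yaml files in this repo additionally pin the engine versions, models, and benchmark sweeps.

# Illustrative sketch of a SkyPilot task YAML (not the repo's actual benchmark.yaml)
resources:
  accelerators: H200:8   # request 8x H200 on whichever cloud is selected

envs:
  MODEL: deepseek-r1
  ENGINE: vllm

run: |
  echo "serve $MODEL with $ENGINE and run the benchmark sweeps here"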
As different LLM inference engines post their own performance numbers [1, 2], it can be confusing to compare them: the differences may come from different configurations or different hardware setups.
This repo aims to be a centralized place where these benchmarks can run on the same hardware, with the optimal configurations (e.g., TP, DP) that can be set by the official teams of those inference engines.
Disclaimer: This repo is created for learning and is not affiliated with any of the inference engine teams.
pip install -U "skypilot[nebius]"

Set up cloud credentials. See the SkyPilot docs.
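You can verify that SkyPilot can access the configured infrastructure with:

sky check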
The versions of the inference engines are as follows:
- vLLM: 0.8.4
- SGLang: 0.4.5.post1/0.4.5.post3
- TRT-LLM: NOT SUPPORTED YET
The vLLM team created a benchmark comparing vLLM with SGLang and TRT-LLM.
To run the benchmarks:
Note
The vLLM team runs the benchmarks on Nebius H200 machines, so we use --cloud nebius below.
cd ./vllm
# Run the benchmarks for vLLM
sky launch --cloud nebius -c benchmark -d benchmark.yaml \
--env HF_TOKEN \
--env MODEL=deepseek-r1 \
--env ENGINE=vllm
# Run the benchmarks for SGLang
# Note: the first run of SGLang will have half of the throughput, likely due to
# the JIT code generation. In the benchmark.yaml, we discard the first run and
# run the sweeps again.
sky launch --cloud nebius -c benchmark -d benchmark.yaml \
--env HF_TOKEN \
--env MODEL=deepseek-r1 \
--env ENGINE=sgl
# This is not supported yet
# sky launch --cloud nebius -c benchmark benchmark.yaml \
# --env HF_TOKEN \
# --env MODEL=deepseek-r1 \
# --env ENGINE=trt

Automatically stop the cluster after the benchmarks are done:

sky autostop benchmark

Note
If you would like to run the benchmarks on different infrastructure, you can change --cloud to another cloud or to your Kubernetes cluster with --cloud k8s.
You can also change the model to one of the following: deepseek-r1, qwq-32b, llama-8b, llama-3b, qwen-1.5b.
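For example, running the same benchmark on a Kubernetes cluster with a smaller model could look like the following (the model and engine choices here are just for illustration):

sky launch --cloud k8s -c benchmark -d benchmark.yaml \
--env HF_TOKEN \
--env MODEL=llama-8b \
--env ENGINE=vllm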
- CPU: Intel(R) Xeon(R) Platinum 8468
- GPU: 8x NVIDIA H200
Output token throughput (tok/s)
| Input Tokens | Output Tokens | vLLM v0.8.4 | SGLang v0.4.5.post1 |
|---|---|---|---|
| 1000 | 2000 | 1136.92 | 1041.14 |
| 5000 | 1000 | 857.13 | 821.40 |
| 10000 | 500 | 441.53 | 389.84 |
| 30000 | 100 | 37.07 | 33.94 |
| sharegpt | sharegpt | 1330.60 | 981.47 |
Logs
- vLLM logs: vllm_deepseek-r1.log
- SGLang logs: sgl_deepseek-r1.log
Logs are dumped with sky logs benchmark > vllm/logs/$ENGINE-deepseek-r1.log.
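Since both engines are launched on the same benchmark cluster, each sky launch is recorded as a separate job; if you need the log of an earlier run, you can list the jobs and pass a job ID to sky logs (the job ID below is just an example):

sky queue benchmark
sky logs benchmark 1 > vllm/logs/vllm-deepseek-r1.log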
The SGLang team created a benchmark for SGLang on random inputs and outputs. This benchmark uses the same configurations as that one.
cd ./sgl
# Run the benchmarks for SGLang
sky launch --cloud nebius -c benchmark benchmark.yaml \
--env HF_TOKEN \
--env ENGINE=sgl
# Run the benchmarks for vLLM
sky launch --cloud nebius -c benchmark benchmark.yaml \
--env HF_TOKEN \
--env ENGINE=vllm

- CPU: Intel(R) Xeon(R) Platinum 8468
- GPU: 8x NVIDIA H200
Output token throughput (tok/s)
| Input Tokens | Output Tokens | vLLM v0.8.4 (2025-04-14) | SGLang v0.4.5.post3 (2025-04-21) |
|---|---|---|---|
| 1000 | 2000 | 1042.17 | 1329.14 |
| 5000 | 1000 | 794.54 | 951.64 |
| 10000 | 500 | 436.08 | 479.69 |
| 30000 | 100 | 37.76 | 47.38 |
Logs
- vLLM logs: vllm-deepseek-r1.log
- SGLang logs: sgl-0.4.5.post3-deepseek-r1.log
Logs are dumped with sky logs benchmark > vllm/logs/$ENGINE-deepseek-r1.log.
Output token throughput (tok/s): Using 200 prompts (vs 50 prompts in the official benchmark)
| Input Tokens | Output Tokens | vLLM v0.8.4 (2025-04-14) | SGLang v0.4.5.post3 (2025-04-21) |
|---|---|---|---|
| 1000 | 2000 | 2498.90 | 3276.05 |
| 5000 | 1000 | 930.93 | 1322.31 |
| 10000 | 500 | 341.70 | 501.95 |
| 30000 | 100 | 38.44 | 47.68 |
Logs
- vLLM logs: vllm-deepseek-r1-200.log
- SGLang logs: sgl-0.4.5.post3-deepseek-r1-200.log
Contributions from the community are welcome to tune the versions and configurations for different inference engines, so as to make the benchmarks more accurate and fair.
Interestingly, the official benchmark results from vLLM and SGLang diverge, even on the same hardware and with the same flags.
Although both benchmark scripts try to simulate realistic inference workloads, the throughput numbers are very sensitive to the benchmark setup -- simply changing the number of prompts from 50 to 200 can flip the conclusion about which engine performs better.
A better benchmark is needed to provide more insight into the performance of inference engines, and this repo can serve as a platform for the community to run benchmarks in a fair and reproducible way, with the same settings, same hardware, etc.