Open
1299 commits
0ff8ebb
[V0 Deprecation] Remove async_output_proc, preemption mode, delay fac…
WoosukKwon Sep 21, 2025
c438b29
feat: Enable engine-level arguments with speculators models (#25250)
rahul-tuli Sep 21, 2025
1c3ffdb
[V0 Deprecation] Remove V0 sampling metadata (#25345)
WoosukKwon Sep 21, 2025
af7dfb0
[Perf] Further optimization for Qwen3-VL `fast_pos_embed_interpolate`…
Isotr0py Sep 21, 2025
bc6e542
Remove V0 attention backends (#25351)
WoosukKwon Sep 21, 2025
04d3752
[Bugfix][V0 Deprecation][CI] use async mock and await for async metho…
KKSK-DON Sep 21, 2025
5aeb925
Multimodal - audio tests (#25285)
debroy-rh Sep 21, 2025
7b57a43
[Model] Support Dots OCR (#24645)
ywang96 Sep 22, 2025
793be8d
[Docs] GSM8K Accuracy Evaluation doc update (#25360)
david6666666 Sep 22, 2025
0eecb31
[Bugfix] Fix hermes tool parser handling of non-string argument types…
david6666666 Sep 22, 2025
6d0b827
[V0 Deprecation] Remove V0-only methods in multi-modal registry (#25362)
DarkLight1337 Sep 22, 2025
f92d952
[V0 Deprecation] Remove `MultiModalPlaceholderMap` (#25366)
DarkLight1337 Sep 22, 2025
21467f9
Enable Eagle3 speculative decoding for GPT-OSS model (#25246)
eldarkurtic Sep 22, 2025
a66d131
[TPU][Bugfix][CI] Fix broken tests/build dependency (#25255)
NickLucche Sep 22, 2025
4cf71cc
[TPU] Deprecate `xm.mark_step` in favor of `torch_xla.sync` (#25254)
NickLucche Sep 22, 2025
b6f01bd
refactor: abstract graph mode support into platform interface (#25161)
yiz-liu Sep 22, 2025
417a164
[Misc] Remove unused encoder-decoder error strings (#25374)
DarkLight1337 Sep 22, 2025
64c824c
Make pickle import check fast (#25379)
hmellor Sep 22, 2025
3d2c56b
Make `mypy` behave like a proper pre-commit hook (#25313)
hmellor Sep 22, 2025
ac24388
[Kernel] MI-300X triton moe configs (#23445)
Sara-KS Sep 22, 2025
c10101a
[Bugfix] Fix several issues with p2p xPyD in GET type (#23993)
Csrayz Sep 22, 2025
175811e
[V1][Attention] Split triton_attn in triton-only and rocm specific ba…
bringlein Sep 22, 2025
06a4133
[EPLB] Reduce EPLB Inference Overhead (#24573)
abmfy Sep 22, 2025
cfbee3d
[CLI env var] Add VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH in en…
Daisy-Ma-coder Sep 22, 2025
1d7f95b
[Compiler] Disable Inductor standalone compile by default (#25391)
ElizaWszola Sep 22, 2025
239ef0c
[CI Failure] Fix fp8 kv cache on <SM90 (#25396)
mgoin Sep 22, 2025
922979b
[DP] support torchrun external launcher with Data Parallelism (#24899)
luccafong Sep 22, 2025
8d0ee5a
[misc] Remove RFC review hours reference (#25416)
simon-mo Sep 22, 2025
d5e0fca
[torch.compile] Cleanup compilation tests and custom passes, add debu…
ProExpertProg Sep 22, 2025
8db2939
[KV offload][5/N] Add `CPUOffloadingSpec` (#24251)
orozery Sep 22, 2025
f552d5e
[CI/Build] Skip Qwen3-VL initialization tests until models are actual…
DarkLight1337 Sep 22, 2025
8bed179
[TPU] update torch_xla dependency for PyPI compatibility (#25278)
jcyang43 Sep 22, 2025
45d7d85
[Frontend] Responses API MCP tools for built in tools and to pass thr…
alecsolder Sep 22, 2025
d588cd2
[Bugfix] fix custom op test (#25429)
ProExpertProg Sep 23, 2025
f31ff87
[Core] Drop overly aggressive whisper assertion (#25408)
russellb Sep 23, 2025
0901970
[Bugfix] Fix missing `clear_connector_metadata` (#25397)
NickLucche Sep 23, 2025
ac0048c
[BugFix] [DP/EP] Fix slow execution when BS <= DP (#25407)
MatthewBonanni Sep 23, 2025
0b7bed9
[Performance] Remove input pads in cutlass_mla and optimize v_proj ou…
alexm-redhat Sep 23, 2025
9949aa2
[Perf] Apply torch.compile for `per_block_cast_to_fp8` (#24611)
yewentao256 Sep 23, 2025
6fa78d8
[V0 deprecation] Remove platform v1 controling interface (#25410)
Isotr0py Sep 23, 2025
c625f90
[V0 deprecation] Remove `_set_default_args_v0` function (#25409)
Isotr0py Sep 23, 2025
4741239
[Bug] Fix Long Context OOM Issue (#25290)
yewentao256 Sep 23, 2025
fc97733
[feat] Support MRoPE + YaRN (#25384)
JJJYmmm Sep 23, 2025
f225ea7
[XPU] Fix `compile_size` is `None` case. (#25433)
jikunshang Sep 23, 2025
eea1783
[benchmarks]allow skip ready check for bench serve (#25420)
luccafong Sep 23, 2025
78237e4
[Bugfix] Remove contiguous output req for context parallel MLA (#25414)
mgoin Sep 23, 2025
fafbe11
[Docs] Fix griffe warnings in vllm/lora/ops (#25369)
windsonsea Sep 23, 2025
e8db44f
[DP/EP][GPTOSS] Use triton matmul-ogs kernels for GPTOSS DP/EP (#24588)
varun-sundar-rabindranath Sep 23, 2025
5774b0a
[NIXL][OOT platform] support nixl_connector with oot platform and oth…
xuechendi Sep 23, 2025
c98be0a
[Model] Enable DP for ViT in Qwen2-VL (#25445)
DarkLight1337 Sep 23, 2025
ba8d216
Handle triton kernel import exception (#25319)
minosfuture Sep 23, 2025
9383cd6
[Frontend] Add a new xml-based tool parser for qwen3-coder (#25028)
Zhikaiiii Sep 23, 2025
babad6e
[Misc] Move DP for ViT code inside model executor dir (#25459)
DarkLight1337 Sep 23, 2025
4322c55
[Test]: Hermes tool parser stream output error in Qwen3 case (#25203)
ahartel Sep 23, 2025
231c2c6
[Bugfix] Fix idefics3 `tie_word_embeddings` (#25454)
Isotr0py Sep 23, 2025
273690a
[Core] Optimize LoRA weight loading (#25403)
jeejeelee Sep 23, 2025
0d9fe26
[docs] Benchmark Serving Incorrect Arg (#25474)
vllmellm Sep 23, 2025
b6a136b
[CI/Build] Fix disabled v1 attention backend selection test (#25471)
Isotr0py Sep 23, 2025
61d1b35
[BugFix] Register expert_map as named buffer for wake_up and sleep (#…
wuxibin89 Sep 23, 2025
f05a4f0
[P/D] Support NIXL connector to disconnect during a clean shutdown (#…
chaunceyjiang Sep 23, 2025
da5e7e4
[Docs] NixlConnector quickstart guide (#24249)
panpan0000 Sep 23, 2025
4c966e4
[XPU] Fix MOE DP accuracy issue on XPU (#25465)
faaany Sep 23, 2025
2c58742
[UX] Change kv-cache-memory log level to debug (#25479)
mgoin Sep 23, 2025
a903669
[V1] Remove V0 code paths for Hybrid models (#25400)
tdoublep Sep 23, 2025
cc1dc7e
[Core/DBO][2/N] Dual-Batch Overlap add DeepEP High Throughput support…
LucasWilkinson Sep 23, 2025
875d6de
Add backward compatibility for `GuidedDecodingParams` (#25422)
hmellor Sep 23, 2025
f11e3c5
[Kernels] Support blocked fp8 quantization for compressed tensors MoE…
bnellnm Sep 23, 2025
2357480
[BugFix] Fix UB in per_token_group_quant.cu (#24913)
rivos-shreeasish Sep 23, 2025
846197f
[Log] Optimize kv cache memory log from Bytes to GiB (#25204)
yewentao256 Sep 23, 2025
527821d
Use macro guard CUDA functions for back compatibility in grouped_topk…
minosfuture Sep 23, 2025
100b630
[V1][Kernel] Add triton implementation for `reshape_and_cache_flash` …
bringlein Sep 23, 2025
24e8222
[Misc] Reduce initialization time of auto_tune (#23682)
wdhongtw Sep 23, 2025
867ecdd
[Spec Decode][CI] Add e2e test for `examples/spec_decode.py` and prev…
ekagra-ranjan Sep 23, 2025
5abb117
[Core] Ensure LoRA linear respect the base_layer's tp_size and tp_ran…
jeejeelee Sep 23, 2025
a3a7828
[ROCm] Add skinny gemm bias support for dtypes fp16,bf16,fp8 (#24988)
amd-hhashemi Sep 23, 2025
8c1c81a
[core] add nccl symmetric memory for all reduce (#24532)
Amir-19 Sep 23, 2025
6340025
[Performance] Move apply_w8a8_block_fp8_linear to an op class (#24666)
ElizaWszola Sep 23, 2025
24fab45
[Perf] Change default CUDAGraphMode from PIECEWISE to FULL_AND_PIECEW…
mgoin Sep 23, 2025
d5944d5
[Speculators][Speculative Decoding] Fix gpt-oss eagle3 accuracy issue…
jiahanc Sep 23, 2025
a8ffc4f
[Bugfix] Lower gpt-oss max cudagraph size to 992 to be compatible wit…
mgoin Sep 23, 2025
8bdd8b5
Enable symmetric memory all reduce by default only enabling for TP (#…
ilmarkov Sep 23, 2025
8b8a8af
[CI] Fix Pre-commit Issue (#25497)
yewentao256 Sep 23, 2025
c828d1b
[Bugfix] gpt-oss container tool output bug (#25485)
alecsolder Sep 23, 2025
08275ec
[Build] Update Xgrammar to 0.1.25 (#25467)
chaunceyjiang Sep 23, 2025
690f948
[Bugfix] Fix for the import error from #24588 (#25481)
gshtras Sep 23, 2025
ae00292
[CI/Build] Fix and re-enable v1 PP test on CI (#25496)
Isotr0py Sep 23, 2025
4f8c4b8
[Core] Use KVCacheBlock as much as possible instead of dict[block_id,…
Jialin Sep 23, 2025
969b4da
[V0 Deprecation] Remove placeholder attn (#25510)
tdoublep Sep 23, 2025
eca7be9
Add VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE & VLLM_ENABLE_INDUCTOR_COORDINA…
rouchenzi Sep 23, 2025
4f2954f
Fix triton_reshape_and_cache_flash.py triton import (#25522)
mgoin Sep 23, 2025
95bc60e
[gpt-oss][bugfix] remove logic to require resp_ in ResponseAPI (#25428)
qandrew Sep 23, 2025
7361ab3
Remove redundant mutates_args and dispatch_key for direct_register_cu…
mgoin Sep 23, 2025
abad204
[BugFix] Fix OOM in vLLM replicas by ensuring consistent NCCL memory …
kouroshHakha Sep 23, 2025
c85d75c
Add `VLLM_NVTX_SCOPES_FOR_PROFILING=1` to enable `nvtx.annotate` scop…
chelsea0x3b Sep 23, 2025
5e25b12
[Kernel] [Mamba] Remove BLOCK_H=1 from list of tuneable configuration…
tdoublep Sep 23, 2025
bde2a1a
[ROCm] Small functional changes for gptoss (#25201)
jpvillam-amd Sep 23, 2025
e0b24ea
[Perf] Increase default max splits for FA3 full cudagraphs (#25495)
LucasWilkinson Sep 23, 2025
1210e4d
[Bugfix] [B200] cutlass_mla - ensure kv_split == 1 for batch size > 1…
alexm-redhat Sep 23, 2025
dc464a3
[BugFix] AssertionError: Do not capture num_reqs > max_num_reqs for u…
LucasWilkinson Sep 24, 2025
7ad5e50
Improve output when failing json.loads() on structured output test (#…
dougbtv Sep 24, 2025
0d235b8
Add CUTLASS FP8 MOE benchmark scripts and kernel config (#25302)
chenxi-yang Sep 24, 2025
88d7bdb
[Bug] Fix AttributeError: 'FusedMoE' object has no attribute 'w13_wei…
yewentao256 Sep 24, 2025
c8bde93
[BUG] Allows for RunAI Streamer and Torch.compile cache to be used to…
hao-aaron Sep 24, 2025
be0bb56
[Model] Support SeedOss Reason Parser (#24263)
LuYanFCP Sep 24, 2025
d06b5a9
[V1][Metrics] Add per-request TPOT histogram (#24015)
baxingpiaochong Sep 24, 2025
1983609
[Bugfix] Use a separate FlashInfer workspace buffer for trtllm-gen (#…
benchislett Sep 24, 2025
de94289
[Core] Support weight_loader_v2 for `UnquantizedLinearMethod` (#23036)
kylesayrs Sep 24, 2025
bf68fd7
[Compile] Fix AMD Compile Error (#25518)
yewentao256 Sep 24, 2025
9df8da5
[BugFix] Fix MLA assert with CUTLASS MLA (#25478)
LucasWilkinson Sep 24, 2025
359d293
[fix]: add Arm 4bit fused moe support (#23809)
nikhil-arm Sep 24, 2025
77d9069
[KV sharing] Re-land Gemma3n model changes from #22628 (#24357)
shfoss Sep 24, 2025
c30b405
[Spec Decode] Enable FlashInfer Spec Decoding (#25196)
benchislett Sep 24, 2025
d747c2e
[Perf] Fix jit compiles at runtime of fla gated delta rule (#25432)
chelsea0x3b Sep 24, 2025
5caaeb7
[Bugfix] [Frontend] Cleanup gpt-oss non-streaming chat tool calls (#2…
bbrowning Sep 24, 2025
190c45a
[TPU][Bugfix] fix the missing apply_model in tpu worker (#25526)
yaochengji Sep 24, 2025
fed8a9b
[Misc] Retry HF processing if "Already borrowed" error occurs (#25535)
DarkLight1337 Sep 24, 2025
1cbcfb9
[Bugfix][CPU] Skip unsupported custom op register on CPU (#25534)
bigPYJ1151 Sep 24, 2025
27ec3c7
[CI/Build] Fix v1 OOT registration test (#25547)
Isotr0py Sep 24, 2025
6488f34
[Misc]] Move processing context to multimodal directory (#25548)
DarkLight1337 Sep 24, 2025
77a7fce
[CI/Build] add nightly prime-rl integration tests (#25207)
Jackmin801 Sep 24, 2025
2e19a84
[V0 Deprecation] Remove max_seq_len_to_capture (#25543)
WoosukKwon Sep 24, 2025
2338daf
[BugFix] Potential Fix for FA3 full-cudagraph IMA (#25490)
LucasWilkinson Sep 24, 2025
b67dece
[misc] update the warning message (#25566)
youkaichao Sep 24, 2025
42488da
[Bugfix] Fix dummy video number of frames calculation (#25553)
ywang96 Sep 24, 2025
58c360d
[Bug] fix import and unit test (#25558)
jmkuebler Sep 24, 2025
1642995
[Benchmark] Fix regression in structured output benchmark (#25500)
russellb Sep 24, 2025
b106890
[docs] fix nixl kv_connector_extra_config.backends key (#25565)
panpan0000 Sep 24, 2025
e18b714
[Bugfix] Fix DeepSeekV31ToolParser to correctly parse multiple tools …
taohui Sep 24, 2025
8938774
Move `DeviceConfig`, `ObservabilityConfig`, `SpeechToTextConfig` to t…
hmellor Sep 24, 2025
9313be5
[Misc] Improve type annotations for jsontree (#25577)
DarkLight1337 Sep 24, 2025
487745f
[ROCm][Bugfix] Only enable +rms_norm based on aiter if not explicitly…
gshtras Sep 24, 2025
302eb94
[ROCm][Build][Bugfix] Fix ROCm base docker whls installation order (#…
gshtras Sep 24, 2025
d83f3f7
Fixes and updates to bench_per_token_quant_fp8 (#25591)
mgoin Sep 24, 2025
2dda3e3
[Bugfix] add cache model when from object storage get model (#24764)
lengrongfu Sep 24, 2025
54e42b7
Support mnnvl all2allv from Flashinfer (#21003)
wenscarl Sep 24, 2025
f84a472
Suppress benign cuBLAS warning when capturing cudagraphs with DBO (#2…
SageMoore Sep 24, 2025
8c85305
[Docs] Enable `fail_on_warning` for the docs build in CI (#25580)
hmellor Sep 24, 2025
e6750d0
[V0 Deprecation] Remove unused classes in attention (#25541)
WoosukKwon Sep 24, 2025
fea8006
[Logging] Improve log for when DeepEP HT disables CUDA Graphs (#25531)
tlrmchlsmth Sep 24, 2025
6160ba4
feat: BF16 FlashInfer Fused Cutlass MOE for Hopper and Blackwell Expe…
djmmoss Sep 24, 2025
1f29141
[Refactor] Use DeepGEMM Col Major TMA Aligned Tensor (#25517)
yewentao256 Sep 24, 2025
e7f27ea
Improve `--help` for enhanced user experience (#24903)
hmellor Sep 24, 2025
5c1e496
[MISC] replace c10::optional with std::optional (#25602)
842974287 Sep 24, 2025
52d0cb8
[Model] Improve DotsOCRForCausalLM (#25466)
jeejeelee Sep 24, 2025
05c1948
[Kernel] Support DCP for Triton backend (#25132)
frank-wei Sep 25, 2025
4492e3a
[Bug] Dynamo Unsupported due to `BasevLLMParameter.torch_function` ca…
yewentao256 Sep 25, 2025
90b139c
Enable Fbgemm NVFP4 on Dense models (#25609)
samanamp Sep 25, 2025
845adb3
[Model] Add LongCat-Flash (#23991)
OftenDream Sep 25, 2025
c85be1f
optimize: eliminate duplicate split_enc_dec_inputs calls (#25573)
nicole-lihui Sep 25, 2025
a676e66
[Bugfix] fix apply_temperature to avoid nan in probs (#24734)
courage17340 Sep 25, 2025
755ed7b
[Misc] Simplify PoolerOutput and move to `v1/outputs` (#25629)
DarkLight1337 Sep 25, 2025
bc092ea
Map CwmForCausalLM to llama and LlamaForCausalLM (#25611)
jacobkahn Sep 25, 2025
af4ee63
typo: remove duplicate `is` (#25641)
nicole-lihui Sep 25, 2025
1260180
Revert "[Performance] Move apply_w8a8_block_fp8_linear to an op class…
tlrmchlsmth Sep 25, 2025
393de22
[fix] Update torch version in cpu-build.txt for AArch64/ppc64le and D…
fadara01 Sep 25, 2025
7be9ffc
[Misc] Fix Qwen3-VL `video_grid_thw` typing (#25646)
ywang96 Sep 25, 2025
3c2b2cc
[Bugfix] Add triton.language.tensor placeholder (#25649)
adobrzyn Sep 25, 2025
17b4c66
[Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for video prof…
Isotr0py Sep 25, 2025
12c1287
[mypy] Further improve MM type annotations (#25654)
DarkLight1337 Sep 25, 2025
eaeca3c
[Bugfix] Parse SpeculativeConfig Error (#25142)
yyzxw Sep 25, 2025
7f570f1
[V0 deprecation] Remove unreachable model_config.supported_tasks (#25…
noooop Sep 25, 2025
70fbdb2
Add backward compatibility for `guided_...` API (#25615)
hmellor Sep 25, 2025
0bcc3a1
[CI/Build] Fix flaky entrypoints test (#25663)
DarkLight1337 Sep 25, 2025
d2af674
[XPU][Triton]add xpu config in triton_reshape_and_cache_flash (#25643)
jikunshang Sep 25, 2025
1e9a77e
[Hardware][RISC-V] Add riscv64 support for vLLM with scalar (#22112)
langc23 Sep 25, 2025
2f17117
[mypy] Fix wrong type annotations related to tuple (#25660)
DarkLight1337 Sep 25, 2025
6c340da
[misc] log info messages by default for hanging / busy / idle (#25627)
youkaichao Sep 25, 2025
69a8c8e
[torch.compile] Make Query Quantization Fusable (#24914)
jmkuebler Sep 25, 2025
eb32335
[CPU] update torch 2.8 and fix missing fields in TorchSDPAMetadata (#…
bigPYJ1151 Sep 25, 2025
532a6cf
[ux] Switch a warning to debug about a pytorch fallback (#23750)
russellb Sep 25, 2025
03858e6
[Bugfix] Fix InternS1 video processing after Transformers v4.56 (#25644)
Isotr0py Sep 25, 2025
0754ac4
[Misc] Remove cruft file in repo (#25678)
NickLucche Sep 25, 2025
2e5df88
[Logging] Remove TORCH_NCCL_AVOID_RECORD_STREAMS to squash a warning …
tlrmchlsmth Sep 25, 2025
e04a1b6
[BUGFIX] Fix crash in Eagle Speculative Decoding models when exceedin…
AlonKejzman Sep 25, 2025
916bd92
Revert "[Bug] Dynamo Unsupported due to `BasevLLMParameter.torch_func…
mgoin Sep 25, 2025
13cc7f5
[BugFix] Fix DBO hang (#25625)
LucasWilkinson Sep 25, 2025
b8d9e4a
[Model] Add optional parameter to reasoning parser constructor (#25554)
taohui Sep 25, 2025
0ea80c8
[Model] Define `merge_by_field_config` MM interface (#25676)
DarkLight1337 Sep 25, 2025
71b25b0
[V0 deprecation] Clean up V0 fallback in compilation config (#25675)
Isotr0py Sep 25, 2025
3468f17
[V0 deprecation] Remove _VLLM_V1 suffixes from attention backend name…
MatthewBonanni Sep 25, 2025
0fa673a
[V0 deprecation] Clean up LoRA (#25686)
jeejeelee Sep 25, 2025
6b0fcbb
[Misc] Simplify `test_argsort_mm_positions` (#25690)
DarkLight1337 Sep 25, 2025
3d54bdc
[Optimization] Streamline `InputPreprocessor` (#25702)
DarkLight1337 Sep 25, 2025
89fa54e
[Optimization] Use a cheaper cache key in `get_model_architecture` (#…
DarkLight1337 Sep 25, 2025
e71b8e2
[Spec Decode] Add Batch Parallel Ngram. Upto 8x lower overhead. (#24986)
ekagra-ranjan Sep 25, 2025
8c435c9
[Core] Enable command line logging for LLMEngine (#25610)
zhuohan123 Sep 25, 2025
57329a8
[Model] rename NemotronH_Nano_VL -> NemotronH_Nano_VL_V2 (#25708)
tomeras91 Sep 25, 2025
081b559
Fix routing_bias dtype (#25711)
wenscarl Sep 25, 2025
9fe4c2b
[Refactor] Remove DeepGEMM OP Register (#25710)
yewentao256 Sep 26, 2025
8b77328
[Misc] Don't log shm dequeue delay warning on worker side (#25720)
njhill Sep 26, 2025
53a3084
Llamas 3.1 405B fp4 changes upstreaming from 355_wip (#25135)
maleksan85 Sep 26, 2025
13dd93c
[Core] Force PIECEWISE CUDAGraph mode for encoder-decoder (#25701)
russellb Sep 26, 2025
983056e
[Misc] Remove unnecessary memoryviews in shm_broadcast.py (#25721)
njhill Sep 26, 2025
392edee
EVS Support (Video tokens pruning) (#22980)
BloodAxe Sep 26, 2025
3edf87d
[CI/Build] fix doc build warning: Failed to get 'name: description' p…
yitingdc Sep 26, 2025
e84e073
fix: revert cast to cpu in `MsgpackEncoder._encode_tensor` to avoid h…
qthequartermasterman Sep 26, 2025
d48f4d6
perf: Avoid copying inputs_embeds tensors to GPU unless prompt_embeds…
qthequartermasterman Sep 26, 2025
52621c8
[Harware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI300…
xaguilar-amd Sep 26, 2025
6e30010
fix: print outputt offline_inference/base/chat.py example (#25744)
Iceber Sep 26, 2025
99b3a50
[Qwen3-Next][GDN] fixes cuda graph capturing bug in GDN metadata and …
sighingnow Sep 26, 2025
dd70437
Remove cuda hard-code in compute_causal_conv1d_metadata (#25555)
wxsIcey Sep 26, 2025
19f76ee
[misc] refactor speculative config (#25657)
yyzxw Sep 26, 2025
dfb9af2
[Bugfix] Fix Shared Expert/Zero expert code in FusedMoE.process_chunk…
SageMoore Sep 26, 2025
b03b1b9
Support LongCat-Flash-Chat tool call (#24083)
Xu-Wenqing Sep 26, 2025
633f943
[Doc] Update Batch-level DP docs (#25757)
DarkLight1337 Sep 26, 2025
2b6b1d7
[Model] Mamba2 varlen refactor (#21467)
cyang49 Sep 26, 2025
2827b3f
[CI] Fix test_shared_storage_connector_hashes (#25748)
chaunceyjiang Sep 26, 2025
fe6b19c
[Bugfix] Properly abort pooling request. (#25734)
noooop Sep 26, 2025
bc9d7b5
[CI/Build] Split up Distributed Tests (#25572)
DarkLight1337 Sep 26, 2025
db1e42f
[CI/Build] Fix some V1 tests not being run (#25569)
DarkLight1337 Sep 26, 2025
d4d9899
[Quantization] Add field to skip unquantized modules for GPTQ config …
Isotr0py Sep 26, 2025
984d184
[BugFix] Fix using `dbo_decode_token_threshold` always (and ignoring …
LucasWilkinson Sep 26, 2025
8d52f2b
[ray][metrics] Replace ':' with '_' for OpenTelemetry compatibility i…
eicherseiji Sep 26, 2025
56aafa8
[Misc] fix unique_filepath (#25732)
ZJY0516 Sep 26, 2025
33f6aaf
Eagle3 that supports the Minicpm3 model (#24243)
LDLINGLINGLING Sep 26, 2025
b761df9
[Doc]: improve CPU(x86) build-wheel-from-source section (#25617)
brokedba Sep 26, 2025
bb79c4d
Reduce the Cuda Graph memory footprint when running with DBO (#25779)
SageMoore Sep 26, 2025
ee10d7e
Validate API tokens in constant time (#25781)
russellb Sep 27, 2025
04c2b26
Add filtering for chat template kwargs (#25794)
russellb Sep 27, 2025
32335c8
Add option to restrict media domains (#25783)
russellb Sep 27, 2025
c2fa2d4
[Bugfix] Allow Only SDPA Backend for ViT on B200 for Qwen3-VL (#25788)
yewentao256 Sep 27, 2025
5aa5811
[CI] Fix FlashInfer AOT in release docker image (#25730)
mgoin Sep 26, 2025
26a7a33
[Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models (#24982)
tlrmchlsmth Sep 27, 2025
b14773b
[Bugfix][NIXL] Fix Async Scheduler timeout issue (#25808)
NickLucche Sep 27, 2025
6de3d43
[MM] Optimize memory profiling for scattered multimodal embeddings (#…
ywang96 Sep 28, 2025
19e7ab7
[Bugfix] Fix Qwen3-VL regression from #24982 (#25814)
ywang96 Sep 28, 2025
4c34704
[VLM] Update Qwen3-VL max_num_video_tokens calculation for configurab…
Isotr0py Sep 28, 2025
09c2cbc
[Bugfix] fix Qwen3VLMoe load when pp > 1 (#25838)
JJJYmmm Sep 28, 2025
8ce5d31
[P/D] NIXL Updates (#25844)
robertgshaw2-redhat Sep 29, 2025
ab5b645
[Bugfix] Fallback ViT attn backend to SDPA for blackwell (#25851)
ywang96 Sep 29, 2025
9471879
[Bug] Fix Weight Loading for Block FP8 Cutlass SM90 (#25909)
yewentao256 Sep 30, 2025
03df0fb
[BugFix] Fix DP/EP hang (#25906)
LucasWilkinson Sep 30, 2025
b3230e1
[New Model] DeepSeek-V3.2 (Rebased to Main) (#25896)
zyongye Sep 30, 2025
d0b178c
[NIXL] Add support for MLA caches with different latent dim (#25902)
NickLucche Sep 30, 2025
83f3c9b
[bugfix][deepseek] fix flashmla kernel selection (#25956)
youkaichao Sep 30, 2025
c3dfb0f
[Bench] Add DeepSeekV32 to MoE benchmark (#25962)
jeejeelee Sep 30, 2025
c214d69
[spec decode] Consolidate speculative decode method name for MTP (#25…
zixi-qi Sep 26, 2025
bab9231
[Model] MTP fallback to eager for DeepSeek v32 (#25982)
luccafong Oct 1, 2025
a1825fe
[MM] Add text-only mode for Qwen3-VL (#26000)
ywang96 Oct 1, 2025
febb688
[Bugfix] Fix `__syncwarp` on ROCM (#25996)
zhewenl Oct 1, 2025
e4beabd
[BugFix] Fix default kv-cache-dtype default for DeepseekV3.2 (#25988)
LucasWilkinson Oct 1, 2025
ebce361
[BugFix][DP/EP] Fix CUTLASS MLA hang under load (#26026)
LucasWilkinson Oct 1, 2025
c536881
[BugFix] ChunkedLocalAttention is currently not CG compatible (#26034)
LucasWilkinson Oct 1, 2025
05bf0c5
Update base image to 22.04 (jammy) (#26065)
huydhn Oct 2, 2025
6040e0b
[BugFix] Fix FI accuracy issue when used for MLA prefill (#26063)
LucasWilkinson Oct 2, 2025
9d9a2b7
[Small] Prevent bypassing media domain restriction via HTTP redirects…
huachenheli Oct 2, 2025
c75c2e7
[Deepseek v3.2] Support indexer prefill chunking (#25999)
heheda12345 Oct 2, 2025
d100776
[Bugfix] Disable cascade attention with FlashInfer (#26130)
mgoin Oct 2, 2025
f71952c
[Build/CI] Revert back to Ubuntu 20.04, install python 3.12 with uv (…
tlrmchlsmth Oct 3, 2025
b8b302c
Update CUDA architecture list in build pipeline for 12.9.1 wheels (#2…
wseaton Oct 10, 2025
8 changes: 4 additions & 4 deletions .buildkite/check-wheel-size.py
@@ -5,11 +5,11 @@
 import sys
 import zipfile

-# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 400 MiB
-# Note that we have 400 MiB quota, please use it wisely.
-# See https://github.com/pypi/support/issues/3792 .
+# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 450 MiB
+# Note that we have 800 MiB quota, please use it wisely.
+# See https://github.com/pypi/support/issues/6326 .
 # Please also sync the value with the one in Dockerfile.
-VLLM_MAX_SIZE_MB = int(os.environ.get("VLLM_MAX_SIZE_MB", 400))
+VLLM_MAX_SIZE_MB = int(os.environ.get("VLLM_MAX_SIZE_MB", 450))


 def print_top_10_largest_files(zip_file):
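Review note: this hunk raises the default wheel-size limit from 400 to 450 MiB while still letting `VLLM_MAX_SIZE_MB` override it. A minimal sketch of that check follows, assuming the script compares the wheel size in MiB against the limit with a strict `>`. The helper names `max_wheel_size_mb` and `exceeds_limit` are hypothetical and do not appear in the PR.

```python
import os


def max_wheel_size_mb(env=None, default=450):
    """Return the size limit in MiB, honoring VLLM_MAX_SIZE_MB if set.

    `env` is injectable so the lookup can be exercised without touching
    the real process environment (an assumption made for this sketch).
    """
    env = os.environ if env is None else env
    return int(env.get("VLLM_MAX_SIZE_MB", default))


def exceeds_limit(size_bytes, limit_mb):
    """True when a wheel of `size_bytes` is strictly larger than `limit_mb` MiB."""
    return size_bytes / (1024 * 1024) > limit_mb


if __name__ == "__main__":
    limit = max_wheel_size_mb({})
    print(limit, exceeds_limit(451 * 1024 * 1024, limit))
```

Because the default is duplicated in the Dockerfile (per the comment in the hunk), overriding through the environment variable keeps the two in sync without editing either file.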
23 changes: 21 additions & 2 deletions .buildkite/generate_index.py
@@ -8,7 +8,8 @@
 <html>
 <body>
 <h1>Links for vLLM</h1/>
-<a href="../{wheel_html_escaped}">{wheel}</a><br/>
+<a href="../{x86_wheel_html_escaped}">{x86_wheel}</a><br/>
+<a href="../{arm_wheel_html_escaped}">{arm_wheel}</a><br/>
 </body>
 </html>
 """
@@ -21,7 +22,25 @@

 with open("index.html", "w") as f:
     print(f"Generated index.html for {args.wheel}")
+    # sync the abi tag with .buildkite/scripts/upload-wheels.sh
+    if "x86_64" in filename:
+        x86_wheel = filename
+        arm_wheel = filename.replace("x86_64", "aarch64").replace(
+            "manylinux1", "manylinux2014"
+        )
+    elif "aarch64" in filename:
+        x86_wheel = filename.replace("aarch64", "x86_64").replace(
+            "manylinux2014", "manylinux1"
+        )
+        arm_wheel = filename
+    else:
+        raise ValueError(f"Unsupported wheel: {filename}")
     # cloudfront requires escaping the '+' character
     f.write(
-        template.format(wheel=filename, wheel_html_escaped=filename.replace("+", "%2B"))
+        template.format(
+            x86_wheel=x86_wheel,
+            x86_wheel_html_escaped=x86_wheel.replace("+", "%2B"),
+            arm_wheel=arm_wheel,
+            arm_wheel_html_escaped=arm_wheel.replace("+", "%2B"),
+        )
     )
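Review note: the new branch derives the sibling-architecture wheel name purely by string substitution and then escapes `+` for CloudFront. The sketch below isolates that logic so it can be checked on its own. The function names `wheel_pair` and `escape_plus` and the sample filename are hypothetical; the replacement rules are copied from the hunk.

```python
def wheel_pair(filename: str) -> tuple[str, str]:
    """Return (x86_wheel, arm_wheel) names derived from either one."""
    if "x86_64" in filename:
        arm = filename.replace("x86_64", "aarch64").replace(
            "manylinux1", "manylinux2014"
        )
        return filename, arm
    if "aarch64" in filename:
        x86 = filename.replace("aarch64", "x86_64").replace(
            "manylinux2014", "manylinux1"
        )
        return x86, filename
    raise ValueError(f"Unsupported wheel: {filename}")


def escape_plus(name: str) -> str:
    """Percent-encode '+' so the link survives CloudFront."""
    return name.replace("+", "%2B")


if __name__ == "__main__":
    # Hypothetical wheel name, used only for illustration.
    x86, arm = wheel_pair("vllm-1.0.0+cu129-cp38-abi3-manylinux1_x86_64.whl")
    print(escape_plus(x86))
    print(escape_plus(arm))
```

Note that the mapping assumes the two wheels differ only in the architecture and manylinux tags; any other difference in the filenames would produce a link to a file that does not exist.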
12 changes: 0 additions & 12 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-QQQ.yaml

This file was deleted.

1 change: 0 additions & 1 deletion .buildkite/lm-eval-harness/configs/models-large.txt
@@ -3,4 +3,3 @@ Meta-Llama-3-70B-Instruct.yaml
 Mixtral-8x7B-Instruct-v0.1.yaml
 Qwen2-57B-A14-Instruct.yaml
 DeepSeek-V2-Lite-Chat.yaml
-Meta-Llama-3-8B-QQQ.yaml
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -2,7 +2,7 @@
 # We can use this script to compute baseline accuracy on GSM for transformers.
 #
 # Make sure you have lm-eval-harness installed:
-# pip install lm-eval==0.4.4
+# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]

 usage() {
     echo``
@@ -3,7 +3,7 @@
 # We use this for fp8, which HF does not support.
 #
 # Make sure you have lm-eval-harness installed:
-# pip install lm-eval==0.4.4
+# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]

 usage() {
     echo``
2 changes: 1 addition & 1 deletion .buildkite/nightly-benchmarks/README.md
@@ -141,7 +141,7 @@ When run, benchmark script generates results under `benchmark/results` folder, a
 `compare-json-results.py` compares two `benchmark_results.json` files and provides performance ratio e.g. for Output Tput, Median TTFT and Median TPOT.
 If only one benchmark_results.json is passed, `compare-json-results.py` compares different TP and PP configurations in the benchmark_results.json instead.

-Here is an example using the script to compare result_a and result_b with Model, Dataset name, input/output lenght, max concurrency and qps.
+Here is an example using the script to compare result_a and result_b with Model, Dataset name, input/output length, max concurrency and qps.
 `python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`

 | | Model | Dataset Name | Input Len | Output Len | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
4 changes: 2 additions & 2 deletions .buildkite/nightly-benchmarks/nightly-descriptions.md
@@ -8,7 +8,7 @@ This benchmark aims to:

 Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html), scroll to the end.

-Latest reproduction guilde: [github issue link](https://github.com/vllm-project/vllm/issues/8176)
+Latest reproduction guide: [github issue link](https://github.com/vllm-project/vllm/issues/8176)

 ## Setup

@@ -17,7 +17,7 @@ Latest reproduction guilde: [github issue link](https://github.com/vllm-project/
 - SGLang: `lmsysorg/sglang:v0.3.2-cu121`
 - LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
 - TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
-- *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.*
+- *NOTE: we use r24.07 as the current implementation only works for this version. We are going to bump this up.*
 - Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
 - Hardware
 - 8x Nvidia A100 GPUs
148 changes: 120 additions & 28 deletions .buildkite/nightly-benchmarks/scripts/compare-json-results.py
@@ -3,44 +3,129 @@
 import argparse
 import json
 import os
+from importlib import util

 import pandas as pd

+plotly_found = util.find_spec("plotly.express") is not None
+

 def compare_data_columns(
     files, name_column, data_column, info_cols, drop_column, debug=False
 ):
-    print("\ncompare_data_column: " + data_column)
+    """
+    Align concatenation by keys derived from info_cols instead of row order.
+    - Pick one canonical key list: subset of info_cols present in ALL files.
+    - For each file: set index to those keys, aggregate duplicates
+    - (mean for metric, first for names).
+    - Concat along axis=1 (indexes align), then reset_index so callers can
+    - group by columns.
+    - If --debug, add a <file_label>_name column per file.
+    """
+    print("\ncompare_data_column:", data_column)

     frames = []
     raw_data_cols = []
     compare_frames = []

+    # 1) choose a canonical key list from info_cols that exists in ALL files
+    cols_per_file = []
+    for f in files:
+        try:
+            df_tmp = pd.read_json(f, orient="records")
+        except Exception as err:
+            raise ValueError(f"Failed to read {f}") from err
+        cols_per_file.append(set(df_tmp.columns))
+
+    key_cols = [c for c in info_cols if all(c in cset for cset in cols_per_file)]
+    if not key_cols:
+        # soft fallback: use any info_cols present in the first file
+        key_cols = [c for c in info_cols if c in list(cols_per_file[0])]
+    if not key_cols:
+        raise ValueError(
+            "No common key columns found from info_cols across the input files."
+        )
+
+    # 2) build a single "meta" block (keys as columns) once, aligned by the key index
+    meta_added = False
+
     for file in files:
-        data_df = pd.read_json(file)
-        serving_df = data_df.dropna(subset=[drop_column], ignore_index=True)
-        # Show all info columns in the first couple columns
-        if not frames:
-            for col in info_cols:
-                if col not in serving_df.columns:
-                    print(f"Skipping missing column: {col}")
-                    continue
-                frames.append(serving_df[col])
-        # only show test name under debug mode
-        if debug is True:
-            serving_df = serving_df.rename(columns={name_column: file + "_name"})
-            frames.append(serving_df[file + "_name"])
-
-        file = "/".join(file.split("/")[:-1])
-        serving_df = serving_df.rename(columns={data_column: file})
-        frames.append(serving_df[file])
-        raw_data_cols.append(file)
-        compare_frames.append(serving_df[file])
+        df = pd.read_json(file, orient="records")
+
+        # Keep rows that actually have the compared metric (same as original behavior)
+        if drop_column in df.columns:
+            df = df.dropna(subset=[drop_column], ignore_index=True)
+
+        # Stabilize numeric key columns (harmless if missing)
+        for c in (
+            "Input Len",
+            "Output Len",
+            "TP Size",
+            "PP Size",
+            "# of max concurrency.",
+            "qps",
+        ):
+            if c in df.columns:
+                df[c] = pd.to_numeric(df[c], errors="coerce")
+
+        # Ensure all key columns exist
+        for c in key_cols:
+            if c not in df.columns:
+                df[c] = pd.NA
+
+        # Set index = key_cols and aggregate duplicates → unique MultiIndex
+        df_idx = df.set_index(key_cols, drop=False)
+
+        # meta (key columns), unique per key
+        meta = df_idx[key_cols]
+        if not meta.index.is_unique:
+            meta = meta.groupby(level=key_cols, dropna=False).first()
+
+        # metric series for this file, aggregated to one row per key
+        file_label = "/".join(file.split("/")[:-1]) or os.path.basename(file)
+        s = df_idx[data_column]
+        if not s.index.is_unique:
+            s = s.groupby(level=key_cols, dropna=False).mean()
+        s.name = file_label  # column label like original
+
+        # add meta once (from first file) so keys are the leftmost columns
+        if not meta_added:
+            frames.append(meta)
+            meta_added = True
+
+        # (NEW) debug: aligned test-name column per file
+        if debug and name_column in df_idx.columns:
+            name_s = df_idx[name_column]
+            if not name_s.index.is_unique:
+                name_s = name_s.groupby(level=key_cols, dropna=False).first()
+            name_s.name = f"{file_label}_name"
+            frames.append(name_s)
+
+        frames.append(s)
+        raw_data_cols.append(file_label)
+        compare_frames.append(s)
+
+        # Generalize ratio: for any file N>=2, add ratio (fileN / file1)
         if len(compare_frames) >= 2:
-            # Compare numbers among two files
-            ratio_df = compare_frames[1] / compare_frames[0]
-            frames.append(ratio_df)
-            compare_frames.pop(1)
+            base = compare_frames[0]
+            current = compare_frames[-1]
+            ratio = current / base
+            ratio = ratio.mask(base == 0)  # avoid inf when baseline is 0
+            ratio.name = f"Ratio 1 vs {len(compare_frames)}"
+            frames.append(ratio)

+    # 4) concat on columns with aligned MultiIndex;
+    # then reset_index to return keys as columns
     concat_df = pd.concat(frames, axis=1)
+    concat_df = concat_df.reset_index(drop=True).reset_index()
+    if "index" in concat_df.columns:
+        concat_df = concat_df.drop(columns=["index"])
+
+    # Ensure key/info columns appear first (in your info_cols order)
+    front = [c for c in info_cols if c in concat_df.columns]
+    rest = [c for c in concat_df.columns if c not in front]
+    concat_df = concat_df[front + rest]

     print(raw_data_cols)
     return concat_df, raw_data_cols
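Review note: the core idea of the rewritten `compare_data_columns` is to align results by a key built from the info columns rather than by row order, average duplicate rows per key, and blank out the ratio when the baseline is zero. The sketch below illustrates that idea with the standard library only. It is not the PR's pandas implementation: `align_by_key` and `ratio_vs_baseline` are hypothetical names, and `None` stands in for the NaN that `Series.mask` would produce.

```python
from collections import defaultdict
from statistics import mean


def align_by_key(rows, key_cols, metric):
    """Map each key tuple to the mean of its metric values.

    Rows without the metric are skipped, mirroring the dropna step.
    """
    buckets = defaultdict(list)
    for row in rows:
        value = row.get(metric)
        if value is None:
            continue
        buckets[tuple(row.get(c) for c in key_cols)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


def ratio_vs_baseline(base, current):
    """Return current/base per key; None if a side is missing or base is 0."""
    ratios = {}
    for key in base.keys() | current.keys():
        b, c = base.get(key), current.get(key)
        ratios[key] = None if b is None or c is None or b == 0 else c / b
    return ratios


if __name__ == "__main__":
    keys = ["Model", "qps"]
    run_a = [
        {"Model": "m", "qps": 1, "tput": 100.0},
        {"Model": "m", "qps": 2, "tput": 0.0},
    ]
    # Different row order and a duplicate key, which row-order concat mishandles.
    run_b = [
        {"Model": "m", "qps": 2, "tput": 50.0},
        {"Model": "m", "qps": 1, "tput": 150.0},
        {"Model": "m", "qps": 1, "tput": 250.0},
    ]
    base = align_by_key(run_a, keys, "tput")
    cur = align_by_key(run_b, keys, "tput")
    print(ratio_vs_baseline(base, cur))
```

With row-order concatenation, the first row of `run_b` (qps 2) would have been divided by the first row of `run_a` (qps 1); keying on the info columns avoids that mismatch.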

@@ -67,6 +152,15 @@ def split_json_by_tp_pp(

     df = pd.DataFrame(data)

+    # Keep only "serving" tests
+    name_col = next(
+        (c for c in ["Test name", "test_name", "Test Name"] if c in df.columns), None
+    )
+    if name_col:
+        df = df[
+            df[name_col].astype(str).str.contains(r"serving", case=False, na=False)
+        ].copy()
+
     # Handle alias column names
     rename_map = {
         "tp_size": "TP Size",
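Review note: the added filter picks the first test-name column that exists and keeps only rows whose name contains "serving", case-insensitively. A standard-library sketch of the same selection over a list of dicts is shown here; `keep_serving` is a hypothetical name, and treating "columns" as the union of record keys is an assumption made to mimic `df.columns`.

```python
import re

# Candidate column names, in the same priority order as the hunk.
NAME_CANDIDATES = ("Test name", "test_name", "Test Name")


def keep_serving(records):
    """Keep records whose test name mentions 'serving' (case-insensitive).

    If none of the candidate name columns is present, records pass through
    unchanged, matching the `if name_col:` guard in the diff.
    """
    columns = set().union(*(r.keys() for r in records)) if records else set()
    name_col = next((c for c in NAME_CANDIDATES if c in columns), None)
    if name_col is None:
        return list(records)
    pattern = re.compile(r"serving", re.IGNORECASE)
    return [r for r in records if pattern.search(str(r.get(name_col)))]


if __name__ == "__main__":
    rows = [
        {"Test name": "serving_llama8B_tp1"},
        {"Test name": "latency_llama8B_tp1"},
        {"Test name": "Serving_mixtral_tp2"},
    ]
    print([r["Test name"] for r in keep_serving(rows)])
```

One consequence worth keeping in mind: any non-serving test whose name happens to contain the substring "serving" would also be kept, since the match is a plain substring search.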
@@ -124,7 +218,7 @@ def split_json_by_tp_pp(
         "--xaxis",
         type=str,
         default="# of max concurrency.",
-        help="column name to use as X Axis in comparision graph",
+        help="column name to use as X Axis in comparison graph",
     )
     args = parser.parse_args()

@@ -181,16 +275,14 @@ def split_json_by_tp_pp(
                         f"Expected subset: {filtered_info_cols}, "
                         f"but DataFrame has: {list(output_df.columns)}"
                     )
-
                 output_df_sorted = output_df.sort_values(by=existing_group_cols)
                 output_groups = output_df_sorted.groupby(existing_group_cols, dropna=False)
                 for name, group in output_groups:
                     html = group.to_html()
                     text_file.write(html_msgs_for_data_cols[i])
                     text_file.write(html)

-                    if plot is True:
-                        import pandas as pd
+                    if plot and plotly_found:
                         import plotly.express as px

                         df = group[raw_data_cols]
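Review note: gating the plot on `plotly_found` makes plotly an optional dependency. One detail to check: per the `importlib` documentation, `find_spec` on a dotted name such as `"plotly.express"` imports the parent package first, so when `plotly` itself is absent the call is expected to raise `ModuleNotFoundError` rather than return `None`. A defensive variant could look like the sketch below; `module_available` is a hypothetical helper, not part of the PR.

```python
from importlib import util


def module_available(name: str) -> bool:
    """Report whether `name` can be imported, without raising.

    For dotted names, find_spec imports the parent package; if that import
    fails, ModuleNotFoundError is raised, which is treated here as "absent".
    ValueError covers modules whose __spec__ is unset.
    """
    try:
        return util.find_spec(name) is not None
    except (ModuleNotFoundError, ValueError):
        return False


if __name__ == "__main__":
    print(module_available("json"))
    print(module_available("plotly.express"))
```

Evaluating the check once at import time, as the diff does, keeps the per-group loop free of repeated lookups.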
@@ -382,7 +382,7 @@ run_genai_perf_tests() {
 client_command="genai-perf profile \
 -m $model \
 --service-kind openai \
---backend vllm \
+--backend "$backend" \
 --endpoint-type chat \
 --streaming \
 --url localhost:$port \