When I tried MTP with random input from vllm bench serve, I got the following failure.

vllm serve $MODEL -tp 4 --served-model-name qwen3-next --tokenizer-mode auto --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'

vllm bench serve --backend vllm --model $MODEL --served-model-name qwen3-next --endpoint /v1/completions --dataset-name random --random-input 2048 --random-output 1024 --max-concurrency 256 --num-prompt 256

(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654]   File "/home/scratch.vgimpelson_ent/vllm_qwen/vllm/config/__init__.py", line 3380, in pad_for_cudagraph
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654]     return self.compilation_config.bs_to_padded_graph_size[batch_size]
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654]            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654] IndexError: list index out of range
The problem is somewhere around here:
https://github.com/vllm-project/vllm/blob/main/vllm/v1/attention/backends/gdn_attn.py#L211-L215
if (self.use_full_cuda_graph and num_prefills == 0 and num_decodes == 0
        and num_spec_decodes <= self.decode_cudagraph_max_bs):
    num_total_tokens = self.vllm_config.pad_for_cudagraph(
        m.num_actual_tokens)
    batch_size = num_total_tokens // (self.num_spec + 1)
During the failure:

num_spec_decodes = 228
m.num_actual_tokens = 228 * (num_spec + 1) = 228 * 3 = 684
self.decode_cudagraph_max_bs = 512

The if passes because 228 <= 512, but self.vllm_config.pad_for_cudagraph fails because 228 * 3 = 684 is greater than cudagraph_max_bs, so the lookup in bs_to_padded_graph_size goes out of range.
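Below is a minimal, self-contained sketch of the mismatch. The pad_for_cudagraph helper and the capture-size list are simplified stand-ins (not vLLM's real implementation), and the guard at the end is only one possible direction for a fix, not an actual patch.

```python
# Stand-in reproduction of the guard/padding mismatch described above.

def pad_for_cudagraph(num_tokens: int, capture_sizes: list[int]) -> int:
    """Return the smallest captured graph size >= num_tokens.

    Raises IndexError when num_tokens exceeds the largest captured size,
    loosely mirroring the bs_to_padded_graph_size lookup that fails above.
    """
    for size in capture_sizes:
        if size >= num_tokens:
            return size
    raise IndexError("list index out of range")


num_spec = 2                   # num_speculative_tokens from the MTP config
num_spec_decodes = 228         # from the failing run
num_actual_tokens = num_spec_decodes * (num_spec + 1)   # 684
decode_cudagraph_max_bs = 512  # largest captured batch size

# Illustrative capture sizes: multiples of 8 up to 512 (stand-in only).
capture_sizes = list(range(8, 513, 8))

# The current guard compares only the number of speculative decodes,
# so it passes (228 <= 512)...
assert num_spec_decodes <= decode_cudagraph_max_bs

# ...but the padding is applied to the token count, which is 684 > 512:
try:
    pad_for_cudagraph(num_actual_tokens, capture_sizes)
except IndexError as exc:
    print("reproduces the failure:", exc)

# One possible direction for a fix (an assumption, not the actual patch):
# gate the full-cudagraph path on the token count instead of the decode count.
fits_cudagraph = num_spec_decodes * (num_spec + 1) <= decode_cudagraph_max_bs
print("full-cudagraph path taken:", fits_cudagraph)   # False here
```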
Originally posted by @vadiklyutiy in #24526 (comment)