Description
System Info / 系統信息
Deployed with vllm-ascend
vLLM version: 0.10.0
Server: 8 × Ascend 910B
Who can help? / 谁可以帮助到您?
No response
Information / 问题信息
- The official example scripts / 官方的示例脚本
- My own modified scripts / 我自己修改的脚本和任务
Reproduction / 复现过程
[root@worker-66 /]# vllm serve zai-org/GLM-4.5-Air --tensor-parallel-size 8 --served-model-name glm-4.5-air --reasoning-parser glm45
INFO 09-22 04:00:06 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 09-22 04:00:06 [init.py:40] - ascend -> vllm_ascend:register
INFO 09-22 04:00:06 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 09-22 04:00:06 [init.py:226] Platform plugin ascend is activated
WARNING 09-22 04:00:08 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 09-22 04:00:09 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 09-22 04:00:10 [registry.py:430] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 09-22 04:00:10 [registry.py:430] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 09-22 04:00:10 [registry.py:430] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 09-22 04:00:10 [registry.py:430] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 09-22 04:00:10 [registry.py:430] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 09-22 04:00:10 [registry.py:430] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 09-22 04:00:10 [registry.py:430] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
usage: vllm serve [model_tag] [options]
vllm serve: error: argument --reasoning-parser: invalid choice: 'glm45' (choose from 'deepseek_r1', 'glm4_moe', 'granite', 'hunyuan_a13b', 'mistral', 'qwen3')
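On this vLLM 0.10.0 build the GLM-4.5 reasoning parser is registered under the name `glm4_moe`; `glm45` is not among the CLI's choices (it appears only in later releases). The fix is simply to pass one of the names the error lists. As a minimal sketch of that mapping logic — using the choice list from the error above, with a hypothetical `resolve_reasoning_parser` helper and an alias table that assumes `glm45` corresponds to `glm4_moe` here:

```python
# Reasoning parsers accepted by `vllm serve` 0.10.0, as reported by the
# error message above; newer releases add further names such as "glm45".
SUPPORTED_PARSERS = {"deepseek_r1", "glm4_moe", "granite",
                     "hunyuan_a13b", "mistral", "qwen3"}

# Assumption: on this build, "glm45" maps onto the "glm4_moe" parser.
ALIASES = {"glm45": "glm4_moe"}

def resolve_reasoning_parser(requested: str) -> str:
    """Return a parser name this vLLM build accepts, or raise ValueError."""
    if requested in SUPPORTED_PARSERS:
        return requested
    alias = ALIASES.get(requested)
    if alias in SUPPORTED_PARSERS:
        return alias
    raise ValueError(
        f"invalid choice: {requested!r} "
        f"(choose from {sorted(SUPPORTED_PARSERS)})")

print(resolve_reasoning_parser("glm45"))  # falls back to "glm4_moe"
```

This is what the retry below does by hand: the second `vllm serve` invocation swaps `--reasoning-parser glm45` for `--reasoning-parser glm4_moe` and gets past argument parsing.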
[root@worker-66 /]# vllm serve /root/.cache/huggingface/hub/models--zai-org--GLM-4.5-Air/snapshots/a24ceef6ce4f3536971efe9b778bdaa1bab18daa --tensor-parallel-size 8 --served-model-name glm-4.5-air --reasoning-parser glm4_moe
INFO 09-22 04:44:26 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 09-22 04:44:26 [init.py:40] - ascend -> vllm_ascend:register
INFO 09-22 04:44:26 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 09-22 04:44:26 [init.py:226] Platform plugin ascend is activated
WARNING 09-22 04:44:28 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 09-22 04:44:29 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 09-22 04:44:30 [registry.py:430] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 09-22 04:44:30 [registry.py:430] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 09-22 04:44:30 [registry.py:430] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 09-22 04:44:30 [registry.py:430] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 09-22 04:44:30 [registry.py:430] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 09-22 04:44:30 [registry.py:430] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 09-22 04:44:30 [registry.py:430] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
INFO 09-22 04:44:31 [api_server.py:1755] vLLM API server version 0.10.0
INFO 09-22 04:44:31 [cli_args.py:261] non-default args: {'model_tag': '/root/.cache/huggingface/hub/models--zai-org--GLM-4.5-Air/snapshots/a24ceef6ce4f3536971efe9b778bdaa1bab18daa', 'model': '/root/.cache/huggingface/hub/models--zai-org--GLM-4.5-Air/snapshots/a24ceef6ce4f3536971efe9b778bdaa1bab18daa', 'served_model_name': ['glm-4.5-air'], 'reasoning_parser': 'glm4_moe', 'tensor_parallel_size': 8}
`torch_dtype` is deprecated! Use `dtype` instead!
INFO 09-22 04:44:44 [config.py:1604] Using max model len 131072
INFO 09-22 04:44:44 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 09-22 04:44:44 [ascend_config.py:180] ACL Graph is currently experimental. Please raise an issue on https://github.com/vllm-project/vllm-ascend/issues if you encourage any Error
INFO 09-22 04:44:44 [platform.py:162] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
INFO 09-22 04:44:44 [utils.py:333] Calculated maximum supported batch sizes for ACL graph: 20
INFO 09-22 04:44:44 [utils.py:348] Adjusted ACL graph batch sizes for Glm4MoeForCausalLM model (layers: 46): 67 → 20 sizes
INFO 09-22 04:44:55 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 09-22 04:44:55 [init.py:40] - ascend -> vllm_ascend:register
INFO 09-22 04:44:55 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 09-22 04:44:55 [init.py:226] Platform plugin ascend is activated
WARNING 09-22 04:44:57 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 09-22 04:44:57 [core.py:572] Waiting for init message from front-end.
INFO 09-22 04:44:57 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 09-22 04:44:58 [registry.py:430] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 09-22 04:44:58 [registry.py:430] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 09-22 04:44:58 [registry.py:430] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 09-22 04:44:58 [registry.py:430] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 09-22 04:44:58 [registry.py:430] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 09-22 04:44:58 [registry.py:430] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 09-22 04:44:58 [registry.py:430] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
INFO 09-22 04:44:58 [core.py:71] Initializing a V1 LLM engine (v0.10.0) with config: model='/root/.cache/huggingface/hub/models--zai-org--GLM-4.5-Air/snapshots/a24ceef6ce4f3536971efe9b778bdaa1bab18daa', speculative_config=None, tokenizer='/root/.cache/huggingface/hub/models--zai-org--GLM-4.5-Air/snapshots/a24ceef6ce4f3536971efe9b778bdaa1bab18daa', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='glm4_moe'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=glm-4.5-air, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.unified_ascend_attention_with_output"],"use_inductor":false,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,488,456,432,400,376,344,320,288,264,232,208,176,152,120,96,64,40,8,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 09-22 04:44:58 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 192 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 09-22 04:44:58 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3, 4, 5, 6, 7], buffer_handle=(8, 16777216, 10, 'psm_80f8a16d'), local_subscribe_addr='ipc:///tmp/4666543b-0cb7-4d72-80ce-846db12c6214', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-22 04:45:09 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 09-22 04:45:09 [init.py:40] - ascend -> vllm_ascend:register
INFO 09-22 04:45:09 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 09-22 04:45:09 [init.py:226] Platform plugin ascend is activated
WARNING 09-22 04:45:11 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 09-22 04:45:11 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
[identical plugin-activation, vllm._C, and Triton messages repeated once per tensor-parallel worker; duplicates trimmed]
WARNING 09-22 04:45:12 [registry.py:430] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 09-22 04:45:12 [registry.py:430] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 09-22 04:45:12 [registry.py:430] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 09-22 04:45:12 [registry.py:430] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 09-22 04:45:12 [registry.py:430] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 09-22 04:45:12 [registry.py:430] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 09-22 04:45:12 [registry.py:430] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
[the seven registry-override warnings above repeated once per tensor-parallel worker; duplicates trimmed]
INFO 09-22 04:45:27 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 09-22 04:45:27 [init.py:40] - ascend -> vllm_ascend:register
INFO 09-22 04:45:27 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 09-22 04:45:27 [init.py:226] Platform plugin ascend is activated
[same plugin-activation block repeated once per tensor-parallel worker; duplicates trimmed]
WARNING 09-22 04:45:29 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
[same vllm._C import warning repeated by the remaining worker processes; duplicates trimmed]
(VllmWorker rank=4 pid=10312) INFO 09-22 04:45:32 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_f0618853'), local_subscribe_addr='ipc:///tmp/11e3c1f5-92c0-4cda-b7ea-5566d3a166c0', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=6 pid=10314) INFO 09-22 04:45:32 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_c5104355'), local_subscribe_addr='ipc:///tmp/c62dc5db-51c9-4682-9136-0022623ea163', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=2 pid=10310) INFO 09-22 04:45:32 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_504ea940'), local_subscribe_addr='ipc:///tmp/81ce0054-d9fc-4543-8e25-b437c9740fbf', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=7 pid=10315) INFO 09-22 04:45:32 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_af2d9d63'), local_subscribe_addr='ipc:///tmp/1186414c-324f-4967-818f-7e180ff7fce7', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=10308) INFO 09-22 04:45:33 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_dcb78c9a'), local_subscribe_addr='ipc:///tmp/335d767b-b1b6-4f2d-81bd-b064c0f8f45e', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=5 pid=10313) INFO 09-22 04:45:33 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_e1b30b16'), local_subscribe_addr='ipc:///tmp/eaaf7ffd-0888-4830-87b7-080872373a57', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=10309) INFO 09-22 04:45:33 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_89b28167'), local_subscribe_addr='ipc:///tmp/64218be8-6bfd-4fd0-893a-8b33e3477c00', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=3 pid=10311) INFO 09-22 04:45:34 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_df431005'), local_subscribe_addr='ipc:///tmp/c94bc795-e158-44e7-bc90-ba4f87caefa1', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=10308) INFO 09-22 04:45:35 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_25e092f2'), local_subscribe_addr='ipc:///tmp/86f32c61-d0f6-4c47-8b6d-243d87195349', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=10308) INFO 09-22 04:45:35 [parallel_state.py:1102] rank 0 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker rank=1 pid=10309) INFO 09-22 04:45:35 [parallel_state.py:1102] rank 1 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker rank=3 pid=10311) INFO 09-22 04:45:35 [parallel_state.py:1102] rank 3 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
(VllmWorker rank=2 pid=10310) INFO 09-22 04:45:35 [parallel_state.py:1102] rank 2 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
(VllmWorker rank=7 pid=10315) INFO 09-22 04:45:35 [parallel_state.py:1102] rank 7 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 7, EP rank 7
(VllmWorker rank=5 pid=10313) INFO 09-22 04:45:35 [parallel_state.py:1102] rank 5 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 5, EP rank 5
(VllmWorker rank=6 pid=10314) INFO 09-22 04:45:35 [parallel_state.py:1102] rank 6 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 6, EP rank 6
(VllmWorker rank=4 pid=10312) INFO 09-22 04:45:35 [parallel_state.py:1102] rank 4 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 4, EP rank 4
(VllmWorker rank=5 pid=10313) INFO 09-22 04:45:36 [model_runner_v1.py:2084] Starting to load model /root/.cache/huggingface/hub/models--zai-org--GLM-4.5-Air/snapshots/a24ceef6ce4f3536971efe9b778bdaa1bab18daa...
(VllmWorker rank=1 pid=10309) INFO 09-22 04:45:36 [model_runner_v1.py:2084] Starting to load model /root/.cache/huggingface/hub/models--zai-org--GLM-4.5-Air/snapshots/a24ceef6ce4f3536971efe9b778bdaa1bab18daa...
(VllmWorker rank=3 pid=10311) INFO 09-22 04:45:36 [model_runner_v1.py:2084] Starting to load model /root/.cache/huggingface/hub/models--zai-org--GLM-4.5-Air/snapshots/a24ceef6ce4f3536971efe9b778bdaa1bab18daa...
(VllmWorker rank=7 pid=10315) INFO 09-22 04:45:36 [model_runner_v1.py:2084] Starting to load model /root/.cache/huggingface/hub/models--zai-org--GLM-4.5-Air/snapshots/a24ceef6ce4f3536971efe9b778bdaa1bab18daa...
(VllmWorker rank=4 pid=10312) INFO 09-22 04:45:36 [model_runner_v1.py:2084] Starting to load model /root/.cache/huggingface/hub/models--zai-org--GLM-4.5-Air/snapshots/a24ceef6ce4f3536971efe9b778bdaa1bab18daa...
(VllmWorker rank=6 pid=10314) INFO 09-22 04:45:36 [model_runner_v1.py:2084] Starting to load model /root/.cache/huggingface/hub/models--zai-org--GLM-4.5-Air/snapshots/a24ceef6ce4f3536971efe9b778bdaa1bab18daa...
(VllmWorker rank=2 pid=10310) INFO 09-22 04:45:36 [model_runner_v1.py:2084] Starting to load model /root/.cache/huggingface/hub/models--zai-org--GLM-4.5-Air/snapshots/a24ceef6ce4f3536971efe9b778bdaa1bab18daa...
(VllmWorker rank=0 pid=10308) INFO 09-22 04:45:37 [model_runner_v1.py:2084] Starting to load model /root/.cache/huggingface/hub/models--zai-org--GLM-4.5-Air/snapshots/a24ceef6ce4f3536971efe9b778bdaa1bab18daa...
Loading safetensors checkpoint shards: 0% Completed | 0/47 [00:00<?, ?it/s]
(VllmWorker rank=1 pid=10309) INFO 09-22 04:46:09 [default_loader.py:262] Loading weights took 31.39 seconds
Loading safetensors checkpoint shards: 100% Completed | 47/47 [00:30<00:00, 1.69it/s]
Loading safetensors checkpoint shards: 100% Completed | 47/47 [00:30<00:00, 1.52it/s]
(VllmWorker rank=0 pid=10308)
(VllmWorker rank=0 pid=10308) INFO 09-22 04:46:09 [default_loader.py:262] Loading weights took 31.04 seconds
(VllmWorker rank=1 pid=10309) INFO 09-22 04:46:10 [model_runner_v1.py:2114] Loading model weights took 25.0148 GB
(VllmWorker rank=4 pid=10312) INFO 09-22 04:46:10 [default_loader.py:262] Loading weights took 32.44 seconds
(VllmWorker rank=5 pid=10313) INFO 09-22 04:46:10 [default_loader.py:262] Loading weights took 32.92 seconds
(VllmWorker rank=0 pid=10308) INFO 09-22 04:46:10 [model_runner_v1.py:2114] Loading model weights took 25.0148 GB
(VllmWorker rank=2 pid=10310) INFO 09-22 04:46:10 [default_loader.py:262] Loading weights took 32.82 seconds
(VllmWorker rank=6 pid=10314) INFO 09-22 04:46:10 [default_loader.py:262] Loading weights took 32.89 seconds
(VllmWorker rank=7 pid=10315) INFO 09-22 04:46:10 [default_loader.py:262] Loading weights took 33.00 seconds
(VllmWorker rank=3 pid=10311) INFO 09-22 04:46:10 [default_loader.py:262] Loading weights took 32.97 seconds
(VllmWorker rank=5 pid=10313) INFO 09-22 04:46:11 [model_runner_v1.py:2114] Loading model weights took 25.0148 GB
(VllmWorker rank=4 pid=10312) INFO 09-22 04:46:11 [model_runner_v1.py:2114] Loading model weights took 25.0148 GB
(VllmWorker rank=2 pid=10310) INFO 09-22 04:46:11 [model_runner_v1.py:2114] Loading model weights took 25.0148 GB
(VllmWorker rank=6 pid=10314) INFO 09-22 04:46:11 [model_runner_v1.py:2114] Loading model weights took 25.0148 GB
(VllmWorker rank=7 pid=10315) INFO 09-22 04:46:11 [model_runner_v1.py:2114] Loading model weights took 25.0148 GB
(VllmWorker rank=3 pid=10311) INFO 09-22 04:46:11 [model_runner_v1.py:2114] Loading model weights took 25.0148 GB
(VllmWorker rank=5 pid=10313) INFO 09-22 04:46:28 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/b2fffa9f74/rank_5_0/backbone for vLLM's torch.compile
(VllmWorker rank=5 pid=10313) INFO 09-22 04:46:28 [backends.py:541] Dynamo bytecode transform time: 15.82 s
(VllmWorker rank=3 pid=10311) INFO 09-22 04:46:28 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/b2fffa9f74/rank_3_0/backbone for vLLM's torch.compile
(VllmWorker rank=3 pid=10311) INFO 09-22 04:46:28 [backends.py:541] Dynamo bytecode transform time: 15.94 s
(VllmWorker rank=6 pid=10314) INFO 09-22 04:46:28 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/b2fffa9f74/rank_6_0/backbone for vLLM's torch.compile
(VllmWorker rank=6 pid=10314) INFO 09-22 04:46:28 [backends.py:541] Dynamo bytecode transform time: 16.17 s
(VllmWorker rank=7 pid=10315) INFO 09-22 04:46:28 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/b2fffa9f74/rank_7_0/backbone for vLLM's torch.compile
(VllmWorker rank=7 pid=10315) INFO 09-22 04:46:28 [backends.py:541] Dynamo bytecode transform time: 16.22 s
(VllmWorker rank=4 pid=10312) INFO 09-22 04:46:28 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/b2fffa9f74/rank_4_0/backbone for vLLM's torch.compile
(VllmWorker rank=4 pid=10312) INFO 09-22 04:46:28 [backends.py:541] Dynamo bytecode transform time: 16.25 s
(VllmWorker rank=2 pid=10310) INFO 09-22 04:46:28 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/b2fffa9f74/rank_2_0/backbone for vLLM's torch.compile
(VllmWorker rank=2 pid=10310) INFO 09-22 04:46:28 [backends.py:541] Dynamo bytecode transform time: 16.58 s
(VllmWorker rank=1 pid=10309) INFO 09-22 04:46:29 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/b2fffa9f74/rank_1_0/backbone for vLLM's torch.compile
(VllmWorker rank=1 pid=10309) INFO 09-22 04:46:29 [backends.py:541] Dynamo bytecode transform time: 16.88 s
(VllmWorker rank=0 pid=10308) INFO 09-22 04:46:29 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/b2fffa9f74/rank_0_0/backbone for vLLM's torch.compile
(VllmWorker rank=0 pid=10308) INFO 09-22 04:46:29 [backends.py:541] Dynamo bytecode transform time: 16.99 s
(VllmWorker rank=5 pid=10313) INFO 09-22 04:46:33 [backends.py:215] Compiling a graph for dynamic shape takes 4.35 s
(VllmWorker rank=4 pid=10312) INFO 09-22 04:46:33 [backends.py:215] Compiling a graph for dynamic shape takes 4.33 s
(VllmWorker rank=3 pid=10311) INFO 09-22 04:46:34 [backends.py:215] Compiling a graph for dynamic shape takes 4.54 s
(VllmWorker rank=7 pid=10315) INFO 09-22 04:46:34 [backends.py:215] Compiling a graph for dynamic shape takes 4.29 s
(VllmWorker rank=6 pid=10314) INFO 09-22 04:46:34 [backends.py:215] Compiling a graph for dynamic shape takes 4.53 s
(VllmWorker rank=0 pid=10308) INFO 09-22 04:46:34 [backends.py:215] Compiling a graph for dynamic shape takes 4.40 s
(VllmWorker rank=1 pid=10309) INFO 09-22 04:46:34 [backends.py:215] Compiling a graph for dynamic shape takes 4.74 s
(VllmWorker rank=2 pid=10310) INFO 09-22 04:46:34 [backends.py:215] Compiling a graph for dynamic shape takes 4.70 s
[rank0]:[W922 04:46:44.560646000 compiler_depend.ts:28] Warning: The oprator of MoeInitRouting will be removed from Pytorch and switch to AscendSpeed after 630. (function operator())
(VllmWorker rank=0 pid=10308) INFO 09-22 04:46:44 [monitor.py:34] torch.compile takes 21.39 s in total
[rank1]:[W922 04:46:45.297569870 compiler_depend.ts:28] Warning: The oprator of MoeInitRouting will be removed from Pytorch and switch to AscendSpeed after 630. (function operator())
[rank5]:[W922 04:46:45.401330030 compiler_depend.ts:28] Warning: The oprator of MoeInitRouting will be removed from Pytorch and switch to AscendSpeed after 630. (function operator())
[rank7]:[W922 04:46:45.408681270 compiler_depend.ts:28] Warning: The oprator of MoeInitRouting will be removed from Pytorch and switch to AscendSpeed after 630. (function operator())
(VllmWorker rank=1 pid=10309) INFO 09-22 04:46:45 [monitor.py:34] torch.compile takes 21.62 s in total
[rank6]:[W922 04:46:45.478479280 compiler_depend.ts:28] Warning: The oprator of MoeInitRouting will be removed from Pytorch and switch to AscendSpeed after 630. (function operator())
(VllmWorker rank=5 pid=10313) INFO 09-22 04:46:45 [monitor.py:34] torch.compile takes 20.17 s in total
(VllmWorker rank=7 pid=10315) INFO 09-22 04:46:45 [monitor.py:34] torch.compile takes 20.51 s in total
[rank3]:[W922 04:46:45.573585260 compiler_depend.ts:28] Warning: The oprator of MoeInitRouting will be removed from Pytorch and switch to AscendSpeed after 630. (function operator())
(VllmWorker rank=6 pid=10314) INFO 09-22 04:46:45 [monitor.py:34] torch.compile takes 20.70 s in total
[rank4]:[W922 04:46:45.641746520 compiler_depend.ts:28] Warning: The oprator of MoeInitRouting will be removed from Pytorch and switch to AscendSpeed after 630. (function operator())
[rank2]:[W922 04:46:45.656687320 compiler_depend.ts:28] Warning: The oprator of MoeInitRouting will be removed from Pytorch and switch to AscendSpeed after 630. (function operator())
(VllmWorker rank=3 pid=10311) INFO 09-22 04:46:45 [monitor.py:34] torch.compile takes 20.48 s in total
(VllmWorker rank=2 pid=10310) INFO 09-22 04:46:45 [monitor.py:34] torch.compile takes 21.28 s in total
(VllmWorker rank=4 pid=10312) INFO 09-22 04:46:45 [monitor.py:34] torch.compile takes 20.58 s in total
(VllmWorker rank=0 pid=10308) INFO 09-22 04:46:46 [worker_v1.py:186] Available memory: 29748316160, total memory: 65452113920
(VllmWorker rank=5 pid=10313) INFO 09-22 04:46:46 [worker_v1.py:186] Available memory: 30212474880, total memory: 65452113920
(VllmWorker rank=1 pid=10309) INFO 09-22 04:46:46 [worker_v1.py:186] Available memory: 30223816704, total memory: 65452113920
(VllmWorker rank=6 pid=10314) INFO 09-22 04:46:46 [worker_v1.py:186] Available memory: 30223804416, total memory: 65452113920
(VllmWorker rank=3 pid=10311) INFO 09-22 04:46:46 [worker_v1.py:186] Available memory: 30221002752, total memory: 65452113920
(VllmWorker rank=2 pid=10310) INFO 09-22 04:46:46 [worker_v1.py:186] Available memory: 30223259648, total memory: 65452113920
(VllmWorker rank=7 pid=10315) INFO 09-22 04:46:47 [worker_v1.py:186] Available memory: 30221457408, total memory: 65452113920
(VllmWorker rank=4 pid=10312) INFO 09-22 04:46:47 [worker_v1.py:186] Available memory: 30222772224, total memory: 65452113920
INFO 09-22 04:46:47 [kv_cache_utils.py:833] GPU KV cache size: 1,262,976 tokens
INFO 09-22 04:46:47 [kv_cache_utils.py:837] Maximum concurrency for 131,072 tokens per request: 9.64x
INFO 09-22 04:46:47 [kv_cache_utils.py:833] GPU KV cache size: 1,283,200 tokens
INFO 09-22 04:46:47 [kv_cache_utils.py:837] Maximum concurrency for 131,072 tokens per request: 9.79x
INFO 09-22 04:46:47 [kv_cache_utils.py:833] GPU KV cache size: 1,283,200 tokens
INFO 09-22 04:46:47 [kv_cache_utils.py:837] Maximum concurrency for 131,072 tokens per request: 9.79x
INFO 09-22 04:46:47 [kv_cache_utils.py:833] GPU KV cache size: 1,283,072 tokens
INFO 09-22 04:46:47 [kv_cache_utils.py:837] Maximum concurrency for 131,072 tokens per request: 9.79x
INFO 09-22 04:46:47 [kv_cache_utils.py:833] GPU KV cache size: 1,283,200 tokens
INFO 09-22 04:46:47 [kv_cache_utils.py:837] Maximum concurrency for 131,072 tokens per request: 9.79x
INFO 09-22 04:46:47 [kv_cache_utils.py:833] GPU KV cache size: 1,282,688 tokens
INFO 09-22 04:46:47 [kv_cache_utils.py:837] Maximum concurrency for 131,072 tokens per request: 9.79x
INFO 09-22 04:46:47 [kv_cache_utils.py:833] GPU KV cache size: 1,283,200 tokens
INFO 09-22 04:46:47 [kv_cache_utils.py:837] Maximum concurrency for 131,072 tokens per request: 9.79x
INFO 09-22 04:46:47 [kv_cache_utils.py:833] GPU KV cache size: 1,283,072 tokens
INFO 09-22 04:46:47 [kv_cache_utils.py:837] Maximum concurrency for 131,072 tokens per request: 9.79x
[rank0]:[W922 04:46:47.035165940 compiler_depend.ts:149] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank7]:[W922 04:46:47.063937100 compiler_depend.ts:149] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank3]:[W922 04:46:47.066175410 compiler_depend.ts:149] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank6]:[W922 04:46:47.066412170 compiler_depend.ts:149] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank5]:[W922 04:46:47.070995730 compiler_depend.ts:149] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank1]:[W922 04:46:47.092215690 compiler_depend.ts:149] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank2]:[W922 04:46:47.144598450 compiler_depend.ts:149] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank4]:[W922 04:46:48.276427550 compiler_depend.ts:149] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
(VllmWorker rank=5 pid=10313) INFO 09-22 04:47:12 [model_runner_v1.py:2439] Graph capturing finished in 25 secs, took 0.62 GiB
(VllmWorker rank=0 pid=10308) INFO 09-22 04:47:12 [model_runner_v1.py:2439] Graph capturing finished in 25 secs, took 0.62 GiB
(VllmWorker rank=6 pid=10314) INFO 09-22 04:47:12 [model_runner_v1.py:2439] Graph capturing finished in 25 secs, took 0.62 GiB
(VllmWorker rank=3 pid=10311) INFO 09-22 04:47:12 [model_runner_v1.py:2439] Graph capturing finished in 25 secs, took 0.62 GiB
(VllmWorker rank=1 pid=10309) INFO 09-22 04:47:12 [model_runner_v1.py:2439] Graph capturing finished in 25 secs, took 0.62 GiB
(VllmWorker rank=2 pid=10310) INFO 09-22 04:47:12 [model_runner_v1.py:2439] Graph capturing finished in 25 secs, took 0.62 GiB
(VllmWorker rank=7 pid=10315) INFO 09-22 04:47:12 [model_runner_v1.py:2439] Graph capturing finished in 25 secs, took 0.62 GiB
(VllmWorker rank=4 pid=10312) INFO 09-22 04:47:12 [model_runner_v1.py:2439] Graph capturing finished in 26 secs, took 0.62 GiB
INFO 09-22 04:47:12 [core.py:193] init engine (profile, create kv cache, warmup model) took 60.92 seconds
WARNING 09-22 04:47:13 [ascend_config.py:180] ACL Graph is currently experimental. Please raise an issue on https://github.com/vllm-project/vllm-ascend/issues if you encourage any Error
INFO 09-22 04:47:13 [platform.py:162] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
INFO 09-22 04:47:13 [utils.py:333] Calculated maximum supported batch sizes for ACL graph: 20
INFO 09-22 04:47:13 [utils.py:359] No adjustment needed for ACL graph batch sizes: Glm4MoeForCausalLM model (layers: 46) with 20 sizes
INFO 09-22 04:47:13 [loggers.py:141] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 9867
INFO 09-22 04:47:14 [api_server.py:1818] Starting vLLM API server 0 on http://0.0.0.0:8000
INFO 09-22 04:47:14 [launcher.py:29] Available routes are:
INFO 09-22 04:47:14 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
INFO 09-22 04:47:14 [launcher.py:37] Route: /docs, Methods: HEAD, GET
INFO 09-22 04:47:14 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 09-22 04:47:14 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
INFO 09-22 04:47:14 [launcher.py:37] Route: /health, Methods: GET
INFO 09-22 04:47:14 [launcher.py:37] Route: /load, Methods: GET
INFO 09-22 04:47:14 [launcher.py:37] Route: /ping, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /ping, Methods: GET
INFO 09-22 04:47:14 [launcher.py:37] Route: /tokenize, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /detokenize, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /v1/models, Methods: GET
INFO 09-22 04:47:14 [launcher.py:37] Route: /version, Methods: GET
INFO 09-22 04:47:14 [launcher.py:37] Route: /v1/responses, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /v1/responses/{response_id}, Methods: GET
INFO 09-22 04:47:14 [launcher.py:37] Route: /v1/responses/{response_id}/cancel, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /v1/completions, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /v1/embeddings, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /pooling, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /classify, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /score, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /v1/score, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /v1/audio/translations, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /rerank, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /v1/rerank, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /v2/rerank, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /scale_elastic_ep, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /is_scaling_elastic_ep, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /invocations, Methods: POST
INFO 09-22 04:47:14 [launcher.py:37] Route: /metrics, Methods: GET
INFO: Started server process [9902]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO 09-22 04:47:23 [chat_utils.py:473] Detected the chat template content format to be 'openai'. You can set --chat-template-content-format to override this.
INFO 09-22 04:47:23 [logger.py:41] Received request chatcmpl-c5cab288bbde422b8e229be32627d186: prompt: '[gMASK]<|user|>\n详细解释一下你自己<|assistant|>', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.8, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None.
INFO 09-22 04:47:23 [async_llm.py:269] Added request chatcmpl-c5cab288bbde422b8e229be32627d186.
INFO 09-22 04:47:24 [loggers.py:122] Engine 000: Avg prompt throughput: 0.8 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 09-22 04:47:34 [loggers.py:122] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
INFO 09-22 04:47:44 [loggers.py:122] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
INFO: 127.0.0.1:40658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 09-22 04:47:54 [loggers.py:122] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 14.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 09-22 04:48:04 [loggers.py:122] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 09-22 04:48:05 [logger.py:41] Received request chatcmpl-0d88f98a7d694c82b0dcf498164167e4: prompt: '[gMASK]<|user|>\n详细解释一下你自己<|assistant|>', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None.
INFO 09-22 04:48:05 [async_llm.py:269] Added request chatcmpl-0d88f98a7d694c82b0dcf498164167e4.
INFO 09-22 04:48:14 [loggers.py:122] Engine 000: Avg prompt throughput: 0.9 tokens/s, Avg generation throughput: 38.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
INFO 09-22 04:48:24 [loggers.py:122] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
INFO: 127.0.0.1:51960 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 09-22 04:48:34 [loggers.py:122] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 09-22 04:48:44 [loggers.py:122] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 09-22 04:48:49 [logger.py:41] Received request chatcmpl-a73150347c354121b1428859355679f5: prompt: '[gMASK]<|user|>\n详细解释一下你自己<|assistant|>', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None.
INFO 09-22 04:48:49 [async_llm.py:269] Added request chatcmpl-a73150347c354121b1428859355679f5.
INFO 09-22 04:48:54 [loggers.py:122] Engine 000: Avg prompt throughput: 0.9 tokens/s, Avg generation throughput: 20.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 09-22 04:49:04 [loggers.py:122] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
INFO: 127.0.0.1:48570 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 09-22 04:49:14 [loggers.py:122] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 39.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 09-22 04:49:24 [loggers.py:122] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Expected behavior / 期待表现
Running vllm serve zai-org/GLM-4.5-Air --tensor-parallel-size 8 --served-model-name glm-4.5-air --reasoning-parser glm45 brings the service up normally (the garbled output appears regardless of whether --reasoning-parser glm45 is passed), but calling the service through the OpenAI-compatible API produces broken responses.
The client code is as follows:
from openai import OpenAI

# Configure the client to point at the vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM server address + port
    api_key="EMPTY"  # vLLM requires no API key by default; any value works
)

# Call the chat completions endpoint
def chat_completion_demo():
    response = client.chat.completions.create(
        model="glm-4.5-air",  # must match the --served-model-name used at deploy time
        messages=[
            {"role": "user", "content": "详细解释一下你自己"}
        ],
        temperature=0.7,
        max_tokens=1024
    )
    print("聊天接口响应:")
    print(response.choices[0].message.content)

if __name__ == "__main__":
    chat_completion_demo()
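For reference, the same request can be reproduced without the OpenAI Python SDK. This is a minimal sketch using curl, assuming the server is listening on localhost:8000 as in the logs above; the endpoint and payload mirror the script:

```shell
# Hypothetical curl equivalent of the Python client above;
# requires the vLLM server from this report to be running.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "glm-4.5-air",
        "messages": [{"role": "user", "content": "详细解释一下你自己"}],
        "temperature": 0.7,
        "max_tokens": 1024
      }'
```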
The response is as follows:
(base) [root@worker-66 ]# python test_vllm.py
聊天接口响应:
用户需要持续了解和评价模型效果,持续优化模型效果、模型迭代优化模型,实现持续优化,不断进步和优化模型。持续优化模型性能,提高模型性能,提升模型能力和效果。</think>我详细解释一下我的情况如下:
我是一言以蔽的GLM(General-Purpose Language Model,通用目的语言模型),由智谱AI开发的开源大型语言模型。我能够处理多种语言,包括中文、支持理解各种语境下的语言任务。我的知识来源于多种来源,包括大规模语料库和开放知识库,经过大规模训练。我的能力主要体现在自然语言处理、内容创作内容创作方面。我能够理解和生成各种格式的文本内容,包括技术文档、技术规范、技术文档、专业标准等。我支持多种语言,支持多种自然语言处理语言模型,包括中文、英文、英汉双语文言模型型,支持中文理解和理解任务。我支持的任务类型非常广泛,支持软件、硬件安装、部署、部署方式、方式安装部署部署、部署部署、部署包、部署包、装载软件、载软件配置。
我支持的部署平台广泛,覆盖Windows、Linux、操作系统、类操作系统,类操作系统类型、系统类型、类类型、类型类型,类型支持支持范围广泛,范围部署部署范围,部署部署部署部署部署部署部署类部署部署部署部署部署部署部署类部署部署部署,部署部署,部署部署类,类型类型类型类型类型,类类型类型类型类型技术术类型类型技术性技术技术技术类技术类专业类型类型类型类型类型类型类型类型类型类型类型类型类技术技术技术类型类型技术类型类型和类型类型类型类型类型类型类型技术类型类型类型类型技术类型类型类型技术类型技术技术类型技术类型类型类型类型类型类型类型类型类型类型类型类型类型技术类型概念类型类型类型类型类型类型类型类类型技术类型技术类型类型类型类型类型类型类型类型类型技术类型类型类型类类型类类型类型类型类型、类型类型类型类型类型类型类型类型类技术类技术范围类技术类型范围类型类型类型等。总结一下,总结一下、总结、总结性总结性、总结,总结、总结,总结、总结,等等、技技、技术、技术、技术、技术、技术、技术、技术、技术、技术、技术、技术、技术、技术,、类型、类型、类型、类型、类型、类型、类型、类型、类型、类型类型、类型、类型、类型、类型、类型、类型、类型、类型、类型、类型、类型、类型、类型、技术、数据、数据、数据、数据,数据,数据,数据,数据、数据,数据,技术、技术、技术、数据、技术、技术和数据、技术模型, model, model, data、 data, data, model, model, model, model, model, model、 model模型, model、技术,技术, technology, technology, Technology, Technology, Software, Software, Software, Software, Software, Software, Software, Software, Software, Software, Software, Software, Software, Software, Software, Software, Software, Software、模型, model, model, model, 模�, Model, Model, model, model, class, class, model, model, model, class, class, model, Model, Class, Model, Model, Model, Model, model, Model, Model, Model, Model, Model, Model, Model, Model, Model, Model, Model, Model, Model, Model, Model, Model, Model, Model, Model