
[Bug]: Sparse Attention cannot be enabled on vLLM-Ascend due to NPUWorker initialization patching failure #474

@F00L42

Description

Your current environment

The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.5.1
Is debug build: False

OS: openEuler 22.03 (LTS-SP4) (aarch64)
GCC version: (GCC) 10.3.1
Clang version: Could not collect
CMake version: version 4.0.3
Libc version: glibc-2.34

Python version: 3.10.17 (main, May  8 2025, 08:13:48) [GCC 10.3.1] (64-bit runtime)
Python platform: Linux-5.10.0-60.18.0.50.h1002.eulerosv2r11.aarch64-aarch64-with-glibc2.34

CPU:
Architecture:                    aarch64
CPU op-mode(s):                  64-bit
Byte Order:                      Little Endian
CPU(s):                          256
On-line CPU(s) list:             0-255
Vendor ID:                       HiSilicon
BIOS Vendor ID:                  HiSilicon
Model name:                      Kunpeng-920
BIOS Model name:                 HUAWEI Kunpeng 920 7265
Model:                           0
Thread(s) per core:              1
Core(s) per socket:              64
Socket(s):                       4
Stepping:                        0x1
Frequency boost:                 disabled
CPU max MHz:                     3000.0000
CPU min MHz:                     200.0000
BogoMIPS:                        200.00
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache:                       16 MiB (256 instances)
L1i cache:                       16 MiB (256 instances)
L2 cache:                        128 MiB (256 instances)
L3 cache:                        256 MiB (8 instances)
NUMA node(s):                    8
NUMA node0 CPU(s):               0-31
NUMA node1 CPU(s):               32-63
NUMA node2 CPU(s):               64-95
NUMA node3 CPU(s):               96-127
NUMA node4 CPU(s):               128-159
NUMA node5 CPU(s):               160-191
NUMA node6 CPU(s):               192-223
NUMA node7 CPU(s):               224-255
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==27.0.0
[pip3] torch==2.5.1
[pip3] torch-npu==2.5.1.post1.dev20250619
[pip3] torchvision==0.20.1
[pip3] transformers==4.52.4
[conda] Could not collect
vLLM Version: 0.9.2
vLLM Ascend Version: 0.9.2rc1

ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_TILING_SIZE=10240
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_GLOBAL=0
ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_OPSRUNNER_KERNEL_CACHE_TYPE=3
ATB_RUNNER_POOL_SIZE=64
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_LAUNCH_KERNEL_WITH_TILING=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ATB_SHARE_MEMORY_NAME_SUFFIX=
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1


NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 25.0.rc1                 Version: 25.0.rc1                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B4               | OK            | 86.0        49                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          29017/ 32768         |
+===========================+===============+====================================================+
| 1     910B4               | OK            | 86.5        49                0    / 0             |
| 0                         | 0000:C2:00.0  | 0           0    / 0          29104/ 32768         |
+===========================+===============+====================================================+
| 2     910B4               | OK            | 86.7        49                0    / 0             |
| 0                         | 0000:81:00.0  | 0           0    / 0          29102/ 32768         |
+===========================+===============+====================================================+
| 3     910B4               | OK            | 92.8        49                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          29102/ 32768         |
+===========================+===============+====================================================+
| 4     910B4               | OK            | 90.9        51                0    / 0             |
| 0                         | 0000:01:00.0  | 0           0    / 0          2867 / 32768         |
+===========================+===============+====================================================+
| 5     910B4               | OK            | 83.0        52                0    / 0             |
| 0                         | 0000:02:00.0  | 0           0    / 0          2864 / 32768         |
+===========================+===============+====================================================+
| 6     910B4               | OK            | 93.8        54                0    / 0             |
| 0                         | 0000:41:00.0  | 0           0    / 0          2862 / 32768         |
+===========================+===============+====================================================+
| 7     910B4               | OK            | 86.7        55                0    / 0             |
| 0                         | 0000:42:00.0  | 0           0    / 0          2862 / 32768         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| 0       0                 | 1926451       |                          | 26033                   |
+===========================+===============+====================================================+
| 1       0                 | 1926453       |                          | 26289                   |
+===========================+===============+====================================================+
| 2       0                 | 1926455       |                          | 26289                   |
+===========================+===============+====================================================+
| 3       0                 | 1926457       |                          | 26289                   |
+===========================+===============+====================================================+
| No running processes found in NPU 4                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 5                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 6                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+

CANN:
package_name=Ascend-cann-toolkit
version=8.1.RC1
innerversion=V100R001C21SPC001B238
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21]
arch=aarch64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.1.RC1/aarch64-linux

Ascend-vLLM container version: quay.io/ascend/vllm-ascend:v0.9.2rc1-openeuler
UCM version: v0.1.0

🐛 Describe the bug

What the bug is

I was trying to enable the GSA implementation of the Sparse Attention feature, following the official documentation.
My attempt was as follows:

Bash command line:

```bash
export ENABLE_SPARSE=TRUE
vllm serve /opt/Qwen3-32B/ \
--enforce-eager \
--max-model-len 131000 \
--tensor-parallel-size 4 \
--gpu_memory_utilization 0.87 \
--port 8225 \
--block-size 128 \
--no-enable-prefix-caching \
--kv-transfer-config \
'{
    "kv_connector": "UCMConnector",
    "kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "UCM_CONFIG_FILE": "/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml"
    }
}'
```
UCM_CONFIG_FILE:

```yaml
ucm_connectors:
  - ucm_connector_name: "UcmNfsStore"
    ucm_connector_config:
      storage_backends: "/vllm-workspace/kv_storage"
      use_direct: false

load_only_first_rank: false

# Enable UCM metrics so they can be monitored online via Grafana and Prometheus.
# metrics_config_path: "/workspace/unified-cache-management/examples/metrics/metrics_configs.yaml"

# Sparse attention configuration
# Format 1: Dictionary format (for methods like ESA, KvComp)
ucm_sparse_config:
  GSA: {}
```

However, the _UCM_SPARSE_AGENT fails to initialize properly within the vLLM worker process. As a result, has_ucm_sparse() returns False during model execution, and the sparse attention logic is skipped.
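For context on the symptom: the sparse hooks are gated on that agent, so a missing agent silently degrades to dense attention. A minimal, self-contained sketch of that chain (the names _UCM_SPARSE_AGENT and has_ucm_sparse come from this report; the real ucm call sites differ, this only shows why the sparse path is skipped):

```python
# Minimal illustration of the symptom described above; not the real ucm code.
_UCM_SPARSE_AGENT = None  # never set, because the patched initializer never runs


def has_ucm_sparse() -> bool:
    return _UCM_SPARSE_AGENT is not None


def attention_path() -> str:
    # Hypothetical guard: per this report, the real ucm hooks check has_ucm_sparse() similarly.
    if has_ucm_sparse():
        return "sparse (GSA) attention"
    return "dense attention"


print(attention_path())  # -> "dense attention", i.e. GSA is silently skipped
```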

Investigation reveals that the issue stems from a timing condition: ucm attempts to monkey-patch _init_worker_distributed_environment while the worker process is already executing that very function:

```python
NPUWorker._init_worker_distributed_environment = (
    patched_init_worker_distributed_environment
)
```

Root Cause Analysis

The failure is caused by the sequence of execution and the timing of the monkey patch application.

  • Execution Start: The vLLM worker process starts and calls NPUWorker._init_worker_distributed_environment (via init_device).
  • Triggering Import: Inside _init_worker_distributed_environment, the call to ensure_kv_transfer_initialized triggers the import of the ucm module.
  • Patch Application: The initialization of ucm executes _patch_worker_v1, which modifies NPUWorker._init_worker_distributed_environment to wrap it with ensure_ucm_sparse_initialized.
  • The Conflict:
    • At the moment the patch is applied, the Python interpreter is already executing the original _init_worker_distributed_environment (it is already on the call stack).
    • Rebinding the class attribute NPUWorker._init_worker_distributed_environment only affects subsequent calls to the method.
    • Since _init_worker_distributed_environment is a lifecycle method called exactly once during worker startup, the in-flight call completes its original body and returns.
    • The wrapper function (which contains the call to ensure_ucm_sparse_initialized) is therefore never executed (see the standalone sketch below).
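
A minimal toy script demonstrating these method-rebinding semantics. The names mirror the report (NPUWorker, _init_worker_distributed_environment), but this is not vLLM code; it only shows that rebinding a class attribute does not affect a call that is already on the stack:

```python
# Toy reproduction of the patching race described above (not vLLM code).


class NPUWorker:
    def _init_worker_distributed_environment(self):
        print("original body: start")
        # In the real bug, this is where ensure_kv_transfer_initialized()
        # imports ucm, which applies the monkey patch below.
        apply_patch()
        print("original body: finish")


def apply_patch():
    original = NPUWorker._init_worker_distributed_environment

    def patched(self):
        print("wrapper: before")  # would call ensure_ucm_sparse_initialized()
        original(self)
        print("wrapper: after")

    # Rebinding the class attribute only affects *future* lookups; the call
    # currently executing already resolved the original function object.
    NPUWorker._init_worker_distributed_environment = patched
    print("patch applied")


NPUWorker()._init_worker_distributed_environment()
# Output:
#   original body: start
#   patch applied
#   original body: finish
# "wrapper: before/after" never prints, because the method is only called this once.
```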

Stack Trace & Evidence

The following call stack confirms that ucm initialization (and thus the patching) occurs strictly inside the execution scope of the target method _init_worker_distributed_environment:

```text
<module> (/vllm-workspace/unified-cache-management/ucm/__init__.py:6)
import_module (/usr/local/python3.10.17/lib/python3.10/importlib/__init__.py:126)
create_connector_v1 (/vllm-workspace/vllm/vllm/distributed/kv_transfer/kv_connector/factory.py:71)
ensure_kv_transfer_initialized (/vllm-workspace/vllm/vllm/distributed/kv_transfer/kv_transfer_state.py:64)
_init_worker_distributed_environment (/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py:328)
init_device (/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py:142)
init_device (/vllm-workspace/vllm/vllm/worker/worker_base.py:606)
__init__ (/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py:361)
worker_main (/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py:465)
run (/usr/local/python3.10.17/lib/python3.10/multiprocessing/process.py:108)
_bootstrap (/usr/local/python3.10.17/lib/python3.10/multiprocessing/process.py:314)
_main (/usr/local/python3.10.17/lib/python3.10/multiprocessing/spawn.py:129)
spawn_main (/usr/local/python3.10.17/lib/python3.10/multiprocessing/spawn.py:116)
<module> (<string>:1)
```

Suggested Fix

The initialization hook should be moved to a lifecycle method that is guaranteed to execute after init_device returns.
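
A possible shape for that fix is sketched below. It assumes NPUWorker exposes a load_model() method that the executor invokes only after init_device() has returned (true of vLLM's v1 worker lifecycle, but worth confirming for vllm-ascend), and that ensure_ucm_sparse_initialized takes the worker's vllm_config, as the existing wrapper appears to. Both names come from this report; the helper is passed in as a parameter rather than imported because its module path is not shown above.

```python
# Sketch only, not a patch against the real ucm source: move the UCM hook onto
# a lifecycle method that the executor calls strictly after init_device().
# Assumptions (not confirmed by this report):
#   * NPUWorker exposes load_model(), invoked after init_device() in the v1 lifecycle.
#   * The worker carries a vllm_config attribute that the initializer needs.
from typing import Any, Callable

from vllm_ascend.worker.worker_v1 import NPUWorker


def patch_worker_load_model(ensure_ucm_sparse_initialized: Callable[[Any], None]) -> None:
    """Wrap NPUWorker.load_model with the UCM sparse initializer.

    `ensure_ucm_sparse_initialized` is the existing ucm helper named in this
    report; it is passed in here because its module path is not shown above.
    """
    original_load_model = NPUWorker.load_model

    def patched_load_model(self, *args, **kwargs):
        # init_device() -- and therefore _init_worker_distributed_environment --
        # has already returned by the time load_model() runs, so this wrapper is
        # the version that actually gets called.
        ensure_ucm_sparse_initialized(self.vllm_config)  # argument assumed; match the current wrapper
        return original_load_model(self, *args, **kwargs)

    NPUWorker.load_model = patched_load_model
```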
