Description
Your current environment
The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.5.1
Is debug build: False
OS: openEuler 22.03 (LTS-SP4) (aarch64)
GCC version: (GCC) 10.3.1
Clang version: Could not collect
CMake version: version 4.0.3
Libc version: glibc-2.34
Python version: 3.10.17 (main, May 8 2025, 08:13:48) [GCC 10.3.1] (64-bit runtime)
Python platform: Linux-5.10.0-60.18.0.50.h1002.eulerosv2r11.aarch64-aarch64-with-glibc2.34
CPU:
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Vendor ID: HiSilicon
BIOS Vendor ID: HiSilicon
Model name: Kunpeng-920
BIOS Model name: HUAWEI Kunpeng 920 7265
Model: 0
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 4
Stepping: 0x1
Frequency boost: disabled
CPU max MHz: 3000.0000
CPU min MHz: 200.0000
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache: 16 MiB (256 instances)
L1i cache: 16 MiB (256 instances)
L2 cache: 128 MiB (256 instances)
L3 cache: 256 MiB (8 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63
NUMA node2 CPU(s): 64-95
NUMA node3 CPU(s): 96-127
NUMA node4 CPU(s): 128-159
NUMA node5 CPU(s): 160-191
NUMA node6 CPU(s): 192-223
NUMA node7 CPU(s): 224-255
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==27.0.0
[pip3] torch==2.5.1
[pip3] torch-npu==2.5.1.post1.dev20250619
[pip3] torchvision==0.20.1
[pip3] transformers==4.52.4
[conda] Could not collect
vLLM Version: 0.9.2
vLLM Ascend Version: 0.9.2rc1
ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_TILING_SIZE=10240
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_GLOBAL=0
ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_OPSRUNNER_KERNEL_CACHE_TYPE=3
ATB_RUNNER_POOL_SIZE=64
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_LAUNCH_KERNEL_WITH_TILING=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ATB_SHARE_MEMORY_NAME_SUFFIX=
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 25.0.rc1 Version: 25.0.rc1 |
+---------------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 0 910B4 | OK | 86.0 49 0 / 0 |
| 0 | 0000:C1:00.0 | 0 0 / 0 29017/ 32768 |
+===========================+===============+====================================================+
| 1 910B4 | OK | 86.5 49 0 / 0 |
| 0 | 0000:C2:00.0 | 0 0 / 0 29104/ 32768 |
+===========================+===============+====================================================+
| 2 910B4 | OK | 86.7 49 0 / 0 |
| 0 | 0000:81:00.0 | 0 0 / 0 29102/ 32768 |
+===========================+===============+====================================================+
| 3 910B4 | OK | 92.8 49 0 / 0 |
| 0 | 0000:82:00.0 | 0 0 / 0 29102/ 32768 |
+===========================+===============+====================================================+
| 4 910B4 | OK | 90.9 51 0 / 0 |
| 0 | 0000:01:00.0 | 0 0 / 0 2867 / 32768 |
+===========================+===============+====================================================+
| 5 910B4 | OK | 83.0 52 0 / 0 |
| 0 | 0000:02:00.0 | 0 0 / 0 2864 / 32768 |
+===========================+===============+====================================================+
| 6 910B4 | OK | 93.8 54 0 / 0 |
| 0 | 0000:41:00.0 | 0 0 / 0 2862 / 32768 |
+===========================+===============+====================================================+
| 7 910B4 | OK | 86.7 55 0 / 0 |
| 0 | 0000:42:00.0 | 0 0 / 0 2862 / 32768 |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===========================+===============+====================================================+
| 0 0 | 1926451 | | 26033 |
+===========================+===============+====================================================+
| 1 0 | 1926453 | | 26289 |
+===========================+===============+====================================================+
| 2 0 | 1926455 | | 26289 |
+===========================+===============+====================================================+
| 3 0 | 1926457 | | 26289 |
+===========================+===============+====================================================+
| No running processes found in NPU 4 |
+===========================+===============+====================================================+
| No running processes found in NPU 5 |
+===========================+===============+====================================================+
| No running processes found in NPU 6 |
+===========================+===============+====================================================+
| No running processes found in NPU 7 |
+===========================+===============+====================================================+
CANN:
package_name=Ascend-cann-toolkit
version=8.1.RC1
innerversion=V100R001C21SPC001B238
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21]
arch=aarch64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.1.RC1/aarch64-linux
Ascend-vLLM container version: quay.io/ascend/vllm-ascend:v0.9.2rc1-openeuler
UCM version: v0.1.0
🐛 Describe the bug
What the bug is
I was trying to enable the GSA implementation of the Sparse Attention feature, following the official documentation.
My attempt is as follows:
Bash command:
export ENABLE_SPARSE=TRUE
vllm serve /opt/Qwen3-32B/ \
--enforce-eager \
--max-model-len 131000 \
--tensor-parallel-size 4 \
--gpu_memory_utilization 0.87 \
--port 8225 \
--block-size 128 \
--no-enable-prefix-caching \
--kv-transfer-config \
'{
  "kv_connector": "UCMConnector",
  "kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
  "kv_role": "kv_both",
  "kv_connector_extra_config": {
    "UCM_CONFIG_FILE": "/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml"
  }
}'

UCM_CONFIG_FILE (contents):
ucm_connectors:
  - ucm_connector_name: "UcmNfsStore"
    ucm_connector_config:
      storage_backends: "/vllm-workspace/kv_storage"
      use_direct: false
      load_only_first_rank: false

# Enable UCM metrics so they can be monitored online via Grafana and Prometheus.
# metrics_config_path: "/workspace/unified-cache-management/examples/metrics/metrics_configs.yaml"

# Sparse attention configuration
# Format 1: Dictionary format (for methods like ESA, KvComp)
ucm_sparse_config:
  GSA: {}
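As a sanity check (not part of the bug itself), the config file can be parsed to confirm that the GSA sparse section is actually present; a minimal snippet, assuming PyYAML is installed:

```python
import yaml

CONFIG = "/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml"

with open(CONFIG) as f:
    cfg = yaml.safe_load(f)

# Expect: "UcmNfsStore" and {'GSA': {}} if the file matches the snippet above.
print(cfg["ucm_connectors"][0]["ucm_connector_name"])
print(cfg.get("ucm_sparse_config"))
```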
However, the _UCM_SPARSE_AGENT fails to initialize properly within the vLLM worker process. As a result, has_ucm_sparse() returns False during model execution, and the sparse attention logic is skipped.
Investigation reveals that the issue stems from a timing condition where ucm attempts to monkey-patch _init_worker_distributed_environment while the process is already inside the execution of that very function:
unified-cache-management/ucm/integration/vllm/patch/patch_funcs/v092/vllm_ascend_patch.py
Lines 1441 to 1443 (commit 5ba2684):

NPUWorker._init_worker_distributed_environment = (
    patched_init_worker_distributed_environment
)
Root Cause Analysis
The failure is caused by the sequence of execution and the timing of the monkey patch application.
- Execution Start: The vLLM worker process starts and calls NPUWorker._init_worker_distributed_environment (via init_device).
- Triggering Import: Inside _init_worker_distributed_environment, the call to ensure_kv_transfer_initialized triggers the import of the ucm module.
- Patch Application: The initialization of ucm executes _patch_worker_v1, which modifies NPUWorker._init_worker_distributed_environment to wrap it with ensure_ucm_sparse_initialized.
- The Conflict (see the minimal reproduction after this list):
  - At the moment the patch is applied, the Python interpreter is already executing the original _init_worker_distributed_environment (it is in the current stack frame).
  - Replacing the class attribute NPUWorker._init_worker_distributed_environment only affects subsequent calls to this method.
  - Since _init_worker_distributed_environment is a lifecycle method called only once during worker startup, the function completes its original execution body and returns.
  - The wrapper function (which contains the call to ensure_ucm_sparse_initialized) is never executed.
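The following standalone sketch (illustrative names, not UCM code) reproduces the mechanism: rebinding a class method while it is already executing does not affect the in-flight call, and a one-shot lifecycle method is never called again, so the wrapper never runs.

```python
class Worker:
    def init_env(self):
        print("original init_env: start")
        apply_patch()  # analogous to importing ucm inside ensure_kv_transfer_initialized
        print("original init_env: finish")


def apply_patch():
    original = Worker.init_env

    def wrapper(self):
        original(self)
        print("wrapper: ensure_ucm_sparse_initialized would run here")

    # Rebinding only affects *future* lookups of Worker.init_env; the call
    # currently on the stack keeps running the original body.
    Worker.init_env = wrapper
    print("patch applied (while init_env is still on the stack)")


Worker().init_env()
# Output:
#   original init_env: start
#   patch applied (while init_env is still on the stack)
#   original init_env: finish
# The wrapper never runs, because init_env is only called once per worker.
```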
Stack Trace & Evidence
The following call stack confirms that ucm initialization (and thus the patching) occurs strictly inside the execution scope of the target method _init_worker_distributed_environment:
<module> (/vllm-workspace/unified-cache-management/ucm/__init__.py:6)
import_module (/usr/local/python3.10.17/lib/python3.10/importlib/__init__.py:126)
create_connector_v1 (/vllm-workspace/vllm/vllm/distributed/kv_transfer/kv_connector/factory.py:71)
ensure_kv_transfer_initialized (/vllm-workspace/vllm/vllm/distributed/kv_transfer/kv_transfer_state.py:64)
_init_worker_distributed_environment (/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py:328)
init_device (/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py:142)
init_device (/vllm-workspace/vllm/vllm/worker/worker_base.py:606)
__init__ (/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py:361)
worker_main (/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py:465)
run (/usr/local/python3.10.17/lib/python3.10/multiprocessing/process.py:108)
_bootstrap (/usr/local/python3.10.17/lib/python3.10/multiprocessing/process.py:314)
_main (/usr/local/python3.10.17/lib/python3.10/multiprocessing/spawn.py:129)
spawn_main (/usr/local/python3.10.17/lib/python3.10/multiprocessing/spawn.py:116)
<module> (<string>:1)
Suggested Fix
The initialization hook should be moved to a lifecycle method that is guaranteed to execute after init_device returns.
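One possible shape for this fix (illustrative only; the method name load_model and the argument passed to ensure_ucm_sparse_initialized are assumptions, not verified UCM / vllm-ascend API): wrap a worker lifecycle method that has not yet started executing at patch time, so the wrapper is guaranteed to run.

```python
# Sketch of the suggested patching strategy, not actual UCM code.
from functools import wraps

def patch_npu_worker(NPUWorker, ensure_ucm_sparse_initialized):
    """Wrap a lifecycle method that runs strictly after init_device returns."""
    original_load_model = NPUWorker.load_model

    @wraps(original_load_model)
    def patched_load_model(self, *args, **kwargs):
        # By the time load_model is called, _init_worker_distributed_environment
        # has already returned, so the kv connector and distributed state exist.
        ensure_ucm_sparse_initialized(self.vllm_config)  # argument is an assumption
        return original_load_model(self, *args, **kwargs)

    NPUWorker.load_model = patched_load_model
```

Alternatively, the sparse agent could be initialized directly from the UCMConnector's worker-side setup (the object that ensure_kv_transfer_initialized is creating at that moment), avoiding the monkey patch entirely.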