Bug fixed in vLLM Impl by li2zhi · Pull Request #25 · Zefan-Cai/R-KV

li2zhi · 2026-02-06T06:04:39Z

Thank you very much for your work on R-KV!

Over the past few days, I have been experimenting with the vLLM-based implementation. During this process, I observed inference errors occurring in both batch_size = 1 and batch_size > 1 settings.

After tracing through the vLLM execution and scheduling logic, I identified two issues that lead to these errors:

1. num_dropped_tokens state is not reset after inference completes. As a result, stale num_dropped_tokens values may leak into subsequent inference runs.
2. In the batch_size > 1 case, num_dropped_tokens is not updated during batch condensation. When some requests finish earlier than others, vLLM condenses the remaining requests to eliminate gaps via self.input_batch.condense(). During this process, metadata such as num_computed_tokens is correctly updated, but num_dropped_tokens is not. Consequently, before the next forward pass, prepare_inputs computes an incorrect seq_len, which diverges from the actual value and triggers downstream errors.
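The two issues can be illustrated with a toy model. This is a minimal sketch and not the vLLM or R-KV code: the class ToyInputBatch, its methods, and kv_len are hypothetical names invented for illustration, and only the two counters num_computed_tokens and num_dropped_tokens come from the description above.

```python
import numpy as np


class ToyInputBatch:
    """Hypothetical stand-in for per-request batch state (not vLLM's class)."""

    def __init__(self, max_num_reqs):
        self.req_ids = [None] * max_num_reqs
        self.num_computed_tokens = np.zeros(max_num_reqs, dtype=np.int64)
        self.num_dropped_tokens = np.zeros(max_num_reqs, dtype=np.int64)

    def add_request(self, index, req_id, num_computed):
        # A new request assumes its slot is clean; it does not touch the
        # dropped-token counter, so stale values would survive here.
        self.req_ids[index] = req_id
        self.num_computed_tokens[index] = num_computed

    def remove_request(self, index):
        self.req_ids[index] = None
        self.num_computed_tokens[index] = 0
        # Issue 1: reset the counter so it cannot leak into the next request.
        self.num_dropped_tokens[index] = 0

    def condense(self):
        """Fill empty slots with live requests taken from the tail."""
        empty = [i for i, r in enumerate(self.req_ids) if r is None]
        last = len(self.req_ids) - 1
        for dst in empty:
            while last >= 0 and self.req_ids[last] is None:
                last -= 1
            if last <= dst:
                break
            self.req_ids[dst] = self.req_ids[last]
            self.num_computed_tokens[dst] = self.num_computed_tokens[last]
            # Issue 2: move the dropped-token counter together with the request.
            self.num_dropped_tokens[dst] = self.num_dropped_tokens[last]
            self.req_ids[last] = None
            self.num_computed_tokens[last] = 0
            self.num_dropped_tokens[last] = 0
            last -= 1

    def kv_len(self, index):
        # Tokens actually held in the KV cache for this slot.
        return int(self.num_computed_tokens[index] - self.num_dropped_tokens[index])


batch = ToyInputBatch(max_num_reqs=3)
batch.add_request(0, "a", 100)
batch.add_request(1, "b", 50)
batch.add_request(2, "c", 200)
batch.num_dropped_tokens[0] = 20  # request "a" was compressed
batch.num_dropped_tokens[2] = 64  # request "c" was compressed

batch.remove_request(0)  # "a" finishes first
batch.condense()         # "c" moves from slot 2 into slot 0
print(batch.req_ids, batch.kv_len(0), batch.kv_len(1))
```

If condense() skipped the num_dropped_tokens copy, slot 0 would report 200 cached tokens for request "c" instead of 136, which is the kind of seq_len divergence described in the second issue.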

I have implemented a minimal fix for both issues in this PR. With these changes, R-KV now runs correctly on vLLM in my experiments.

Thank you again for your excellent work, and I hope this contribution is helpful to the community.

li2zhi · 2026-02-12T01:53:37Z

Additionally, I encountered another issue when using a larger batch_size (e.g., 32 or 64). In such cases, the following error occurs almost every time:

  File "/root/autodl-tmp/vllm/vllm/v1/worker/block_table.py", line 176, in compute_slot_mapping
    block_table.compute_slot_mapping(req_indices, positions, is_occupied)
  File "/root/autodl-tmp/vllm/vllm/v1/worker/block_table.py", line 112, in compute_slot_mapping
    np.add(block_numbers * self.block_size,
ValueError: operands could not be broadcast together with shapes (8203,) (8203,) (8192,)

After further analysis, I found that as decoding progresses, num_computed_tokens keeps increasing. Although KV cache compression may be triggered during decoding, the resulting num_kv_cache_tokens (num_computed_tokens - num_dropped_tokens) can still eventually exceed the capacity of arange_np.

self.arange_np = np.arange(max(self.max_num_reqs + 1, self.max_model_len, self.max_num_tokens), dtype=np.int64)

Once this happens, a shape mismatch arises during the computation of occupied_indices, leading to the broadcasting error shown above. A similar issue also occurs when updating occupied_slot_mapping.
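The mechanism can be reproduced in isolation with NumPy. This is a sketch under assumptions: the capacity of 8192 and the requested length of 8203 are taken from the shapes in the traceback, the names block_offsets and checked_prefix are hypothetical, and the guard at the end is one possible way to fail early, not necessarily the change made in this PR.

```python
import numpy as np

# Assumed capacity, matching the (8192,) operand in the traceback.
MAX_MODEL_LEN = 8192
arange_np = np.arange(MAX_MODEL_LEN, dtype=np.int64)

# Assumed number of KV-cache positions requested in one step,
# matching the (8203,) operands in the traceback.
num_kv_cache_tokens = 8203

# Slicing past the end of a NumPy array does not raise; it silently truncates.
positions = arange_np[:num_kv_cache_tokens]
print(positions.shape)  # (8192,)

# Any array sized from the requested length no longer matches the slice.
block_offsets = np.zeros(num_kv_cache_tokens, dtype=np.int64)
broadcast_error = None
try:
    block_offsets + positions
except ValueError as exc:
    broadcast_error = str(exc)
print(broadcast_error)


def checked_prefix(arange, n):
    """Hypothetical guard: reject a request longer than the buffer."""
    if n > arange.shape[0]:
        raise ValueError(
            f"requested {n} positions but the buffer holds {arange.shape[0]}"
        )
    return arange[:n]
```

Because the truncation is silent, the failure surfaces only later as a broadcasting error, far from the slice that caused it. A length check at the point of slicing, or a larger buffer, makes the cause visible earlier.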

For reference, my experiments are based on the snap-kv branch:
https://github.com/yeyang-zhou/vllm/tree/snap-kv

li2zhi added 2 commits February 6, 2026 10:44

Bug fixed in vLLM Impl

1252b21

Bug fixed in vLLM Impl

8e90b4b

li2zhi wants to merge 2 commits into Zefan-Cai:main from li2zhi:main
