Bug fixed in vLLM Impl #25

Open
li2zhi wants to merge 2 commits into Zefan-Cai:main from li2zhi:main

@li2zhi li2zhi commented Feb 6, 2026

Thank you very much for your work on R-KV!

Over the past few days, I have been experimenting with the vLLM-based implementation. During this process, I observed inference errors occurring in both batch_size = 1 and batch_size > 1 settings.

After tracing through the vLLM execution and scheduling logic, I identified two issues that lead to these errors:

  1. num_dropped_tokens state is not reset after inference completes. As a result, stale num_dropped_tokens values may leak into subsequent inference runs.
  2. In the batch_size > 1 case, num_dropped_tokens is not updated during batch condensation.
    When some requests finish earlier than others, vLLM condenses the remaining requests to eliminate gaps via self.input_batch.condense(). During this process, metadata such as num_computed_tokens is correctly updated, but num_dropped_tokens is not. Consequently, before the next forward pass, prepare_inputs computes an incorrect seq_len, which diverges from the actual value and triggers downstream errors.
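The two issues above can be illustrated with a small, self-contained sketch. `ToyInputBatch` and its methods are hypothetical stand-ins, not vLLM's actual classes or the code in this PR; the only assumptions taken from the discussion are that per-request counters live in per-slot arrays and that the KV length used for `seq_len` is `num_computed_tokens - num_dropped_tokens`.

```python
import numpy as np


class ToyInputBatch:
    """Hypothetical stand-in for a persistent batch (not vLLM's real class)."""

    def __init__(self, max_num_reqs):
        self.req_ids = [None] * max_num_reqs
        self.num_computed_tokens = np.zeros(max_num_reqs, dtype=np.int64)
        self.num_dropped_tokens = np.zeros(max_num_reqs, dtype=np.int64)

    def add(self, slot, req_id, computed, dropped):
        self.req_ids[slot] = req_id
        self.num_computed_tokens[slot] = computed
        self.num_dropped_tokens[slot] = dropped

    def remove(self, slot):
        # Issue 1: reset BOTH counters when a request finishes, so a stale
        # num_dropped_tokens value cannot leak into the next request.
        self.req_ids[slot] = None
        self.num_computed_tokens[slot] = 0
        self.num_dropped_tokens[slot] = 0

    def _move(self, src, dst):
        # Issue 2: every per-slot array must move together during condensation.
        self.req_ids[dst] = self.req_ids[src]
        self.num_computed_tokens[dst] = self.num_computed_tokens[src]
        self.num_dropped_tokens[dst] = self.num_dropped_tokens[src]
        self.remove(src)

    def condense(self):
        """Fill empty slots with live requests taken from the tail."""
        empty = [i for i, r in enumerate(self.req_ids) if r is None]
        last = len(self.req_ids) - 1
        for dst in empty:
            while last >= 0 and self.req_ids[last] is None:
                last -= 1
            if last <= dst:
                break
            self._move(last, dst)
            last -= 1

    def kv_len(self, slot):
        # Assumed definition: tokens still held in the KV cache for this slot.
        return int(self.num_computed_tokens[slot] - self.num_dropped_tokens[slot])


batch = ToyInputBatch(max_num_reqs=4)
batch.add(0, "A", computed=100, dropped=10)
batch.add(1, "B", computed=200, dropped=50)
batch.add(2, "C", computed=300, dropped=120)

batch.remove(1)   # request B finishes early, leaving a gap at slot 1
batch.condense()  # request C moves from slot 2 into slot 1

print(batch.req_ids)    # ['A', 'C', None, None]
print(batch.kv_len(1))  # 180
```

If `_move` copied `num_computed_tokens` but not `num_dropped_tokens`, slot 1 would report 300 - 0 = 300 instead of 180, which is the kind of `seq_len` divergence described above.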

I have implemented a minimal fix for both issues in this PR. With these changes, R-KV now runs correctly on vLLM in my experiments.

Thank you again for your excellent work, and I hope this contribution is helpful to the community.

li2zhi commented Feb 12, 2026

Additionally, I encountered another issue when using a larger batch_size (e.g., 32 or 64). In such cases, the following error occurs almost every time:

  File "/root/autodl-tmp/vllm/vllm/v1/worker/block_table.py", line 176, in compute_slot_mapping
    block_table.compute_slot_mapping(req_indices, positions, is_occupied)
  File "/root/autodl-tmp/vllm/vllm/v1/worker/block_table.py", line 112, in compute_slot_mapping
    np.add(block_numbers * self.block_size,
ValueError: operands could not be broadcast together with shapes (8203,) (8203,) (8192,)

After further analysis, I found that as decoding progresses, num_computed_tokens keeps increasing. Although KV cache compression may be triggered during decoding, the resulting num_kv_cache_tokens (num_computed_tokens - num_dropped_tokens) can still eventually exceed the capacity of arange_np.

self.arange_np = np.arange(max(self.max_num_reqs + 1, self.max_model_len, self.max_num_tokens), dtype=np.int64)

Once this happens, a shape mismatch arises during the computation of occupied_indices, leading to the broadcasting error shown above. A similar issue also occurs when updating occupied_slot_mapping.
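The failure mode can be reproduced in isolation with plain NumPy. This is a minimal sketch, not the code from `block_table.py`: the sizes 8192 and 8203 are taken from the error message, while `block_size = 16`, the zero-filled inputs, and the assumption that the result is written into a slice of a preallocated buffer are illustrative only.

```python
import numpy as np

CAPACITY = 8192    # assumed size of the preallocated buffers (third shape in the error)
num_tokens = 8203  # assumed number of requested positions (first two shapes in the error)
block_size = 16    # hypothetical value, for illustration only

arange_np = np.arange(CAPACITY, dtype=np.int64)

# Slicing past the end of a NumPy array does not raise; it silently truncates.
indices = arange_np[:num_tokens]
print(indices.shape)  # (8192,)

block_numbers = np.zeros(num_tokens, dtype=np.int64)
block_offsets = np.zeros(num_tokens, dtype=np.int64)
out_buffer = np.empty(CAPACITY, dtype=np.int64)

# The inputs have 8203 elements, but the truncated output slice has only 8192,
# so the element-wise add cannot be written into it.
try:
    np.add(block_numbers * block_size, block_offsets, out=out_buffer[:num_tokens])
    failed = False
except ValueError as exc:
    failed = True
    print(type(exc).__name__, exc)
```

Because the truncation is silent, the mismatch only surfaces at the first operation that combines the truncated slice with a full-length array. An explicit length check before slicing would turn this into an earlier and clearer error.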

For reference, my experiments are based on the snap-kv branch:
https://github.com/yeyang-zhou/vllm/tree/snap-kv
