Over the past few days, I have been experimenting with the vLLM-based implementation. During this process, I observed inference errors with both `batch_size = 1` and `batch_size > 1`.
After tracing through the vLLM execution and scheduling logic, I identified two issues that lead to these errors:
1. The `num_dropped_tokens` state is not reset after inference completes. As a result, stale `num_dropped_tokens` values may leak into subsequent inference runs.
2. In the `batch_size > 1` case, `num_dropped_tokens` is not updated during batch condensation.
When some requests finish earlier than others, vLLM condenses the remaining requests to eliminate gaps via `self.input_batch.condense()`. During this process, metadata such as `num_computed_tokens` is correctly updated, but `num_dropped_tokens` is not. Consequently, before the next forward pass, `prepare_inputs` computes an incorrect `seq_len`, which diverges from the actual value and triggers downstream errors.
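The condensation problem can be illustrated with a toy model. This is a minimal sketch, not vLLM's actual `InputBatch`: the class `ToyInputBatch`, its methods, the order-preserving compaction, and the token counts are all hypothetical and chosen only to show how `seq_len` diverges when `num_dropped_tokens` is not moved together with `num_computed_tokens`.

```python
from typing import List, Optional

import numpy as np


class ToyInputBatch:
    """Hypothetical per-slot batch state; a toy model, not vLLM's InputBatch."""

    def __init__(self, max_num_reqs: int) -> None:
        self.req_ids: List[Optional[str]] = [None] * max_num_reqs
        self.num_computed_tokens = np.zeros(max_num_reqs, dtype=np.int64)
        self.num_dropped_tokens = np.zeros(max_num_reqs, dtype=np.int64)

    def add(self, slot: int, req_id: str, computed: int, dropped: int) -> None:
        self.req_ids[slot] = req_id
        self.num_computed_tokens[slot] = computed
        self.num_dropped_tokens[slot] = dropped

    def remove(self, slot: int) -> None:
        # Only the request id is cleared; the counters keep their old values.
        self.req_ids[slot] = None

    def condense(self, move_dropped: bool) -> None:
        """Compact live requests toward slot 0, preserving their order."""
        live = [i for i, r in enumerate(self.req_ids) if r is not None]
        for dst, src in enumerate(live):
            if dst == src:
                continue
            self.req_ids[dst], self.req_ids[src] = self.req_ids[src], None
            self.num_computed_tokens[dst] = self.num_computed_tokens[src]
            if move_dropped:
                # The step that is missing in the buggy variant.
                self.num_dropped_tokens[dst] = self.num_dropped_tokens[src]

    def seq_len(self, slot: int) -> int:
        # Tokens actually held in the KV cache for this slot.
        return int(self.num_computed_tokens[slot] - self.num_dropped_tokens[slot])


def seq_len_after_condense(move_dropped: bool) -> int:
    batch = ToyInputBatch(max_num_reqs=2)
    batch.add(0, "a", computed=100, dropped=40)
    batch.add(1, "b", computed=200, dropped=120)
    batch.remove(0)               # request "a" finishes first
    batch.condense(move_dropped)  # request "b" moves from slot 1 to slot 0
    return batch.seq_len(0)


print(seq_len_after_condense(move_dropped=False))  # 160: uses the stale 40 from "a"
print(seq_len_after_condense(move_dropped=True))   # 80: the true cache length of "b"
```

In this toy run, request "b" really holds 200 - 120 = 80 tokens, but the variant that skips `num_dropped_tokens` reports 160 because slot 0 still carries the counter of the finished request.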
I have implemented a minimal fix for both issues in this PR. With these changes, R-KV now runs correctly on vLLM in my experiments.
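For the first issue, the effect of a missing reset can be sketched in isolation. The function name, the single-slot arrays, and the token counts below are hypothetical; this is not the code changed in this PR, only an illustration of how a stale `num_dropped_tokens` value leaks into the next request that reuses the same slot.

```python
import numpy as np


def seq_len_of_next_request(reset_on_finish: bool) -> int:
    """Return the cache length computed for a second request reusing slot 0."""
    num_computed = np.zeros(1, dtype=np.int64)
    num_dropped = np.zeros(1, dtype=np.int64)

    # First request: 300 tokens computed, 180 of them evicted by compression.
    num_computed[0] = 300
    num_dropped[0] = 180

    # The first request finishes and its slot is released.
    if reset_on_finish:
        num_dropped[0] = 0

    # A second request reuses slot 0; only the computed count is re-initialised.
    num_computed[0] = 50
    return int(num_computed[0] - num_dropped[0])


print(seq_len_of_next_request(reset_on_finish=False))  # -130: stale counter leaked
print(seq_len_of_next_request(reset_on_finish=True))   # 50: matches the new prompt
```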
Thank you again for your excellent work, and I hope this contribution is helpful to the community.
Additionally, I encountered another issue when using a larger `batch_size` (e.g., 32 or 64). In such cases, the following error occurs almost every time:
    File "/root/autodl-tmp/vllm/vllm/v1/worker/block_table.py", line 176, in compute_slot_mapping
      block_table.compute_slot_mapping(req_indices, positions, is_occupied)
    File "/root/autodl-tmp/vllm/vllm/v1/worker/block_table.py", line 112, in compute_slot_mapping
      np.add(block_numbers * self.block_size,
    ValueError: operands could not be broadcast together with shapes (8203,) (8203,) (8192,)
After further analysis, I found that as decoding progresses, `num_computed_tokens` keeps increasing. Although KV cache compression may be triggered during decoding, the resulting `num_kv_cache_tokens` (`num_computed_tokens - num_dropped_tokens`) can still eventually exceed the capacity of `arange_np`.
Once this happens, a shape mismatch arises during the computation of `occupied_indices`, leading to the broadcasting error shown above. A similar issue also occurs when updating `occupied_slot_mapping`.
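The shape of this failure can be reproduced with plain NumPy. The sketch below is not vLLM's `block_table.py`: the function, the identity block layout, and the buffer are hypothetical, and it assumes the mismatch comes from slicing a preallocated buffer past its end, which NumPy truncates silently instead of raising.

```python
import numpy as np


def compute_slot_mapping(num_tokens: int, capacity: int, block_size: int = 16) -> np.ndarray:
    """Toy slot mapping that writes into a fixed-size, preallocated buffer."""
    out_buf = np.zeros(capacity, dtype=np.int64)  # sized once, up front

    positions = np.arange(num_tokens)         # one entry per KV-cache token
    block_numbers = positions // block_size   # identity block table (assumption)
    block_offsets = positions % block_size

    # Slicing past the end does not fail: the view has min(num_tokens, capacity) items.
    out = out_buf[:num_tokens]

    # Raises ValueError when the inputs are longer than the truncated output view.
    np.add(block_numbers * block_size, block_offsets, out=out)
    return out


print(compute_slot_mapping(num_tokens=8, capacity=16))

try:
    compute_slot_mapping(num_tokens=8203, capacity=8192)
except ValueError as exc:
    print("ValueError:", exc)
```

With 8203 tokens and a capacity of 8192, both inputs have shape `(8203,)` while the output view has shape `(8192,)`, the same combination of shapes as in the traceback above. Under this assumption, either sizing the buffers for the worst-case token count or checking `num_tokens <= capacity` before slicing would surface the problem earlier.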
Thank you very much for your work on R-KV!