Checklist
- 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- 2. Please use English, otherwise it will be closed.
Motivation
KV cache is expensive in long contexts, and so far I have not found any existing KV cache compression options here. A related issue is #10585, but it seems very high-level.
I am considering adding this feature, with first support for the default single-node setup (SnapKV + flashinfer + RadixCache), following the proposal below. SnapKV is selected as the PoC since it is a typical method in KVPress and has nearly the best performance among its methods.
- Migrate the essential code from KVPress and adapt the hook to compress the SGLang KV cache by the compression ratio. The `token_to_kv_pool` elements should already be compressed after applying this. Since the KVPress hook is designed around the torch module hook, and SGLang's backends do not follow this pattern, we can choose one of the following (see the sketch after this list):
  - apply the compression methods inside these backends; this is faster because the compressed KV goes directly to the cache pool, but each backend needs to implement compression itself
  - after the backend has written the cache into the pool, take it out, compress it, and insert it back; this gives more unified logic but would be slightly slower
- Update the `req_to_token` cache with the compressed tokens; the token count needs to be adjusted by the compression ratio.
- Add server args and integration controls.
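As a rough illustration of the second option above (compressing after the backend has written the pool) and of the `req_to_token` adjustment, here is a minimal sketch. The tensors and the `compress_request_kv` helper are simplified stand-ins, not SGLang's real `token_to_kv_pool` / `req_to_token` structures:

```python
import torch

def compress_request_kv(kv_pool_k: torch.Tensor,
                        kv_pool_v: torch.Tensor,
                        req_to_token: torch.Tensor,
                        seq_len: int,
                        compress_fn) -> int:
    """Read one request's KV out of the pool, compress it, write the kept
    entries back, and shrink the request's token map. Returns the new
    token count. All names here are illustrative stand-ins."""
    slots = req_to_token[:seq_len]                 # pool slot ids for this request
    k = kv_pool_k[slots]                           # [seq_len, n_head, n_dim]
    v = kv_pool_v[slots]

    # compress_fn: any press that maps [seq_len, ...] -> [kept_len, ...] and
    # returns the indices of the tokens it kept (e.g. a SnapKV-style selector).
    k_c, v_c, keep_idx = compress_fn(k, v)
    new_len = k_c.shape[0]

    kept_slots = slots[keep_idx]                   # reuse the kept slots in place
    kv_pool_k[kept_slots] = k_c                    # no-op for pure selection presses,
    kv_pool_v[kept_slots] = v_c                    # needed if values are merged/updated

    # Shrink the request's mapping; in a real integration the dropped slots
    # would be returned to the allocator here.
    req_to_token[:new_len] = kept_slots
    return new_len
```

The first option would instead call the compression inside each attention backend before the KV is written to the pool, which avoids the read-back at the cost of per-backend changes.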
The short idea of KVPress is to compress [seq_len, n_head, n_dim] to [seq_len * compress_ratio, n_head, n_dim] during prefill; the first version should make this change visible in the `req_to_token` list.
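For intuition, here is a minimal SnapKV-style selection that produces exactly that shape change. This is a simplified sketch, not the actual KVPress implementation: it uses the observation window's keys as a stand-in for its query states, omits the score pooling, and the function name and parameters are made up for illustration:

```python
import torch

def snapkv_select(key: torch.Tensor, value: torch.Tensor,
                  compress_ratio: float = 0.5, window_size: int = 32):
    """key/value: [seq_len, n_head, n_dim] -> roughly
    [seq_len * compress_ratio, n_head, n_dim], plus the kept indices."""
    seq_len, n_head, n_dim = key.shape
    kept_len = max(window_size, int(seq_len * compress_ratio))
    if seq_len <= kept_len:
        return key, value, torch.arange(seq_len, device=key.device)

    # Score earlier tokens by how strongly the last `window_size` positions
    # attend to them (real SnapKV uses the window's query states and pools
    # the scores; the window keys are a stand-in here).
    window_k = key[-window_size:]                             # [w, h, d]
    past_k = key[:-window_size]                               # [s-w, h, d]
    attn = torch.einsum("whd,shd->hws", window_k, past_k) / (n_dim ** 0.5)
    scores = attn.softmax(dim=-1).sum(dim=(0, 1))             # [s-w]

    top = scores.topk(kept_len - window_size).indices.sort().values
    window_idx = torch.arange(seq_len - window_size, seq_len, device=key.device)
    keep_idx = torch.cat([top, window_idx])                   # always keep the window
    return key[keep_idx], value[keep_idx], keep_idx
```

A function like this could serve as the `compress_fn` in the earlier sketch; the length of `keep_idx` is what the request's `req_to_token` entry would shrink to.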
Related resources
No response