[Feature] KVPress integration proposal #12607

@ai-easy-cpu

Description

Motivation

The KV cache is expensive in long contexts, and so far I have not found any KV cache compression options in SGLang. A related issue is #10585, but it is very high-level.

I am considering adding this feature here, with first support for the default single-node setup (SnapKV + flashinfer + RadixCache), following the proposal below. SnapKV is selected as the POC since it is a representative method in KVPress and has nearly the best performance among them.

  • Migrate the essential code from KVPress and modify the hook to compress the SGLang KV cache by the compression ratio; elements of token_to_kv_pool should already be compressed once this is applied. Since the KVPress hook is built on torch.nn.Module hooks, which SGLang's attention backends do not follow, we can choose one of the following:
    • apply the compression methods inside each backend; this is faster because the compressed KV goes directly into the cache pool, but every backend must implement compression itself
    • after the backend writes the cache into the pool, take it out, compress it, and insert it back; this gives more unified logic but is slightly slower
  • Update the req_to_token cache with the compressed tokens; the token count needs to be adjusted by the compression ratio.
  • Add server args and integration controls.
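A minimal sketch of the second option (unified post-pool compression), assuming a hypothetical flat pool layout and index bookkeeping; `pool_k`/`pool_v`, `req_token_indices`, and `keep_mask` are illustrative names, not existing SGLang APIs:

```python
import torch

def compress_req_kv(pool_k, pool_v, req_token_indices, keep_mask):
    """Option 2 sketch: after the attention backend has written the prefill
    KV into the pool, drop the pruned positions for one request and shrink
    its req_to_token row accordingly.

    pool_k / pool_v:   [pool_size, n_head, head_dim]  (hypothetical layout)
    req_token_indices: [seq_len]  pool slots this request owns
    keep_mask:         [seq_len]  bool, True for positions to keep
    """
    kept_slots = req_token_indices[keep_mask]    # surviving pool slots
    freed_slots = req_token_indices[~keep_mask]  # slots to return to the allocator
    # The kept KV stays in place; only the request's token map shrinks,
    # so downstream scheduling must use the compressed length.
    k = pool_k[kept_slots]
    v = pool_v[kept_slots]
    return k, v, kept_slots, freed_slots
```

The freed slots would go back to the token allocator, and any RadixCache entries covering the pruned positions would need to be invalidated or rebuilt, which is part of why this path is slightly slower than compressing inside each backend.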

The core idea of KVPress is to compress the KV cache from [seq_len, n_head, n_dim] to [seq_len * compress_ratio, n_head, n_dim] during prefill; the first version should make this change visible in the req_to_token list.
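The SnapKV-style selection behind that shape change can be sketched as follows. This is an illustrative simplification, not the KVPress implementation: positions are scored by the attention mass they receive from a small observation window of the last queries, and the top `seq_len * compress_ratio` positions are kept (the window itself is always retained):

```python
import torch

def snapkv_select(keys, queries, compress_ratio, window=8):
    """Score each prefill position by the attention it receives from the
    last `window` queries, then keep the top seq_len * compress_ratio
    positions. Returns the sorted indices of the kept positions.

    keys / queries: [seq_len, n_head, head_dim]
    """
    seq_len, n_head, head_dim = keys.shape
    obs_q = queries[-window:]                              # observation window
    # Attention of observation queries over all keys, per head.
    attn = torch.einsum("whd,shd->hws", obs_q, keys) / head_dim ** 0.5
    attn = attn.softmax(dim=-1)                            # [n_head, window, seq_len]
    score = attn.sum(dim=(0, 1))                           # pooled votes per position
    score[-window:] = float("inf")                         # always keep the window
    n_keep = max(window, int(seq_len * compress_ratio))
    return score.topk(n_keep).indices.sort().values
```

The returned indices are exactly what the req_to_token update above needs: a request that prefilled `seq_len` tokens keeps only `seq_len * compress_ratio` of them.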

Related resources

No response
