Computational flow of Fast KVzip.
What is New?
Fast KVzip trains a lightweight gating mechanism for KV cache compression across both prefill
and decoding stages.
Our method achieves near-lossless performance on general tasks with up to a 70% KV cache eviction
ratio while significantly improving attention efficiency.
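To make the eviction-ratio claim concrete: compressing a KV cache to a given budget amounts to keeping only the highest-scoring entries and discarding the rest. The sketch below is a minimal NumPy illustration assuming per-token importance scores are already available; `evict_kv_cache` and its shapes are hypothetical, not the repository's API.

```python
import numpy as np

def evict_kv_cache(keys, values, scores, budget_ratio=0.3):
    """Keep only the top `budget_ratio` fraction of KV pairs by importance.

    keys, values: arrays of shape (seq_len, head_dim)
    scores:       per-token importance scores, shape (seq_len,)
    """
    seq_len = keys.shape[0]
    keep = max(1, int(seq_len * budget_ratio))
    # Indices of the highest-scoring tokens, restored to original order.
    idx = np.sort(np.argsort(scores)[-keep:])
    return keys[idx], values[idx]

# Example: a 10-token cache at a 30% budget keeps the 3 top-scored tokens.
k = np.random.randn(10, 4)
v = np.random.randn(10, 4)
s = np.arange(10.0)  # token 9 scored highest in this toy example
ck, cv = evict_kv_cache(k, v, s, budget_ratio=0.3)
print(ck.shape)  # (3, 4)
```

Restoring the original token order after selection keeps positional information consistent for subsequent attention over the compressed cache.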
We propose a Low-Rank Sink Attention architecture for the gates and train them by directly distilling KVzip's importance scores, requiring under one H100 GPU-hour.
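A rough sketch of the distillation idea, under stated assumptions: a low-rank gate scores each cached key and is trained by regressing onto teacher importance scores (a stand-in for KVzip's scores). The rank-`r` factorization, synthetic teacher, and MSE objective here are all illustrative simplifications, not the actual Low-Rank Sink Attention gate.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 64, 8, 512        # head dim, gate rank, number of cached tokens

# Teacher importance scores (stand-in for KVzip's scores): here a
# synthetic linear function of the keys, unknown to the student gate.
keys = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
teacher = keys @ w_true

# Student gate: a rank-r factorized scorer, score(k) = k @ A @ b,
# with (d + 1) * r parameters instead of a dense d-dimensional probe.
A = 0.3 * rng.standard_normal((d, r))
b = 0.3 * rng.standard_normal(r)

mse_init = np.mean((keys @ A @ b - teacher) ** 2)

lr = 1e-2
for _ in range(1500):
    h = keys @ A            # (n, r) low-rank features
    err = h @ b - teacher   # (n,)  distillation residual
    # Gradients of the MSE loss 0.5 * mean(err ** 2).
    gb = h.T @ err / n
    gA = keys.T @ np.outer(err, b) / n
    A -= lr * gA
    b -= lr * gb

mse_final = np.mean((keys @ A @ b - teacher) ** 2)
```

Because the student only fits a small factorized scorer against precomputed teacher scores, rather than backpropagating through the base model, this style of distillation is cheap; that is consistent with the short training budget claimed above.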
Inference Efficiency
Prefill and decoding efficiency of Qwen2.5-7B-1M at a 30% KV budget ratio, measured with PyTorch and FlashAttention-2 on an H100 GPU. Points in the plot correspond to context lengths of 160K, 240K, and 320K. KVzip attains decoding speed similar to that of Fast KVzip.
KV Cache Compression Performance
Prefill-intensive benchmark results with Qwen2.5-7B-Instruct-1M.
Decoding-intensive benchmark results with Qwen2.5-7B-Instruct-1M.