Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction

Computational flow of Fast KVzip.

What is New?

Fast KVzip trains a lightweight gating mechanism for KV cache compression across both prefill and decoding stages.

Our method achieves near-lossless performance on general tasks with up to a 70% KV cache eviction ratio while significantly improving attention efficiency.
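To make the eviction ratio concrete, here is a minimal sketch of score-based eviction, assuming each cached KV pair already carries a scalar importance score. The function name `keep_indices` and the example scores are hypothetical and only illustrate the bookkeeping. The actual Fast KVzip selection may be applied per head or per layer and may differ in detail.

```python
def keep_indices(scores, eviction_ratio):
    """Return sorted positions of the KV entries to retain.

    scores: one importance score per cached KV pair (higher = more important).
    eviction_ratio: fraction of entries to drop, e.g. 0.7 evicts 70%.
    """
    if not 0.0 <= eviction_ratio < 1.0:
        raise ValueError("eviction_ratio must be in [0, 1)")
    n_evict = round(len(scores) * eviction_ratio)
    n_keep = max(1, len(scores) - n_evict)
    # Rank positions by score, highest first, and keep the top n_keep.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the retained cache stays in sequence order.
    return sorted(ranked[:n_keep])


# Hypothetical scores for a 10-token context.
scores = [0.91, 0.05, 0.40, 0.02, 0.77, 0.10, 0.03, 0.64, 0.08, 0.01]
kept = keep_indices(scores, eviction_ratio=0.7)
print(kept)  # [0, 4, 7]
```

With a 70% eviction ratio, only the three highest-scoring positions out of ten remain in the cache, which is the same as a 30% KV budget ratio.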

We propose a Low-Rank Sink Attention architecture for gates and directly distill KVzip’s importance scores to train them in under one H100 hour.
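The distillation idea can be illustrated with a toy example: a small gate is fit so that its output matches importance scores supplied by a teacher. The sketch below is not the Low-Rank Sink Attention architecture itself. It assumes a generic low-rank gate (project the feature to `rank` dimensions, then combine into one logit), and it uses synthetic teacher scores as a stand-in for KVzip's importance scores. All names (`gate_score`, `distill`, `w_teacher`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate_score(A, b, c, h):
    """Assumed low-rank gate: project h with A (rank x d), combine with b, add bias c."""
    low = [sum(a * x for a, x in zip(row, h)) for row in A]
    return sigmoid(sum(bk * lk for bk, lk in zip(b, low)) + c), low


def bce(p, t, eps=1e-9):
    """Binary cross-entropy against a soft target t in [0, 1]."""
    return -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))


def distill(features, teacher, rank=2, lr=0.2, steps=300, seed=0):
    """Full-batch gradient descent fitting the gate to the teacher scores."""
    rng = random.Random(seed)
    d, n = len(features[0]), len(features)
    A = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(rank)]
    b = [rng.uniform(-0.5, 0.5) for _ in range(rank)]
    c = 0.0
    losses = []
    for _ in range(steps):
        gA = [[0.0] * d for _ in range(rank)]
        gb = [0.0] * rank
        gc = 0.0
        total = 0.0
        for h, t in zip(features, teacher):
            p, low = gate_score(A, b, c, h)
            total += bce(p, t)
            err = (p - t) / n  # gradient of the mean BCE w.r.t. the logit
            gc += err
            for k in range(rank):
                gb[k] += err * low[k]
                for j in range(d):
                    gA[k][j] += err * b[k] * h[j]
        losses.append(total / n)  # loss measured before this step's update
        c -= lr * gc
        for k in range(rank):
            b[k] -= lr * gb[k]
            for j in range(d):
                A[k][j] -= lr * gA[k][j]
    return (A, b, c), losses


# Synthetic data: random features and a made-up teacher (NOT real KVzip scores).
data_rng = random.Random(1)
features = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(64)]
w_teacher = [1.5, -1.0, 0.5, 0.0]
teacher = [sigmoid(sum(w * x for w, x in zip(w_teacher, h))) for h in features]

params, losses = distill(features, teacher)
print(f"distillation loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Once trained, such a gate scores each KV pair in a single cheap forward computation, which is why a distilled gate can be used during both prefill and decoding without running the teacher's scoring procedure at inference time.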

Inference Efficiency

Prefill and decoding efficiency of Qwen2.5-7B-1M using a 30% KV budget ratio with PyTorch and FlashAttention-2 on an H100 GPU. Points in the plot correspond to context lengths of 160K, 240K, and 320K. KVzip provides a decoding speed similar to Fast KVzip.
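As a rough back-of-the-envelope check on what a 30% KV budget means at these context lengths, the sketch below estimates KV cache size. The model dimensions (28 layers, 4 KV heads, head dimension 128) are assumptions based on our reading of the Qwen2.5-7B configuration and should be verified against the model's config file. The sketch also assumes 16-bit cache values and that "K" means 1,000 tokens. It estimates storage only and says nothing about the measured latencies in the plot.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to store keys and values for num_tokens cached positions."""
    # Leading factor 2: one key vector and one value vector per head per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed model dimensions (not taken from this page).
ASSUMED_DIMS = dict(num_layers=28, num_kv_heads=4, head_dim=128)
BUDGET_PERCENT = 30  # KV budget ratio from the caption above

for k_tokens in (160, 240, 320):
    n_tokens = k_tokens * 1000
    n_kept = n_tokens * BUDGET_PERCENT // 100
    full_gb = kv_cache_bytes(n_tokens, **ASSUMED_DIMS) / 1e9
    kept_gb = kv_cache_bytes(n_kept, **ASSUMED_DIMS) / 1e9
    print(f"{k_tokens}K tokens: full cache {full_gb:.2f} GB, 30% budget {kept_gb:.2f} GB")
```

Under these assumptions each cached token costs 57,344 bytes, so the cache grows linearly with context length and a 30% budget shrinks it by the same factor at every length.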

KV Cache Compression Performance

Prefill-intensive benchmark results with Qwen2.5-7B-Instruct-1M.

Decoding-intensive benchmark results with Qwen2.5-7B-Instruct-1M.

BibTeX

@article{kim2026fastkvzip,
  title={Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction},
  author={Kim, Jang-Hyun and Han, Dongyoon and Yun, Sangdoo},
  journal={arXiv preprint arXiv:2601.17668},
  year={2026}
}

KV Importance Score Visualization

Importance score visualization panels: Fast KVzip and KVzip.
