Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction

Jang-Hyun Kim, Dongyoon Han, Sangdoo Yun
NAVER AI Lab
Computational flow of Fast KVzip.
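To make the budget idea concrete, here is a minimal sketch of budget-based KV eviction: each cached position gets an importance score and only the top fraction is kept per head. All names, shapes, and the scoring tensor are illustrative assumptions for exposition, not the released Fast KVzip implementation; in particular, the gate scores are assumed to come from the method's gating mechanism and are faked with random values here.

```python
import torch

def evict_kv(keys, values, gate_scores, budget_ratio=0.3):
    """Keep the top `budget_ratio` fraction of KV positions per head.

    keys, values: [num_heads, seq_len, head_dim]
    gate_scores:  [num_heads, seq_len], higher = more important
                  (stand-in for the method's learned gate outputs).
    """
    num_heads, seq_len, head_dim = keys.shape
    budget = max(1, int(seq_len * budget_ratio))
    # Pick the `budget` highest-scoring positions in each head, then
    # re-sort the kept indices so temporal order is preserved.
    keep = gate_scores.topk(budget, dim=-1).indices.sort(dim=-1).values
    idx = keep.unsqueeze(-1).expand(-1, -1, head_dim)
    return keys.gather(1, idx), values.gather(1, idx)

# Toy example: 28 heads, 1024 cached positions, head dim 128.
k, v = torch.randn(28, 1024, 128), torch.randn(28, 1024, 128)
scores = torch.rand(28, 1024)  # random stand-in for gate scores
k_small, v_small = evict_kv(k, v, scores, budget_ratio=0.3)
print(k_small.shape)  # torch.Size([28, 307, 128])
```

Re-sorting the kept indices matters because the pruned cache is reused with positional information intact, so the surviving entries should stay in their original order.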

Inference Efficiency

Prefill and decoding efficiency of the Qwen2.5-7B-1M model using a 30% KV budget ratio with PyTorch and FlashAttention-2 on a single H100 GPU. Points in the plot correspond to context lengths of 160K, 240K, and 320K.
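For readers who want to reproduce this split, the sketch below times prefill and per-token decoding separately using standard PyTorch CUDA-event timing and the Hugging Face API. It is an assumed setup, not the paper's benchmarking harness, and it times the dense (uncompressed) baseline; the model ID and prompt are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-7B-Instruct-1M"  # assumed Hugging Face model ID
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda().eval()

ids = tok("<your long context here>", return_tensors="pt").input_ids.cuda()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

with torch.no_grad():
    # Prefill: one forward pass over the full context, building the KV cache.
    start.record()
    out = model(ids, use_cache=True)
    end.record()
    torch.cuda.synchronize()
    print(f"prefill: {start.elapsed_time(end):.1f} ms")

    # Decoding: single-token forward passes reusing the cached KV.
    next_id = out.logits[:, -1:].argmax(-1)
    past = out.past_key_values
    n_steps = 32
    start.record()
    for _ in range(n_steps):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1:].argmax(-1)
    end.record()
    torch.cuda.synchronize()
    print(f"decode: {start.elapsed_time(end) / n_steps:.2f} ms/token")
```

Timing the two phases separately reflects the plot's setup: prefill cost grows with context length, while decoding cost depends mainly on the size of the retained KV cache.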

KV Cache Compression Performance

Prefill-intensive benchmark results with Qwen2.5-7B-Instruct-1M.
Decoding-intensive benchmark results with Qwen2.5-7B-Instruct-1M.

BibTeX

@article{kim2026fastkvzip,
  title   = {Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction},
  author  = {Kim, Jang-Hyun and Han, Dongyoon and Yun, Sangdoo},
  journal = {arXiv preprint arXiv:2601.17668},
  year    = {2026}
}