Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction

NAVER AI Lab
Figure: Computational flow of Fast KVzip.

What is New?

  • Fast KVzip trains a lightweight gating mechanism for KV cache compression across both prefill and decoding stages.
  • Our method achieves near-lossless performance on general tasks with up to a 70% KV cache eviction ratio while significantly improving attention efficiency.
  • We propose a Low-Rank Sink Attention architecture for the gates and directly distill KVzip’s importance scores to train them in under one H100 GPU-hour (see the sketch after this list).
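
To make the gating and distillation ideas above concrete, here is a minimal PyTorch sketch, assuming the gate scores each cached KV pair via a low-rank key projection attended by a learned sink query, and is trained by regressing onto KVzip's importance scores. The module name, the MSE objective, and the per-head score granularity are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankSinkGate(nn.Module):
    """Illustrative low-rank gate that scores cached KV pairs.

    Assumption: a learned "sink" query attends to low-rank projections
    of the cached keys, and the resulting logits act as importance
    scores. The paper's Low-Rank Sink Attention may differ in detail.
    """

    def __init__(self, head_dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(head_dim, rank, bias=False)  # low-rank key projection
        self.sink = nn.Parameter(torch.randn(rank) / rank ** 0.5)  # learned sink query

    def forward(self, keys: torch.Tensor) -> torch.Tensor:
        # keys: [num_kv_heads, seq_len, head_dim] -> scores: [num_kv_heads, seq_len]
        return torch.einsum("hsr,r->hs", self.down(keys), self.sink)

def distill_step(gate, keys, teacher_scores, optimizer):
    """One distillation step: regress the gate's scores onto KVzip's
    importance scores (teacher_scores: [num_kv_heads, seq_len])."""
    pred = gate(keys)
    loss = F.mse_loss(pred, teacher_scores)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Given such scores, eviction reduces to a per-head top-k selection under the KV budget, as sketched in the next section.
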
Inference Efficiency

Prefill and decoding efficiency of Qwen2.5-7B-1M using a 30% KV budget ratio, measured with PyTorch and FlashAttention-2 on an H100 GPU. Points in the plot correspond to context lengths of 160K, 240K, and 320K. KVzip achieves a decoding speed similar to that of Fast KVzip.
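
As a rough illustration of how a 30% KV budget ratio could be applied, the sketch below keeps the top-scoring fraction of cached entries per KV head and gathers the compressed cache; the function name and the per-head selection granularity are assumptions for illustration, not the released implementation.

```python
import torch

def evict_kv(keys, values, scores, budget_ratio=0.3):
    """Keep the top `budget_ratio` fraction of KV pairs per head.

    keys/values: [num_kv_heads, seq_len, head_dim]
    scores:      [num_kv_heads, seq_len] gate importance scores
    Returns compressed tensors of shape [num_kv_heads, kept, head_dim].
    """
    num_kv_heads, seq_len, head_dim = keys.shape
    kept = max(1, int(seq_len * budget_ratio))
    # Select the highest-scoring positions, then restore temporal order.
    idx = scores.topk(kept, dim=-1).indices.sort(dim=-1).values
    idx = idx.unsqueeze(-1).expand(-1, -1, head_dim)
    return keys.gather(1, idx), values.gather(1, idx)
```

Shrinking the cache to 30% of its original length directly cuts attention reads and compute during decoding, which is where the speedups in the plot come from.
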

KV Cache Compression Performance

Figure: Prefill-intensive benchmark results with Qwen2.5-7B-Instruct-1M.
Figure: Decoding-intensive benchmark results with Qwen2.5-7B-Instruct-1M.

KV Importance Score Visualization

Figure: Interactive heatmaps of KV importance scores from Fast KVzip and KVzip, shown on a color scale from 0 to 10.
We visualize Qwen2.5-7B-Instruct-1M, which has 28 layers and 4 KV heads per layer.
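
For readers who want to reproduce a similar view offline, here is a hedged matplotlib sketch that renders per-layer, per-head importance scores as heatmaps. The tensor shape and the 0-to-10 color scale follow the figure above; the random scores are a stand-in, since the real values would come from the gate or from KVzip scoring.

```python
import matplotlib.pyplot as plt
import numpy as np

# Stand-in scores: [num_layers=28, num_kv_heads=4, seq_len] importance values.
scores = np.random.rand(28, 4, 512) * 10

fig, axes = plt.subplots(28, 1, figsize=(8, 28), sharex=True)
for layer, ax in enumerate(axes):
    # One panel per layer; the 4 rows inside each panel are the KV heads.
    im = ax.imshow(scores[layer], aspect="auto", vmin=0, vmax=10, cmap="viridis")
    ax.set_ylabel(f"L{layer}", rotation=0, labelpad=15)
    ax.set_yticks([])
fig.colorbar(im, ax=axes, label="importance score")
fig.supxlabel("token position")
plt.savefig("kv_importance.png", dpi=150)
```
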

    BibTeX

@article{kim2026fastkvzip,
  title={Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction},
  author={Kim, Jang-Hyun and Han, Dongyoon and Yun, Sangdoo},
  journal={arXiv preprint arXiv:2601.17668},
  year={2026}
}