KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction

Seoul National University, NAVER AI Lab
Our approach identifies the importance of KV pairs through context reconstruction and evicts low-importance pairs.
Run the demo in our GitHub repository!

Key benchmark results in the query-agnostic setting with Qwen2.5-7B-Instruct-1M.

What is New?

KVzip compresses the KV cache to support diverse future queries. We support two use cases:
  • Context-dependent eviction that achieves a 3–4× reduction in KV cache size and a 2× decrease in decoding latency with minimal performance loss. This approach incurs a per-context compression overhead (roughly 2× the prefilling cost) and is particularly advantageous for cache-retrieval systems.
  • Context-independent eviction with a one-time optimization overhead per model, completed within one minute (100× faster than DuoAttention). This method achieves approximately 2× compression.

Problem

Our objective is to enhance the inference efficiency of Transformer-based LLMs. To begin, we outline the knowledge hierarchy in Transformers: some knowledge is encoded in the model weights, while contextual information is processed and represented as key-value (KV) pairs. Caching these KV pairs avoids redundant computation and thus speeds up inference, at the expense of increased storage.

Figure 1. Knowledge hierarchy in Transformer LLMs.

This hierarchy is analogous to the human memory system, in which we either search for textual information externally or recall it internally from memory. Some studies attribute such memorization to synaptic consolidation, analogous to knowledge encoded in neural weights. However, the Transformer memory system is inefficient: Transformers require tens of GBs of KV cache storage for merely 1 MB of text. Our work begins by identifying redundancies in the Transformer memory system and leveraging these observations to achieve more succinct knowledge representations.
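
To make the storage cost concrete, the sketch below prefills a context with a HuggingFace transformers model and measures the size of the resulting KV cache. It is a minimal illustration, not part of the KVzip codebase; the model name and document path are placeholders.

```python
# Minimal sketch (not the KVzip implementation): prefill a context once and
# measure the memory footprint of the resulting KV cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

context = open("document.txt").read()  # placeholder long document (~1 MB of text)
ctx_ids = tok(context, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    out = model(ctx_ids, use_cache=True)  # prefilling builds the KV cache
cache = out.past_key_values

# Cache size = 2 (keys and values) x layers x KV heads x seq_len x head_dim x bytes/element.
# Iterating the cache as (key, value) pairs per layer works for both the legacy tuple
# format and newer Cache objects, which keep this interface for backward compatibility.
num_bytes = sum(k.numel() * k.element_size() + v.numel() * v.element_size() for k, v in cache)
print(f"{ctx_ids.shape[1]} context tokens -> {num_bytes / 1e9:.2f} GB of KV cache")
```

The `ctx_ids` and `cache` built here are reused in the scoring sketch further below.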

Previous Approaches

We focus on training-free eviction algorithms for KV cache compression. Most methods assign importance scores that determine the eviction order, typically using attention scores computed during prefilling or decoding. These approaches are efficient because they rely only on attention scores produced as a by-product of inference. However, relying on the currently available attention scores limits compression capability, because those scores are inherently biased toward the queries being processed at that moment.
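
The sketch below illustrates this family of score-based eviction in a simplified form; it is written in the spirit of methods such as SnapKV rather than reproducing any specific algorithm, and the shapes, observation-window size, and 30% keep ratio are illustrative. Ranking KV pairs by the attention received from a small window of recent query tokens is exactly where the query bias comes from.

```python
# Simplified sketch of attention-score-based KV eviction (illustrative of the general
# approach, not the exact algorithm of SnapKV or any other specific method).
import torch

def evict_by_attention(keys, values, attn, keep_ratio=0.3):
    """keys, values: [num_heads, seq_len, head_dim]
    attn: [num_heads, num_observed_queries, seq_len] attention weights from prefilling.
    Returns the retained keys and values, ordered by position."""
    importance = attn.mean(dim=1)                     # attention received per KV pair
    k = max(1, int(keep_ratio * keys.shape[1]))
    keep_idx = importance.topk(k, dim=-1).indices.sort(dim=-1).values
    gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, keys.shape[-1])
    return torch.gather(keys, 1, gather_idx), torch.gather(values, 1, gather_idx)

# Toy usage with random tensors.
H, T, D = 8, 163, 128
keys, values = torch.randn(H, T, D), torch.randn(H, T, D)
attn = torch.softmax(torch.randn(H, 16, T), dim=-1)  # scores from the last 16 query tokens
kept_k, kept_v = evict_by_attention(keys, values, attn)
print(kept_k.shape)  # torch.Size([8, 48, 128])
```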

Figure 2. KV eviction frameworks for multi-query settings.

Previous approaches, including SnapKV, typically perform query-aware eviction. The figure above illustrates three strategies for multi-query scenarios with such methods: (a) repeated prefilling and eviction achieves good accuracy but incurs substantial prefilling cost (Figure 3, green); (b) reusing a query-aware compressed KV cache significantly reduces accuracy (Figure 3, blue); and (c) our work explores a query-agnostic approach that compresses only the contextual information. KVzip achieves strong compression performance even though queries are unavailable at compression time (Figure 3, red).

Figure 3. Accuracy on the SQuAD multi-query dataset.

Intuition

The primary research question is how to identify the importance of KV pairs for future queries. This is particularly challenging due to the complexity and uncertainty of those queries. Our intuition starts from the constraint that the LLM knowledge system (weights + KV cache) must retain the entire contextual information to answer arbitrary queries.

Figure 4. Transformer LLM as a context encoder-decoder.

We propose a prompting-based approach to evaluate the information completeness of the compressed LLM knowledge system. Specifically, we prompt the LLM with "Repeat the previous context." given the compressed KV cache. Perfect context reconstruction indicates lossless inference; in the extreme case, we could even re-prefill the KV cache from the reconstructed context. Our experiments empirically confirm that this context reconstruction process uncovers a sparse attention structure within the KV cache, which generalizes effectively to diverse downstream tasks.

Figure 5. KVzip procedure through an LLM forward pass.

Fortunately, we can simulate the reconstruction process with teacher-forced decoding using LLM forward passes. By constructing repeated inputs, we compute the maximum attention score received by each KV pair in the cache and evict the pairs with low scores. Our method supports arbitrary eviction structures and scales efficiently to long-context scenarios by incorporating additional techniques.
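
A minimal sketch of this scoring step follows, continuing from the `ctx_ids` and `cache` built in the prefilling sketch above. It assumes eager attention so that attention weights are returned from the forward pass; the repeat-prompt wording, variable names, and 70% eviction ratio are illustrative and do not reflect the repository's actual API.

```python
# Sketch of scoring via teacher-forced context reconstruction (illustrative, not the
# repository's implementation). Requires attn_implementation="eager" when loading the
# model so that output_attentions returns attention weights.
import torch

def score_context_kv(model, tok, ctx_ids, cache):
    """Max attention received by each cached context KV pair during reconstruction.

    Note: the forward pass extends `cache` in place; copy it first if it will be reused.
    """
    repeat_ids = tok("\nRepeat the previous context:", return_tensors="pt").input_ids.to(ctx_ids.device)
    forced_ids = torch.cat([repeat_ids, ctx_ids], dim=-1)    # teacher-forced repetition
    ctx_len = ctx_ids.shape[1]
    with torch.no_grad():
        out = model(forced_ids, past_key_values=cache, use_cache=True, output_attentions=True)
    scores = []
    for attn in out.attentions:          # [1, query_heads, forced_len, ctx_len + forced_len]
        recv = attn[0, :, :, :ctx_len]   # attention received by the cached context KV pairs
        scores.append(recv.amax(dim=1))  # max over query positions -> [query_heads, ctx_len]
    # With grouped-query attention, one would additionally take the max over the query
    # heads sharing each KV head before thresholding.
    return torch.stack(scores)           # [layers, query_heads, ctx_len]

def eviction_mask(scores, evict_ratio=0.7):
    """Boolean mask over (layer, head, position); True means the KV pair is kept."""
    flat = scores.flatten()
    threshold = flat.kthvalue(max(1, int(evict_ratio * flat.numel()))).values
    return scores > threshold
```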

Observation

What makes KVzip effective? We observe that KVzip uncovers greater sparsity in the KV pairs and derives attention patterns that generalize well to downstream tasks. In Figure 6, we visualize the maximum attention scores received by the context KV pairs (KVc in Figure 5) across diverse inputs and compare them to the scores obtained during prefilling.

One interesting observation is that the attention pattern from repetition (Figure 6-a) overlaps with the patterns of downstream tasks (Figure 6-b,c). We observe this property empirically across diverse tasks, including summarization, reasoning, and even reversal of the context. This result is practically valuable: context reconstruction effectively predicts KV utilization and identifies KV pairs that are redundant for downstream tasks.

Why do these overlaps and this sparsity exist? We find hints by comparing against the attention patterns observed during prefilling. In Figure 6-d, attention is denser than during context reconstruction. This indicates that during prefilling the model interacts densely with KV pairs to derive contextualized features, whereas during decoding it leverages the resulting high-level features. This explains why KVzip outperforms previous eviction methods that rely on attention patterns obtained during prefilling.
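
This overlap can be checked with a simple coverage metric, sketched below: given max-attention maps from reconstruction and from a downstream task (for example, produced by the scoring sketch above), it reports what fraction of the KV pairs the downstream task actually attends to would survive reconstruction-based eviction. The keep ratio and attention threshold are illustrative values, not numbers from the paper.

```python
# Sketch of a recall-style coverage metric (an illustrative check, not a measurement
# reported in the paper).
import torch

def coverage(recon_scores, task_scores, keep_ratio=0.3, task_threshold=0.1):
    """recon_scores, task_scores: [layers, heads, ctx_len] max-attention maps."""
    k = max(1, int(keep_ratio * recon_scores.numel()))
    keep = torch.zeros(recon_scores.numel(), dtype=torch.bool)
    keep[recon_scores.flatten().topk(k).indices] = True   # KV pairs retained under reconstruction scores
    needed = task_scores.flatten() > task_threshold       # KV pairs the downstream task attends to
    return (keep & needed).sum().item() / max(needed.sum().item(), 1)
```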

Figure 6. Maximum attention scores received by the KV pairs of a cached SQuAD context. We visualize layer 8 of LLaMA3.1-8B. The context length is 163 tokens, and the number of KV heads is 8.


Table 1. Inputs used for attention computation in Figure 6 (SQuAD). We first prefill the context and then process these inputs to obtain the attention scores over the prefilled KV pairs.


These are the key features of KVzip! Please check out our paper and GitHub code for more details.

BibTeX

@article{kim2025kvzip,
  title={KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction},
  author={Kim, Jang-Hyun and Kim, Jinuk and Kwon, Sangwoo and Lee, Jae W and Yun, Sangdoo and Song, Hyun Oh},
  journal={arXiv preprint arXiv:2505.23416},
  year={2025}
}