Figure 1. Knowledge hierarchy in Transformer LLMs.
Figure 2. KV eviction frameworks for multi-query settings.
Figure 3. Accuracy on the SQuAD multi-query dataset.
Figure 4. Transformer LLM as a context encoder-decoder.
Figure 5. KVzip procedure through an LLM forward pass.
An interesting observation is that the attention pattern induced by repeating the context (Figure 6-a) overlaps with the attention patterns of downstream tasks (Figure 6-b,c). We observe this property empirically across diverse tasks, including summarization, reasoning, and even reversing the context. This is practically valuable: context reconstruction effectively predicts KV utilization and identifies KV pairs that are redundant for downstream tasks.
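To make this concrete, below is a minimal sketch (not the official KVzip implementation) of how one could score cached KV pairs by the maximum attention they receive while the model is teacher-forced to repeat the cached context. The model checkpoint, the repeat-prompt wording, and the per-position (rather than per-KV-head) scoring are simplifying assumptions for illustration.

```python
# Sketch: score cached KV positions via a context-reconstruction (repeat) pass.
# Assumptions: HuggingFace transformers, eager attention (so attention weights
# are returned), and a simplified per-(layer, position) score; the paper scores
# per KV head, which would require grouping query heads under GQA.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, attn_implementation="eager"
)

context = "..."  # the document to cache (placeholder)
ctx_ids = tok(context, return_tensors="pt").input_ids
ctx_len = ctx_ids.shape[1]

# 1) Prefill: build the KV cache for the context.
with torch.no_grad():
    prefill = model(ctx_ids, use_cache=True)
cache = prefill.past_key_values

# 2) Reconstruction pass: teacher-force the model to repeat the cached context
#    and record the attention each cached KV position receives. The KV entries
#    appended by this pass can be discarded afterwards.
repeat_ids = tok("Repeat the previous context:", return_tensors="pt").input_ids
query_ids = torch.cat([repeat_ids, ctx_ids], dim=1)
with torch.no_grad():
    out = model(
        query_ids,
        past_key_values=cache,
        use_cache=True,
        output_attentions=True,
    )

# 3) Importance of each cached KV position = max attention over heads and
#    query positions, restricted to the context portion of the keys.
scores = torch.stack(
    [attn[0, :, :, :ctx_len].amax(dim=(0, 1)) for attn in out.attentions]
)  # shape: (num_layers, ctx_len)
print(scores.shape)
```

Low-scoring positions under this reconstruction-based score are the candidates for eviction, since the overlap in Figure 6 suggests downstream queries rarely attend to them either.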
Why do these overlaps and this sparsity exist? We find a hint by comparing with the attention patterns observed during prefilling. In Figure 6-d, attention is noticeably denser than during context reconstruction. This indicates that during prefilling the model interacts densely with KV pairs to derive contextualized features, whereas during decoding it relies on the resulting high-level features. This also explains why KVzip outperforms previous eviction methods that rely on attention patterns obtained during prefilling.
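The density gap can be quantified with a small sketch like the one below: it measures what fraction of cached KV positions receive attention above a threshold under two scoring schemes (prefill vs. reconstruction) and builds a keep-mask for eviction. The threshold, keep ratio, and placeholder score tensors are illustrative assumptions, not values from the paper.

```python
# Sketch: compare attention "density" of two scoring schemes and build an
# eviction keep-mask. Replace the placeholder tensors with real per-position
# max attention scores gathered during prefilling and during reconstruction.
import torch

def density(scores: torch.Tensor, thresh: float = 0.02) -> float:
    """Fraction of cached KV positions whose max received attention exceeds thresh."""
    return (scores > thresh).float().mean().item()

def keep_mask(scores: torch.Tensor, keep_ratio: float = 0.3) -> torch.Tensor:
    """Boolean mask keeping the top keep_ratio fraction of positions per layer."""
    k = max(1, int(scores.shape[-1] * keep_ratio))
    idx = scores.topk(k, dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(-1, idx, True)
    return mask

# Placeholder scores, shape (num_layers, ctx_len); reconstruction scores are
# made artificially sparser here purely for illustration.
prefill_scores = torch.rand(32, 163)
reconstruction_scores = torch.rand(32, 163) ** 4

print("prefill density:", density(prefill_scores))
print("reconstruction density:", density(reconstruction_scores))
print("kept KV pairs:", keep_mask(reconstruction_scores).sum().item())
```

Under this view, prefill-based scores spread importance over many positions, while reconstruction-based scores concentrate it on the few KV pairs that decoding actually reuses.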
Figure 6. Maximum attention scores received by the KV pairs of a cached SQuAD context. We visualize the 8th layer of LLaMA3.1-8B; the context is 163 tokens long and the layer has 8 KV heads.
Table 1. Inputs used for the attention computation in Figure 6 (SQuAD). We first prefill the context and then process these inputs to obtain attention scores over the prefilled KV pairs.
@article{kim2025kvzip,
  title={KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction},
  author={Kim, Jang-Hyun and Kim, Jinuk and Kwon, Sangwoo and Lee, Jae W and Yun, Sangdoo and Song, Hyun Oh},
  journal={arXiv preprint arXiv:2505.23416},
  year={2025}
}