R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
Shrink the cache, keep the brains.
R-KV discards repetitive tokens on-the-fly, delivering full-accuracy reasoning with only a fraction of the memory.
Use the following command to install the minimal required dependencies:
```bash
pip install -r requirements.txt
```
If you’re using Hugging Face Transformers, we default to FlashAttention-2 to speed up attention computation:
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "model_name_or_path",
    attn_implementation="flash_attention_2",
)
```
You need to build the dependencies of the evaluation toolkit separately:
```bash
cd evaluation/latex2sympy
pip install -e .
cd ..
pip install -r requirements.txt
```
Before running the scripts, you need to build the rkv package:
```bash
pip install -e .
```
Use the following command to run R1-like models with R-KV on math benchmarks:
```bash
bash examples/run.sh
```
Alternatively, run the Python script directly:
```bash
export CUDA_VISIBLE_DEVICES=0
python3 ./run_math.py \
    --dataset_path ./data/aime24.jsonl \
    --save_path ./outputs/output.jsonl \
    --model_path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --max_length 16384 \
    --eval_batch_size 1 \
    --method rkv \
    --kv_budget 128
```
To evaluate benchmark results, simply run:
```bash
bash examples/eval.sh
```
The results will be saved in the `outputs` directory.
Large language models that rely on chain-of-thought (CoT) or self-reflection can crack tough reasoning tasks—but at the cost of very long outputs that bloat the key–value (KV) cache during inference.
Traditional cache-compression schemes, tuned for long prompts, stumble on these generated traces, retaining only ~60 % of the original accuracy when restricted to 10 % of the cache.
R-KV — Redundancy-aware KV-cache compression for Reasoning models — solves this by ranking tokens on-the-fly for both importance and non-redundancy, retaining only the informative, diverse ones.
Metric | Full KV | R-KV (ours) |
---|---|---|
KV cache kept | 100 % | 10 % |
Accuracy | 100 % | ≈100 % |
Throughput ↑ | 1× | 6.6× |
Memory saved ↓ | – | 90 % |
At 16 % cache, noise removal even boosts accuracy to 105 % of the full baseline.
Chain-of-thought (CoT) and self-reflection unlock impressive reasoning, but they explode the key–value (KV) cache.
A single DeepSeek-R1-Distill-8B run on a tough math problem can easily produce tens of thousands of tokens (AIME-24 solutions average ~15.5 K tokens), and the KV cache grows with every one of them.
Existing compression tools focus on long prompts and falter on long generations—often pruning the wrong tokens because redundant self-checks still attend heavily to themselves.
R-KV (Redundancy-aware KV cache compression for R1-style models) is the first production-level KV cache compression method for serving R1-like models.
Large language models (LLMs) waste a surprising amount of memory on redundant key/value (KV) tokens during long-form reasoning.
R-KV tackles this by compressing the KV cache on-the-fly while the model is decoding, keeping only tokens that are important and non-redundant.
Stage | What happens | Key idea |
---|---|---|
1. Decoding-time KV staging | Newly generated tokens are first written to a staging buffer (`B_buffer`). | A separate buffer lets us decide what to keep after seeing a chunk of text. |
2. Importance scoring | Use attention weights from the last α observation tokens to score each candidate token. | High attention ⇒ token is critical for future predictions. |
3. Redundancy estimation | Compute cosine similarity between key vectors; for each token, keep at most β most-recent highly similar neighbors and mark older near-duplicates as redundant. | Keeps semantics while pruning repetition. |
4. Joint selection | Final score `Z = λ·Importance − (1−λ)·Redundancy`. Top-k tokens + α observation tokens are retained in the budgeted cache (`B_budget`). | One knob (`λ`) trades memory vs. quality. |
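The selection step boils down to a per-token score. Below is a minimal sketch, assuming importance is the mean attention a token receives from the last α observation tokens and redundancy is its mean cosine similarity to the other candidate keys (a simplification of the β-neighbor rule above; the actual `rkv` implementation may differ):

```python
import torch
import torch.nn.functional as F

def joint_select(keys: torch.Tensor, attn: torch.Tensor, budget: int, lam: float = 0.1) -> torch.Tensor:
    """Return the indices of the `budget` tokens to keep.

    keys: [num_tokens, head_dim]  key vectors of the candidate tokens
    attn: [alpha, num_tokens]     attention from the last alpha observation tokens
    """
    # Importance: average attention each candidate receives from the observation window.
    importance = attn.mean(dim=0)                    # [num_tokens]

    # Redundancy: mean cosine similarity to every other candidate key.
    k = F.normalize(keys, dim=-1)
    sim = k @ k.T                                    # [num_tokens, num_tokens]
    sim.fill_diagonal_(0.0)
    redundancy = sim.mean(dim=-1)                    # [num_tokens]

    # Joint score Z = lambda * Importance - (1 - lambda) * Redundancy; keep the top-budget tokens.
    z = lam * importance - (1.0 - lam) * redundancy
    keep = torch.topk(z, k=budget).indices
    return torch.sort(keep).values                   # preserve positional order in the cache
```

With λ = 0.1 (the value used in our experiments), most of the weight falls on the redundancy penalty, which suits long, repetitive reasoning traces.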
Our method surpasses the baselines by a large margin on challenging math benchmarks and, surprisingly, even outperforms full KV.
We benchmark two distilled DeepSeek-R1 variants:
Nickname | Checkpoint |
---|---|
R1-Llama-8B | deepseek-ai/DeepSeek-R1-Distill-Llama-8B |
R1-Qwen-14B | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B |
Benchmark | Tokens / sol. (avg) | Max gen len |
---|---|---|
MATH-500 | 2,979 | 16,384 |
AIME 2024 | 15,536 | 32,768 |
Symbol | Meaning | Value |
---|---|---|
`B_buffer` | staging-buffer size | 128 |
`α` | # observation tokens for attention probe | 8 |
`λ` | importance / redundancy trade-off | 0.1 (see § λ-sweep) |
Sampling: temperature 0.6, top-p 0.95, 64 candidates per problem; we report pass@1.
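For reference, this sampling configuration maps onto the standard Hugging Face `generate` call roughly as follows (a sketch only; the prompt is a placeholder, one of 64 samples per problem is shown, and in practice `run_math.py` handles generation):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "What is the sum of the first 100 positive integers?"  # placeholder problem
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Temperature 0.6, top-p 0.95; the evaluation draws 64 such samples per problem for pass@1.
output = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    max_new_tokens=16384,   # MATH-500 setting; AIME-24 uses 32768
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```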
Pass@1 vs. KV-cache budget on MATH-500 and AIME-24.
R-KV attains parity with 10–34 % cache and even beats the FullKV baseline at 10 %.
Model | Dataset | Lossless @ Ratio | Lossless @ Fixed tokens |
---|---|---|---|
R1-Llama-8B | MATH-500 | 34 % | 1,024 |
R1-Llama-8B | AIME-24 | 10 % | 1,536 |
R1-Qwen-14B | MATH-500 | 54 % | 1,536 |
R1-Qwen-14B | AIME-24 | 25 % | 3,072 |
At 16 % (Llama-8B) or 33 % (Qwen-14B) cache, R-KV reaches 105 % of FullKV accuracy—evidence that trimming redundant tokens can improve reasoning quality.
R-KV allocates two fixed-size buffers—one for the retained KV cache and one for freshly generated tokens.
Because these buffers stay the same size regardless of sequence length, memory usage remains constant, unlike FullKV, whose memory grows linearly with the sequence.
The two buffers are:
Buffer | Purpose | Shape |
---|---|---|
KV budget | Retained tokens | b × B_budget × L × H × d |
KV buffer | Fresh tokens | b × B_buffer × L × H × d |
where `b` = batch size, `L` = # layers, `H` = # heads, `d` = head dim.
An optional query cache keeps only the last α query states.
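As a back-of-the-envelope check on the constant-footprint claim, here is the per-sequence cache size, assuming Llama-3.1-8B-style dimensions for R1-Llama-8B (32 layers, 8 KV heads, head dim 128, FP16); these dimensions are an assumption, so substitute your model's values:

```python
def kv_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
             head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Per-sequence KV-cache size in bytes: keys + values across all layers."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

# FullKV grows linearly with the generation ...
print(f"{kv_bytes(16_384) / 2**30:.2f} GiB")        # ~2.00 GiB at 16 K tokens

# ... while R-KV is capped at B_budget + B_buffer tokens, regardless of length.
print(f"{kv_bytes(1_024 + 128) / 2**30:.2f} GiB")   # ~0.14 GiB
```

This constant per-sequence footprint is what lets R-KV pack far larger batches onto the same GPU, which is where the throughput gains reported below come from.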
R-KV adds a lightweight scoring pass for importance and redundancy, but the extra FLOPs are quickly offset because attention now runs over a much smaller, compressed KV cache.
As sequences get longer, the cost/benefit curve tips even further in R-KV’s favor (details in the paper’s appendix).
Stage | Complexity |
---|---|
Importance scoring | O(α × B_budget) |
Redundancy scoring | O(B_budget²) |
Attention (compressed) | O((B_budget + B_buffer) × B_buffer) |
Attention (FullKV) | O(B_full × B_buffer) |
For long sequences (`B_full` ≫ `B_budget`), the tiny cache more than offsets the scoring overhead.
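Plugging representative numbers into the table above (B_budget = 1024, B_buffer = 128, α = 8, 16 K-token generation) gives a rough per-chunk operation count; this ignores common factors such as head dimension and layer count, so treat it as an order-of-magnitude sketch:

```python
B_budget, B_buffer, B_full, alpha = 1024, 128, 16_384, 8

# Extra work R-KV does per flushed chunk of B_buffer tokens.
importance = alpha * B_budget                    # ~8.2 K
redundancy = B_budget ** 2                       # ~1.0 M

# Attention over the compressed cache vs. the full cache for the same chunk.
attn_rkv = (B_budget + B_buffer) * B_buffer      # ~147 K
attn_full = B_full * B_buffer                    # ~2.1 M

print(attn_full / (attn_rkv + importance + redundancy))   # ~1.7; the gap widens as B_full grows
```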
We measured both memory savings and end-to-end throughput (see the efficiency table in the paper):
Output Len | KV budget (ratio) | Batch Gain | TPS Gain |
---|---|---|---|
8 K | 54 % | 1.7 × | 1.5 × |
8 K | 10 % | 7.7 × | 4.5 × |
16 K | 54 % | 1.5 × | 1.7 × |
16 K | 10 % | 9.0 × | 6.6 × |
Output Len | B_budget (tokens) | Batch Gain | TPS Gain |
---|---|---|---|
8 K | 1024 | 6.5 × | 3.8 × |
8 K | 1536 | 4.6 × | 3.0 × |
16 K | 1024 | 13.4 × | 9.2 × |
16 K | 1536 | 9.6 × | 7.1 × |
TPS = tokens per second. Throughput scales almost linearly with batch until compute saturation (~128 sequences on an A100).
Full benchmarking scripts and raw logs are provided; the detailed results are summarized in the table below.
Gen. Length | Method | Budget | Mem. Saving (%) | Batch | Throughput (tok/s) | Tokens Gen. | Dec. Time (s) |
---|---|---|---|---|---|---|---|
8 K | FullKV | – | – | 1 | 75.44 | 8,094 | 107.30 |
8 K | FullKV | – | – | 62 (max) | 849.13 | 501,828 | 590.99 |
8 K | R-KV | Fixed – 1024 | 87.50 | 1 | 80.46 | 8,094 | 100.60 |
8 K | R-KV | Fixed – 1024 | 87.50 | 402 (max) | 3,251.52 | 3,253,788 | 1,000.70 |
8 K | R-KV | Fixed – 1536 | 81.25 | 287 (max) | 2,525.75 | 6,546,972 | 919.72 |
8 K | R-KV | Fixed – 3072 | 62.50 | 150 (max) | 1,520.99 | 1,214,100 | 798.23 |
8 K | R-KV | Ratio – 10 % – 819 | 90.00 | 479 (max) | 3,809.15 | 3,877,026 | 1,017.82 |
8 K | R-KV | Ratio – 34 % – 2,785 | 66.00 | 167 (max) | 1,608.01 | 1,351,698 | 840.61 |
8 K | R-KV | Ratio – 54 % – 4,423 | 46.00 | 105 (max) | 1,257.83 | 849,870 | 675.66 |
16 K | FullKV | – | – | 1 | 69.41 | 16,286 | 234.65 |
16 K | FullKV | – | – | 30 (max) | 347.03 | 488,580 | 1,407.89 |
16 K | R-KV | Fixed – 1024 | 93.75 | 1 | 80.95 | 16,286 | 201.18 |
16 K | R-KV | Fixed – 1024 | 93.75 | 402 (max) | 3,188.82 | 6,546,972 | 2,053.10 |
16 K | R-KV | Fixed – 1536 | 90.63 | 287 (max) | 2,447.61 | 4,674,082 | 1,909.65 |
16 K | R-KV | Fixed – 3072 | 81.25 | 150 (max) | 1,406.28 | 2,442,900 | 1,737.13 |
16 K | R-KV | Ratio – 10 % – 1,638 | 90.00 | 271 (max) | 2,300.28 | 4,413,506 | 1,918.68 |
16 K | R-KV | Ratio – 34 % – 5,570 | 66.00 | 82 (max) | 797.43 | 1,335,452 | 1,674.70 |
16 K | R-KV | Ratio – 54 % – 8,847 | 46.00 | 46 (max) | 584.77 | 749,156 | 1,281.12 |
The figure below shows which tokens are picked by R-KV and the pure-attention baseline SnapKV at the same decoding step.
Grey = not selected | Light orange → Dark red = selected tokens (deeper red = chosen by more attention heads)
- **Broader Coverage:** With its hybrid score (attention × redundancy suppression), R-KV selects tokens that are spread across the whole output, preserving critical context cues.
- **Higher Information Diversity:** Unlike SnapKV, which prefers tokens near the query and often selects the same local tokens repeatedly, R-KV captures valuable pieces of information at diverse positions.
- **Significantly Less Redundancy:** SnapKV still picks distant, low-value segments (e.g., “3 students are leaving early.”, “But in the initial”), while R-KV’s redundancy check filters most of these out.
By combining attention strength with redundancy filtering, R-KV retains the important context and removes noise, successfully completing the task.
In this example, the pure attention strategy of SnapKV fails due to limited coverage and excess redundancy.
We implement visualization functions to help illustrate the multi-step token eviction pattern.
Run `analysis_scripts/analysis.ipynb` to see which tokens are kept at each compression step.
If you find R-KV useful in your research, please cite us:
```bibtex
@article{cai2025r,
  title={R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration},
  author={Cai, Zefan and Xiao, Wen and Sun, Hanshi and Luo, Cheng and Zhang, Yikai and Wan, Ke and Li, Yucheng and Zhou, Yeyang and Chang, Li-Wen and Gu, Jiuxiang and others},
  journal={arXiv preprint arXiv:2505.24133},
  year={2025}
}
```