2025-11-14

Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving

Authors: Hui Zeng, Daming Zhao, Pengfei Yang, WenXuan Hou, Tianyang Zheng, Hui Li, Weiye Ji, Jidong Zhai

2025-11-08

http://arxiv.org/abs/2511.06029v2

Generative reasoning with large language models (LLMs) often involves long output sequences, leading to substantial memory and latency overheads from accumulating key-value (KV) caches. While existing KV cache methods primarily focus on reducing KV memory from long input sequences, they fall short in addressing the dynamic and layer-sensitive nature of long-form generation, which is central to reasoning tasks. We propose Lethe, a dynamic KV cache management framework that introduces adaptivity along both the spatial and temporal dimensions of cache pruning. Along the spatial dimension, Lethe performs layerwise, redundancy-aware allocation, assigning token cache budgets to each layer based on estimated attention redundancy. Along the temporal dimension, Lethe conducts multi-round token pruning during generation, driven by a Recency-Aware Selective Retention (RASR) mechanism. RASR extends traditional recency-based heuristics by also considering token relevance derived from evolving attention patterns, enabling informed decisions about which tokens to retain or evict. Empirical results demonstrate that Lethe achieves a favorable balance between efficiency and generation quality across diverse models and tasks, and increases throughput by up to 2.56x.
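The two adaptive dimensions the abstract describes can be sketched in code. The following is a minimal illustration, not the authors' implementation: all function names, the multiplicative recency-times-relevance score, and the proportional budget split are assumptions made for clarity. It shows (a) a layerwise budget allocator that gives more KV cache budget to layers with less attention redundancy, and (b) a RASR-style retention rule that blends per-token relevance (attention received) with a recency decay.

```python
import numpy as np

def allocate_layer_budgets(redundancy, total_budget):
    """Hypothetical layerwise allocation: layers whose attention is
    estimated to be more redundant receive a smaller share of the
    total token cache budget."""
    weights = 1.0 - np.asarray(redundancy, dtype=float)
    weights = weights / weights.sum()          # normalize shares
    return np.floor(weights * total_budget).astype(int)

def rasr_retention(attn_scores, budget, recency_decay=0.95):
    """Hypothetical RASR-style selection: combine token relevance
    (mean attention a token receives across query positions) with a
    recency factor, then keep the `budget` highest-scoring tokens.

    attn_scores: array of shape (num_queries, num_tokens).
    Returns sorted indices of tokens to retain."""
    num_tokens = attn_scores.shape[-1]
    relevance = attn_scores.mean(axis=0)       # evolving attention signal
    recency = recency_decay ** np.arange(num_tokens - 1, -1, -1)
    score = relevance * recency                # assumed blending rule
    keep = np.argsort(score)[-budget:]         # top-`budget` tokens
    return np.sort(keep)
```

With uniform attention, the recency term dominates and the rule degenerates to keeping the most recent tokens, matching the abstract's claim that RASR extends, rather than discards, recency-based heuristics.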