2025-11-14

Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving

Authors: Hui Zeng, Daming Zhao, Pengfei Yang, WenXuan Hou, Tianyang Zheng, Hui Li, Weiye Ji, Jidong Zhai

2025-11-08

http://arxiv.org/abs/2511.06029v2

Generative reasoning with large language models (LLMs) often involves long output sequences, leading to substantial memory and latency overheads from accumulating key-value (KV) caches. While existing KV cache methods primarily focus on reducing KV memory from long input sequences, they fall short in addressing the dynamic and layer-sensitive nature of long-form generation, which is central to reasoning tasks. We propose Lethe, a dynamic KV cache management framework that introduces adaptivity along both the spatial and temporal dimensions of cache pruning. Along the spatial dimension, Lethe performs layerwise, redundancy-aware allocation, assigning token cache budgets to each layer based on estimated attention redundancy. Along the temporal dimension, Lethe conducts multi-round token pruning during generation, driven by a Recency-Aware Selective Retention (RASR) mechanism. RASR extends traditional recency-based heuristics by also considering token relevance derived from evolving attention patterns, enabling informed decisions about which tokens to retain or evict. Empirical results demonstrate that Lethe achieves a favorable balance between efficiency and generation quality across diverse models and tasks, and increases throughput by up to 2.56x.
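The two adaptive dimensions the abstract describes can be sketched in code. The following is a minimal illustration, not the authors' implementation: all function names, the multiplicative recency-times-relevance score, and the proportional budget split are assumptions made for clarity. It shows (a) a layerwise budget allocator that gives more KV cache budget to layers with less attention redundancy, and (b) a RASR-style retention rule that blends per-token relevance (attention received) with a recency decay.

```python
import numpy as np

def allocate_layer_budgets(redundancy, total_budget):
    """Hypothetical layerwise allocation: layers whose attention is
    estimated to be more redundant receive a smaller share of the
    total token cache budget."""
    weights = 1.0 - np.asarray(redundancy, dtype=float)
    weights = weights / weights.sum()          # normalize shares
    return np.floor(weights * total_budget).astype(int)

def rasr_retention(attn_scores, budget, recency_decay=0.95):
    """Hypothetical RASR-style selection: combine token relevance
    (mean attention a token receives across query positions) with a
    recency factor, then keep the `budget` highest-scoring tokens.

    attn_scores: array of shape (num_queries, num_tokens).
    Returns sorted indices of tokens to retain."""
    num_tokens = attn_scores.shape[-1]
    relevance = attn_scores.mean(axis=0)       # evolving attention signal
    recency = recency_decay ** np.arange(num_tokens - 1, -1, -1)
    score = relevance * recency                # assumed blending rule
    keep = np.argsort(score)[-budget:]         # top-`budget` tokens
    return np.sort(keep)
```

With uniform attention, the recency term dominates and the rule degenerates to keeping the most recent tokens, matching the abstract's claim that RASR extends, rather than discards, recency-based heuristics.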