AdaSkip | AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference
AdaptiveSparseTrainer | Pruning Large Language Models with Semi-Structural Adaptive Sparse Training
NSA | Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
COMET | COMET: Towards Practical W4A4KV4 LLMs Serving
POD-Attention | POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
vAttention | vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
BlockFFN | BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
KVSink | KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs
SDS | Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism
CacheBlend | CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
FlexPrefill | FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
FoX | Forgetting Transformer: Softmax Attention with a Forget Gate
R-Sparse | R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference
ReAttention | ReAttention: Training-Free Infinite Context with Finite Attention Scope
RecursiveTransformers | Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA
TEAL | Training-Free Activation Sparsity in Large Language Models
AdaSplash | AdaSplash: Adaptive Sparse Flash Attention
BaWA | BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation
CateKV | CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration
PoD | Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity
HashAttention | HashAttention: Semantic Sparsity for Faster Inference
LaRoSA | La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation
MMInference | MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
ShadowKV | ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
SlimLLM | SlimLLM: Accurate Structured Pruning for Large Language Models
SpargeAttn | SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
SparsingLaw | Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
StarAttention | Star Attention: Efficient LLM Inference over Long Sequences
XAttention | XAttention: Block Sparse Attention with Antidiagonal Scoring
TorchAO | TorchAO: PyTorch-Native Training-to-Serving Model Optimization
AMALI | AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs
SpecEE | SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting
MoE-MLA-RoPE | Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models
NanoFlow | NanoFlow: Towards Optimal Large Language Model Serving Throughput
LinearPatch | A Simple Linear Patch Revives Layer-Pruned Large Language Models
Acc-SpMM | Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores
07NWF4VE | Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching
SharePrefill | Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing
ACP | Adaptive Computation Pruning for the Forgetting Transformer
FlexiDepth | Adaptive Layer-skipping in Pre-trained LLMs
AhaKV | AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models
AmberPruner | Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models
AttentionPredictor | AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference
CCQ | CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs
UC0D8DJ6 | Characterizing Communication Patterns in Distributed Large Language Model Inference
1DZIJVBI | Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications
ChunkKV | ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
CometSeed | Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
DBudgetKV | DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance
DeepSeek-R1 | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeltaAttention | Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction
DeltaLLM | DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference
LIMINAL | Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need
RaaS | Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity
topk-decoding | Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs
2ZU1IWL6 | Fast and Simplex: 2-Simplicial Attention in Triton
FastKV | FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
FlashInfer | FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
FlashOverlap | FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation
FreqKV | FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension
HATA | HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference
HCAttention | HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
GLA | Hardware-Efficient Attention for Fast Decoding
HelixParallelism | Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding
Adrenaline | Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation
IFPruning | Instruction-Following Pruning for Large Language Models
209M5GA7 | KV Cache Compression for Inference Efficiency in LLMs: A Review
KVLink | KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse
KeepKV | KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference
LServer | LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
LeanK | LeanK: Learnable K Cache Channel Pruning for Efficient Decoding
MIRAGE | MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving
MegaScale-MoE | MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production
MiniCPM4 | MiniCPM4: Ultra-Efficient LLMs on End Devices
52A7RO95 | Mixture of Experts in Large Language Models
MoSA | Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
MoR | Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
MoBA | MoBA: Mixture of Block Attention for Long-Context LLMs
Mosaic | Mosaic: Composite Projection Pruning for Resource-efficient LLMs
PanguUltra | Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs
PowerAttention | PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention
PSA | Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving
Cus-Prun | Pruning General Large Language Models into Customized Expert Models
QuickSilver | QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization
Qwen3 | Qwen3 Technical Report
R-KV | R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration
RadialAttention | Radial Attention: Sparse Attention with Energy Decay for Long Video Generation
ReSA | Rectified Sparse Attention
0VRXJQ3F | Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving
SALE | SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling
SEAP | SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models
SageAttention3 | SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
SeerAttention-R | SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
Seesaw | Seesaw: High-throughput LLM Inference via Model Re-sharding
Awesome-Efficient-Arch | Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
SpindleKV | SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers
Step-3 | Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding
Task-KV | Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads
sparse-frontier | The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
TileLink | TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
TokenWeave | TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
Triton-distributed | Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler
Super-Experts-Profilling | Unveiling Super Experts in Mixture-of-Experts Large Language Models
attention-gym | Attention-Gym: Triton-Based Sparse and Quantization Attention
KVCache-Factory | Unified KV Cache Compression Methods for Auto-Regressive Models
kvpress | kvpress: LLM KV cache compression made easy