
01-Sparsity (Attention)

Meta Title Cover Publish Code Note
DSA Transformer Acceleration with Dynamic Sparse Attention cover Publish note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
streaming-llm Efficient Streaming Language Models with Attention Sinks cover Publish GitHub Repo stars note
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
PWGG5HBE A Survey on Large Language Model Acceleration based on KV Cache Management cover Publish GitHub Repo stars note
AdaKV Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference cover Publish GitHub Repo stars note
SharedAttention Beyond KV Caching: Shared Attention for Efficient LLMs cover Publish GitHub Repo stars note
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
FlashMask FlashMask: Efficient and Rich Mask Extension of FlashAttention cover Publish GitHub Repo stars note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note
Recycled Attention Recycled Attention: Efficient inference for long-context language models cover Publish GitHub Repo stars note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
SnapKV SnapKV: LLM Knows What You are Looking for Before Generation cover Publish GitHub Repo stars note
TOVA Transformers are Multi-State RNNs cover Publish GitHub Repo stars note
ZigZagKV ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty cover Publish note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
KVSink KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs cover Publish note
FlexPrefill FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference cover Publish GitHub Repo stars note
ReAttention ReAttention: Training-Free Infinite Context with Finite Attention Scope cover Publish GitHub Repo stars note
AdaSplash AdaSplash: Adaptive Sparse Flash Attention Publish GitHub Repo stars note
CateKV CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration cover Publish note
PoD Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity cover Publish note
HashAttention HashAttention: Semantic Sparsity for Faster Inference cover Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
StarAttention Star Attention: Efficient LLM Inference over Long Sequences cover Publish GitHub Repo stars note
XAttention XAttention: Block Sparse Attention with Antidiagonal Scoring cover Publish GitHub Repo stars note
07NWF4VE Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching cover Publish note
SharePrefill Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing cover Publish note
ACP Adaptive Computation Pruning for the Forgetting Transformer Publish GitHub Repo stars note
AhaKV AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models Publish note
AttentionPredictor AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference cover Publish note
ChunkKV ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference cover Publish note
DBudgetKV DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance cover Publish note
DeltaAttention Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction cover Publish note
DeltaLLM DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference cover Publish note
RaaS Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity cover Publish note
topk-decoding Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs cover Publish GitHub Repo stars note
FastKV FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note
FreqKV FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension cover Publish note
HATA HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference cover Publish GitHub Repo stars note
HCAttention HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs cover Publish note
KVLink KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse cover Publish GitHub Repo stars note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note
LeanK LeanK: Learnable K Cache Channel Pruning for Efficient Decoding cover Publish GitHub Repo stars note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note
MoSA Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing cover Publish GitHub Repo stars note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note
MoBA MoBA: Mixture of Block Attention for Long-Context LLMs cover Publish GitHub Repo stars note
PowerAttention PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention cover Publish note
PSA Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving cover Publish GitHub Repo stars note
QuickSilver QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization Publish note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
0VRXJQ3F Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving cover Publish GitHub Repo stars note
SALE SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note
Awesome-Efficient-Arch Speed Always Wins: A Survey on Efficient Architectures for Large Language Models cover Publish GitHub Repo stars note
SpindleKV SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers cover Publish GitHub Repo stars note
Task-KV Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads cover Publish note
sparse-frontier The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs cover Publish GitHub Repo stars note
attention-gym Attention-Gym: Triton-Based Sparse and Quantization Attention Publish GitHub Repo stars note
KVCache-Factory Unified KV Cache Compression Methods for Auto-Regressive Models Publish GitHub Repo stars note
kvpress kvpress: LLM KV cache compression made easy Publish GitHub Repo stars note
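Many of the entries above (e.g. Quest-style query-aware selection and top-k decoding) share one core mechanism: score all cached keys against the current query, keep only a small budget, and run softmax over that subset. A minimal sketch of that idea is below; the shapes, the fixed budget `k`, and the function name are illustrative assumptions, not any particular paper's implementation.

```python
# Minimal sketch of query-aware top-k sparse attention at decode time.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, K, V, k=64):
    """q: (d,), K/V: (T, d). Attend only to the k keys with the highest q·K score."""
    scores = K @ q / K.shape[-1] ** 0.5          # (T,) scaled dot-product scores
    k = min(k, scores.shape[0])
    idx = torch.topk(scores, k).indices          # indices of the k heaviest keys
    probs = F.softmax(scores[idx], dim=-1)       # softmax over the selected subset only
    return probs @ V[idx]                        # (d,) sparse attention output

q = torch.randn(128)
K, V = torch.randn(4096, 128), torch.randn(4096, 128)
out = topk_sparse_attention(q, K, V, k=64)
```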

02-Sparsity (Activation)

Meta Title Cover Publish Code Note
SparseViT SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer cover Publish GitHub Repo stars note
m The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers Publish
LLM in a flash LLM in a flash: Efficient Large Language Model Inference with Limited Memory cover Publish note
PowerInfer PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU Publish GitHub Repo stars note
CATS CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models cover Publish GitHub Repo stars note
SparseInfer SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference Publish note
SCAP Post-Training Statistical Calibration for Higher Activation Sparsity cover Publish GitHub Repo stars note
SAS SAS: Structured Activation Sparsification cover Publish GitHub Repo stars note
ReLU Strikes Back ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models cover Publish GitHub Repo stars
SMAT Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts cover Publish GitHub Repo stars note
CoreInfer CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation cover Publish GitHub Repo stars note
PowerInfer-2 PowerInfer-2: Fast Large Language Model Inference on a Smartphone Publish Website note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note
Q-Sparse Q-Sparse: All Large Language Models can be Fully Sparsely-Activated cover Publish note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
Turbo Sparse Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters Publish Pytorch note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
R-Sparse R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference cover Publish GitHub Repo stars note
TEAL Training-Free Activation Sparsity in Large Language Models cover Publish GitHub Repo stars note
LaRoSA La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation cover Publish note
SparsingLaw Sparsing Law: Towards Large Language Models with Greater Activation Sparsity cover Publish GitHub Repo stars note
AmberPruner Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models Publish note
IFPruning Instruction-Following Pruning for Large Language Models cover Publish note
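A recurring recipe in this section (CATS, TEAL, and related training-free methods) is to zero out low-magnitude hidden activations before the FFN down-projection so the corresponding rows can be skipped. The sketch below shows only that thresholding step; the 90% sparsity target and shapes are illustrative assumptions.

```python
# Minimal sketch of training-free activation sparsity via per-token magnitude thresholding.
import torch

def sparsify_activations(h, sparsity=0.9):
    """h: (tokens, d_ff). Keep only the largest-|h| entries in each token."""
    k = max(1, int(h.shape[-1] * (1 - sparsity)))
    thresh = h.abs().topk(k, dim=-1).values[..., -1:]   # per-token k-th largest magnitude
    return h * (h.abs() >= thresh)                      # zero everything below it

h = torch.randn(4, 11008)                 # post-activation FFN hidden states
h_sparse = sparsify_activations(h)
down_proj = torch.randn(11008, 4096)
y = h_sparse @ down_proj                  # zeroed entries contribute nothing
```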

03-Sparsity (Weight)

Meta Title Cover Publish Code Note
oBERT The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models Publish GitHub Repo stars
SparseGPT SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot Publish GitHub Repo stars
nmSPARSE Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning Publish GitHub Repo stars
VENOM VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores cover Publish GitHub Repo stars note
Wanda A Simple and Effective Pruning Approach for Large Language Models cover Publish GitHub Repo stars note
DSnoT Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs cover Publish GitHub Repo stars note
RIA Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models cover Publish GitHub Repo stars
MaskLLM MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models cover Publish GitHub Repo stars note
SparseLLM SparseLLM: Towards Global Pruning for Pre-trained Language Models Publish GitHub Repo stars note
ADMM-pruning Fast and Effective Weight Update for Pruned Large Language Models Publish GitHub Repo stars note
m Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs cover Publish note
AdaptiveSparseTrainer Pruning Large Language Models with Semi-Structural Adaptive Sparse Training cover Publish GitHub Repo stars note
SDS Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism cover Publish note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note
TorchAO TorchAO: PyTorch-Native Training-to-Serving Model Optimization cover Publish GitHub Repo stars note
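Several weight-sparsity entries above target the hardware-friendly 2:4 (N:M) pattern. The sketch below shows only the masking mechanics with plain magnitude as the saliency score; SparseGPT, Wanda, and the other listed methods use more careful metrics and weight updates.

```python
# Minimal sketch of 2:4 semi-structured magnitude pruning.
import torch

def prune_2_4(W):
    """W: (out, in) with `in` divisible by 4. Keep the 2 largest-|w| weights per group of 4."""
    out_dim, in_dim = W.shape
    groups = W.reshape(out_dim, in_dim // 4, 4)
    keep = groups.abs().topk(2, dim=-1).indices            # 2 survivors per group
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
    return (groups * mask).reshape(out_dim, in_dim)

W = torch.randn(4096, 4096)
W_sparse = prune_2_4(W)
assert (W_sparse == 0).float().mean().item() == 0.5        # exactly 50% zeros
```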

04-Sparsity (Structured)

Meta Title Cover Publish Code Note
FisherPruning A Fast Post-Training Pruning Framework for Transformers cover Publish GitHub Repo stars note
SIMPLE Structured Pruning for Efficient Generative Pre-trained Language Models cover Publish note
m Structural Pruning of Large Language Models via Neural Architecture Search cover Publish GitHub Repo stars
Deja Vu Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time cover Publish GitHub Repo stars
LoSparse Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation cover Publish GitHub Repo stars
ZipLM ZipLM: Inference-Aware Structured Pruning of Language Models cover Publish GitHub Repo stars
Compresso Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models cover Publish GitHub Repo stars note
KCM Gradient-Free Structured Pruning with Unlabeled Data cover Publish
K-pruning Knowledge-preserving Pruning for Pre-trained Language Models without Retraining cover Publish note
LLM-Pruner LLM-Pruner: On the Structural Pruning of Large Language Models cover Publish GitHub Repo stars note
LoRAShear LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery Publish
LLM-shearing Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning cover Publish GitHub Repo stars note
FLAP Fluctuation-based Adaptive Structured Pruning for Large Language Models cover Publish GitHub Repo stars
SliceGPT SliceGPT: Compress Large Language Models by Deleting Rows and Columns cover Publish GitHub Repo stars note
OSSCAR OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization Publish GitHub Repo stars note
SlimGPT SlimGPT: Layer-wise Structured Pruning for Large Language Models cover Publish note
Minitron Compact Language Models via Pruning and Knowledge Distillation cover Publish GitHub Repo stars note
Bonsai Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes Publish GitHub Repo stars
AdaSkip AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference cover Publish GitHub Repo stars note
SlimLLM SlimLLM: Accurate Structured Pruning for Large Language Models Publish note
SpecEE SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting cover Publish GitHub Repo stars note
LinearPatch A Simple Linear Patch Revives Layer-Pruned Large Language Models cover Publish note
FlexiDepth Adaptive Layer-skipping in Pre-trained LLMs cover Publish note
Mosaic Mosaic: Composite Projection Pruning for Resource-efficient LLMs cover Publish note
Cus-Prun Pruning General Large Language Models into Customized Expert Models cover Publish GitHub Repo stars note
SEAP SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models cover Publish GitHub Repo stars note
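Structured pruning removes whole units (heads, channels, layers) so the resulting matrices shrink physically. The sketch below drops FFN intermediate channels ranked by a simple L2-norm score; real structured pruners above (LLM-Pruner, SliceGPT, SlimGPT, ...) use gradient- or reconstruction-based importance, so this only illustrates the mechanics.

```python
# Minimal sketch of structured pruning of FFN intermediate channels.
import torch

def prune_ffn_channels(W_up, W_down, keep_ratio=0.75):
    """W_up: (d_ff, d_model), W_down: (d_model, d_ff). Remove low-importance channels."""
    importance = W_up.norm(dim=1)                          # one score per hidden channel
    k = int(W_up.shape[0] * keep_ratio)
    keep = torch.topk(importance, k).indices.sort().values
    return W_up[keep], W_down[:, keep]                     # physically smaller matrices

W_up, W_down = torch.randn(11008, 4096), torch.randn(4096, 11008)
W_up_s, W_down_s = prune_ffn_channels(W_up, W_down)
```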

05-Sparse/Pruning

Meta Title Cover Publish Code Note
OBD Optimal Brain Damage Publish
OBS Optimal Brain Surgeon and general network pruning Publish
DSD DSD: Dense-Sparse-Dense Training for Deep Neural Networks Publish
L-OBS Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon Publish GitHub Repo stars
ADMM-pruning A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers Publish GitHub Repo stars
m Fast Sparse ConvNets Publish GitHub Repo stars
m Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference Publish
Movement Pruning Movement Pruning: Adaptive Sparsity by Fine-Tuning cover Publish GitHub Repo stars
blocksparse GPU Kernels for Block-Sparse Weights Publish GitHub Repo stars
OpenVINO Post-training deep neural network pruning via layer-wise calibration Publish
SR-STE Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch cover Publish GitHub Repo stars
m Channel Permutations for N:M Sparsity Publish GitHub Repo stars
NMSparse Accelerating Sparse Deep Neural Networks Publish
m Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks Publish
m Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm Publish GitHub Repo stars
TextPruner TextPruner: A Model Pruning Toolkit for Pre-Trained Language Models cover Publish GitHub Repo stars
m Creating Sparse GPT-3 Models with Iterative Pruning Publish
SPDY SPDY: Accurate Pruning with Speedup Guarantees cover Publish GitHub Repo stars note
Sprint Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation Publish
FisherPruning A Fast Post-Training Pruning Framework for Transformers cover Publish GitHub Repo stars note
OBC Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning Publish GitHub Repo stars
Complementary Sparsity Two Sparsities Are Better Than One: Unlocking the Performance Benefits of Sparse-Sparse Networks cover Publish note
DSA Transformer Acceleration with Dynamic Sparse Attention cover Publish note
STA An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers Publish
oBERT The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models Publish GitHub Repo stars
Diffuser Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences cover Publish GitHub Repo stars
GRAIN Gradient-based Intra-attention Pruning on Pre-trained Language Models cover Publish GitHub Repo stars note
SMP Pruning Pre-trained Language Models Without Fine-Tuning cover Publish GitHub Repo stars
PINS Pruning Pre-trained Language Models with Principled Importance and Self-regularization Publish GitHub Repo stars
SIMPLE Structured Pruning for Efficient Generative Pre-trained Language Models cover Publish note
m Structural Pruning of Large Language Models via Neural Architecture Search cover Publish GitHub Repo stars
SparseViT SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer cover Publish GitHub Repo stars note
TorchSparse++ TorchSparse++: Efficient Point Cloud Engine Publish GitHub Repo stars
MVUE Minimum Variance Unbiased N:M Sparsity for the Neural Gradients Publish
m The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers Publish
Deja Vu Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time cover Publish GitHub Repo stars
SparseGPT SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot Publish GitHub Repo stars
LoSparse Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation cover Publish GitHub Repo stars
nmSPARSE Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning Publish GitHub Repo stars
ZipLM ZipLM: Inference-Aware Structured Pruning of Language Models cover Publish GitHub Repo stars
VENOM VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores cover Publish GitHub Repo stars note
SPDF SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models Publish
GBLM-Pruner Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models cover Publish GitHub Repo stars note
Compresso Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models cover Publish GitHub Repo stars note
Adaptively Sparse Attention Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers cover Publish
KCM Gradient-Free Structured Pruning with Unlabeled Data cover Publish
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
K-pruning Knowledge-preserving Pruning for Pre-trained Language Models without Retraining cover Publish note
LLM in a flash LLM in a flash: Efficient Large Language Model Inference with Limited Memory cover Publish note
LLM-Pruner LLM-Pruner: On the Structural Pruning of Large Language Models cover Publish GitHub Repo stars note
LoRAShear LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery Publish
PowerInfer PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU Publish GitHub Repo stars note
GBDT Pruning Large Language Models via Accuracy Predictor cover Publish
LLM-shearing Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning cover Publish GitHub Repo stars note
SquareHead Sparse Fine-tuning for Inference Acceleration of Large Language Models cover Publish GitHub Repo stars
Sparse-IFT Sparse Iso-FLOP Transformations for Maximizing Training Efficiency Publish GitHub Repo stars
SMS Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging cover Publish GitHub Repo stars
m Ten Lessons We Have Learned in the New Sparseland: A Short Handbook for Sparse Neural Network Researchers Publish
Essential Sparsity The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter Publish GitHub Repo stars
Selective Context Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering cover Publish GitHub Repo stars
FLAP Fluctuation-based Adaptive Structured Pruning for Large Language Models cover Publish GitHub Repo stars
CATS CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models cover Publish GitHub Repo stars note
SparseInfer SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference Publish note
SCAP Post-Training Statistical Calibration for Higher Activation Sparsity cover Publish GitHub Repo stars note
Wanda A Simple and Effective Pruning Approach for Large Language Models cover Publish GitHub Repo stars note
LLM-KICK Compressing LLMs: The Truth is Rarely Pure and Never Simple cover Publish GitHub Repo stars note
DSnoT Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs cover Publish GitHub Repo stars note
streaming-llm Efficient Streaming Language Models with Attention Sinks cover Publish GitHub Repo stars note
RIA Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models cover Publish GitHub Repo stars
SAS SAS: Structured Activation Sparsification cover Publish GitHub Repo stars note
SliceGPT SliceGPT: Compress Large Language Models by Deleting Rows and Columns cover Publish GitHub Repo stars note
ReLU Strikes Back ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models cover Publish GitHub Repo stars
m Accelerating Transformer Pre-training with 2:4 Sparsity Publish GitHub Repo stars note
OSSCAR OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization Publish GitHub Repo stars note
OWL Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity cover Publish GitHub Repo stars
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
SPP SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models cover Publish GitHub Repo stars note
SparQ SparQ Attention: Bandwidth-Efficient LLM Inference Publish note
Sparse-IFT Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency cover Publish GitHub Repo stars note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
MaskLLM MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models cover Publish GitHub Repo stars note
SlimGPT SlimGPT: Layer-wise Structured Pruning for Large Language Models cover Publish note
SparseLLM SparseLLM: Towards Global Pruning for Pre-trained Language Models Publish GitHub Repo stars note
ADMM-pruning Fast and Effective Weight Update for Pruned Large Language Models Publish GitHub Repo stars note
Flash-LLM Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity cover Publish GitHub Repo stars note
PWGG5HBE A Survey on Large Language Model Acceleration based on KV Cache Management cover Publish GitHub Repo stars note
AVSS AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis cover Publish note
AdaKV Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference cover Publish GitHub Repo stars note
m Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs cover Publish note
SharedAttention Beyond KV Caching: Shared Attention for Efficient LLMs cover Publish GitHub Repo stars note
Minitron Compact Language Models via Pruning and Knowledge Distillation cover Publish GitHub Repo stars note
CoreInfer CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation cover Publish GitHub Repo stars note
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
m Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment Publish GitHub Repo stars note
Bonsai Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes Publish GitHub Repo stars
FlashMask FlashMask: Efficient and Rich Mask Extension of FlashAttention cover Publish GitHub Repo stars note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
CHESS Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification Publish Pytorch note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note
PowerInfer-2 PowerInfer-2: Fast Large Language Model Inference on a Smartphone Publish Website note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note
Q-Sparse Q-Sparse: All Large Language Models can be Fully Sparsely-Activated cover Publish note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
Recycled Attention Recycled Attention: Efficient inference for long-context language models cover Publish GitHub Repo stars note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
ShadowLLM ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models cover Publish GitHub Repo stars note
SnapKV SnapKV: LLM Knows What You are Looking for Before Generation cover Publish GitHub Repo stars note
TOVA Transformers are Multi-State RNNs cover Publish GitHub Repo stars note
Turbo Sparse Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters Publish Pytorch note
ZigZagKV ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty cover Publish note
AdaSkip AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference cover Publish GitHub Repo stars note
AdaptiveSparseTrainer Pruning Large Language Models with Semi-Structural Adaptive Sparse Training cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
SDS Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism cover Publish note
FlexPrefill FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference cover Publish GitHub Repo stars note
R-Sparse R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference cover Publish GitHub Repo stars note
ReAttention ReAttention: Training-Free Infinite Context with Finite Attention Scope cover Publish GitHub Repo stars note
TEAL Training-Free Activation Sparsity in Large Language Models cover Publish GitHub Repo stars note
AdaSplash AdaSplash: Adaptive Sparse Flash Attention Publish GitHub Repo stars note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note
CateKV CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration cover Publish note
PoD Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity cover Publish note
HashAttention HashAttention: Semantic Sparsity for Faster Inference cover Publish GitHub Repo stars note
LaRoSA La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation cover Publish note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
SlimLLM SlimLLM: Accurate Structured Pruning for Large Language Models Publish note
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
SparsingLaw Sparsing Law: Towards Large Language Models with Greater Activation Sparsity cover Publish GitHub Repo stars note
XAttention XAttention: Block Sparse Attention with Antidiagonal Scoring cover Publish GitHub Repo stars note
SpecEE SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting cover Publish GitHub Repo stars note
LinearPatch A Simple Linear Patch Revives Layer-Pruned Large Language Models cover Publish note
Acc-SpMM Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores cover Publish note
07NWF4VE Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching cover Publish note
SharePrefill Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing cover Publish note
ACP Adaptive Computation Pruning for the Forgetting Transformer Publish GitHub Repo stars note
FlexiDepth Adaptive Layer-skipping in Pre-trained LLMs cover Publish note
AhaKV AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models Publish note
AmberPruner Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models Publish note
AttentionPredictor AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference cover Publish note
ChunkKV ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference cover Publish note
DBudgetKV DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance cover Publish note
DeltaAttention Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction cover Publish note
DeltaLLM DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference cover Publish note
RaaS Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity cover Publish note
topk-decoding Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs cover Publish GitHub Repo stars note
FastKV FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note
HATA HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference cover Publish GitHub Repo stars note
HCAttention HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs cover Publish note
IFPruning Instruction-Following Pruning for Large Language Models cover Publish note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note
LeanK LeanK: Learnable K Cache Channel Pruning for Efficient Decoding cover Publish GitHub Repo stars note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note
MoSA Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing cover Publish GitHub Repo stars note
MoBA MoBA: Mixture of Block Attention for Long-Context LLMs cover Publish GitHub Repo stars note
Mosaic Mosaic: Composite Projection Pruning for Resource-efficient LLMs cover Publish note
PowerAttention PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention cover Publish note
PSA Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving cover Publish GitHub Repo stars note
Cus-Prun Pruning General Large Language Models into Customized Expert Models cover Publish GitHub Repo stars note
QuickSilver QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization Publish note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
0VRXJQ3F Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving cover Publish GitHub Repo stars note
SALE SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling cover Publish GitHub Repo stars note
SEAP SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note
Awesome-Efficient-Arch Speed Always Wins: A Survey on Efficient Architectures for Large Language Models cover Publish GitHub Repo stars note
SpindleKV SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers cover Publish GitHub Repo stars note
Task-KV Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads cover Publish note
Super-Experts-Profilling Unveiling Super Experts in Mixture-of-Experts Large Language Models cover Publish GitHub Repo stars note
attention-gym Attention-Gym: Triton-Based Sparse and Quantization Attention Publish GitHub Repo stars note
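For reference, the second-order saliency that the classical entries above (OBD, OBS, L-OBS, oBERT, and the OBC/SparseGPT line) build on can be stated compactly; the notation below is the standard textbook form, not copied from any one of the listed papers.

```latex
% With loss $L$, weights $w$, and Hessian $H = \partial^2 L / \partial w^2$,
% OBD assumes a diagonal Hessian and scores weight $w_q$ as
\[
  s_q \;=\; \tfrac{1}{2}\, H_{qq}\, w_q^2 ,
\]
% while OBS removes $w_q$ and optimally updates the remaining weights:
\[
  \delta w \;=\; -\,\frac{w_q}{[H^{-1}]_{qq}}\, H^{-1} e_q ,
  \qquad
  s_q \;=\; \frac{w_q^2}{2\,[H^{-1}]_{qq}} .
\]
```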

06-Quantization

Meta Title Cover Publish Code Note
Deep Compression Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding Publish
ActNN ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training Publish GitHub Repo stars
BRECQ BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction Publish GitHub Repo stars
GPFQ A Greedy Algorithm for Quantizing Neural Networks Publish GitHub Repo stars
OBC Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning Publish GitHub Repo stars
ZeroQuant ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers Publish GitHub Repo stars
GPTQ GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers Publish GitHub Repo stars
LoftQ LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models cover Publish GitHub Repo stars note
OmniQuant OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models cover Publish GitHub Repo stars
GPFQv2 Post-training Quantization for Neural Networks with Provable Guarantees Publish GitHub Repo stars
QLoRA QLoRA: Efficient Finetuning of Quantized LLMs cover Publish GitHub Repo stars
QuIP QuIP: Quantization with Incoherence Processing Publish GitHub Repo stars
RPTQ RPTQ: Reorder-based Post-training Quantization for Large Language Models Publish GitHub Repo stars note
SpQR SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression Publish GitHub Repo stars
m Training Transformers with 4-bit Integers Publish GitHub Repo stars
ZeroQuant-V2 ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation Publish GitHub Repo stars
QA-LoRA QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models cover Publish GitHub Repo stars note
FrameQuant FrameQuant: Flexible Low-Bit Quantization for Transformers cover Publish GitHub Repo stars note
SqueezeLLM SqueezeLLM: Dense-and-Sparse Quantization cover Publish GitHub Repo stars note
AWQ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration Publish GitHub Repo stars
KVQuant KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization Publish GitHub Repo stars note
L4Q L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ cover Publish note
MiniKV MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache cover Publish note
Q-Sparse Q-Sparse: All Large Language Models can be Fully Sparsely-Activated cover Publish note
QServe QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Publish Pytorch note
SageAttention2 SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization Publish GitHub Repo stars note
SageAttention SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration Publish GitHub Repo stars note
COMET COMET: Towards Practical W4A4KV4 LLMs Serving cover Publish note
TorchAO TorchAO: PyTorch-Native Training-to-Serving Model Optimization cover Publish GitHub Repo stars note
CCQ CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs Publish note
SageAttention3 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Publish GitHub Repo stars note
attention-gym Attention-Gym: Triton-Based Sparse and Quantization Attention Publish GitHub Repo stars note
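The baseline that most post-training quantization entries above improve on is plain round-to-nearest symmetric quantization with a per-channel scale. The sketch below shows that baseline only, with no calibration or error compensation; shapes and the INT8 target are illustrative assumptions.

```python
# Minimal sketch of symmetric per-output-channel INT8 weight quantization.
import torch

def quantize_int8(W):
    """W: (out, in). Returns int8 weights plus one fp32 scale per output channel."""
    scale = W.abs().amax(dim=1, keepdim=True) / 127.0       # symmetric range per row
    q = torch.clamp(torch.round(W / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

W = torch.randn(4096, 4096)
q, scale = quantize_int8(W)
err = (dequantize(q, scale) - W).abs().mean()               # round-to-nearest error
```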

07-Communication-Computation Overlap

Meta Title Cover Publish Code Note
CoCoNet Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads cover Publish GitHub Repo stars note
XLA Overlap Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models cover Publish note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note
T3 T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives cover Publish note
DistributedGEMM A novel CUTLASS-based implementation of Tensor Parallelism for NVLink-enabled systems cover Publish GitHub Repo stars note
Domino Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping cover Publish GitHub Repo stars note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
Async-TP [Distributed w/ TorchTitan] Introducing Async Tensor Parallelism in PyTorch cover Publish GitHub Repo stars note
NanoFlow NanoFlow: Towards Optimal Large Language Model Serving Throughput cover Publish GitHub Repo stars note
UC0D8DJ6 Characterizing Communication Patterns in Distributed Large Language Model Inference Publish note
1DZIJVBI Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications cover Publish note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
FlashOverlap FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation cover Publish GitHub Repo stars note
HelixParallelism Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding Publish note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note
Seesaw Seesaw: High-throughput LLM Inference via Model Re-sharding cover Publish note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
TokenWeave TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note
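The common thread in this section is hiding collective communication behind independent computation. A minimal sketch of that pattern with an asynchronous all-reduce is below; it assumes the script is launched with torchrun so the default process group can initialize, and the tensor sizes are illustrative. The fused-kernel systems listed above go much further than this handle-based overlap.

```python
# Minimal sketch of communication-computation overlap via an async all-reduce.
import torch
import torch.distributed as dist

def overlapped_step(partial_grad, independent_input, weight):
    handle = dist.all_reduce(partial_grad, async_op=True)   # launch communication
    local_out = independent_input @ weight                   # independent compute overlaps it
    handle.wait()                                            # block only when the result is needed
    return local_out, partial_grad

if __name__ == "__main__":
    dist.init_process_group(backend="gloo")                  # e.g. via `torchrun --nproc_per_node=2 ...`
    g = torch.randn(1024, 1024)
    x, w = torch.randn(64, 1024), torch.randn(1024, 1024)
    out, g_reduced = overlapped_step(g, x, w)
```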

08-Performance Modeling

Meta Title Cover Publish Code Note
Vidur Vidur: A Large-Scale Simulation Framework For LLM Inference cover Publish GitHub Repo stars note
APEX APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving Publish GitHub Repo stars note
AMALI AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs cover Publish note
LIMINAL Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need Publish note
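The simulators and analytical models above refine estimates of the kind sketched here: per-token decode latency is bounded below by the larger of compute time and memory-traffic time (a roofline-style bound). The hardware numbers and the 2-FLOPs-per-parameter rule of thumb below are stand-in assumptions, not measurements.

```python
# Minimal sketch of a roofline-style per-token decode latency estimate.
def decode_latency_s(n_params, kv_bytes, peak_flops, mem_bw_bytes_s):
    flops = 2 * n_params                      # ~2 FLOPs per parameter per generated token
    bytes_moved = 2 * n_params + kv_bytes     # fp16 weights plus KV-cache reads
    return max(flops / peak_flops, bytes_moved / mem_bw_bytes_s)

# e.g. a 7B-parameter model on a GPU with 300 TFLOP/s compute and 2 TB/s HBM bandwidth:
t = decode_latency_s(7e9, kv_bytes=2e9, peak_flops=3e14, mem_bw_bytes_s=2e12)
print(f"~{t * 1e3:.2f} ms/token (memory-bound estimate)")
```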

09-LLM Deployment

Meta Title Cover Publish Code Note
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
m Efficient Guided Generation for Large Language Models cover Publish note
ChunkAttention ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition cover Publish GitHub Repo stars note
CachedAttention Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention cover Publish note
EAGLE EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty cover Publish GitHub Repo stars note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note
YS9YTT55 LLM Inference Serving: Survey of Recent Advances and Opportunities Publish note
XGrammar XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models cover Publish GitHub Repo stars note
POD-Attention POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference cover Publish GitHub Repo stars note
vAttention vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention cover Publish GitHub Repo stars note
HelixParallelism Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding Publish note
Adrenaline Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation cover Publish GitHub Repo stars note
MIRAGE MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving Publish note
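PagedAttention-style serving systems manage the KV cache as a pool of fixed-size blocks plus a per-sequence block table, so memory is allocated on demand and freed without fragmentation. The sketch below shows only that bookkeeping; block size, pool size, and class names are illustrative assumptions rather than vLLM's actual data structures.

```python
# Minimal sketch of block-table bookkeeping for a paged KV cache.
BLOCK_SIZE = 16

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # pool of physical blocks
        self.block_tables = {}                       # seq_id -> [physical block ids]
        self.seq_lens = {}

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:                 # current block full (or none yet)
            table.append(self.free_blocks.pop())     # allocate one more physical block
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % BLOCK_SIZE        # where this token's KV lands

    def free(self, seq_id):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=1024)
for _ in range(40):
    block, offset = cache.append_token(seq_id=0)     # 40 tokens -> 3 blocks of 16
cache.free(0)
```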

10-Survey

Meta Title Cover Publish Code Note
m Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks Publish
m Efficient Methods for Natural Language Processing: A Survey cover Publish
m A Survey on Evaluation of Large Language Models cover Publish
m A Survey on Model Compression for Large Language Models cover Publish
m Ten Lessons We Have Learned in the New Sparseland: A Short Handbook for Sparse Neural Network Researchers Publish
m Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption cover Publish GitHub Repo stars note
LLM-KICK Compressing LLMs: The Truth is Rarely Pure and Never Simple cover Publish GitHub Repo stars note
m A Survey on Efficient Inference for Large Language Models cover Publish note
068ZPAME A Survey on Inference Optimization Techniques for Mixture of Experts Models cover Publish GitHub Repo stars note
PWGG5HBE A Survey on Large Language Model Acceleration based on KV Cache Management cover Publish GitHub Repo stars note
YS9YTT55 LLM Inference Serving: Survey of Recent Advances and Opportunities Publish note
massive-activations Massive Activations in Large Language Models cover Publish GitHub Repo stars note
209M5GA7 KV Cache Compression for Inference Efficiency in LLMs: A Review Publish note
52A7RO95 Mixture of Experts in Large Language Models cover Publish note
0VRXJQ3F Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving cover Publish GitHub Repo stars note
Awesome-Efficient-Arch Speed Always Wins: A Survey on Efficient Architectures for Large Language Models cover Publish GitHub Repo stars note
sparse-frontier The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs cover Publish GitHub Repo stars note

11-Network Structure Design

Meta Title Cover Publish Code Note
CodeGeeX CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X Publish GitHub Repo stars note
068ZPAME A Survey on Inference Optimization Techniques for Mixture of Experts Models cover Publish GitHub Repo stars note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
MoD Mixture-of-Depths: Dynamically allocating compute in transformer-based language models cover Publish note
MFA Multi-matrix Factorization Attention Publish note
ReMoE ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing cover Publish GitHub Repo stars note
CLA Reducing Transformer Key-Value Cache Size with Cross-Layer Attention cover Publish GitHub Repo stars note
AdaSkip AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
FoX Forgetting Transformer: Softmax Attention with a Forget Gate Publish GitHub Repo stars note
RecursiveTransformers Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA cover Publish note
SpecEE SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting cover Publish GitHub Repo stars note
MoE-MLA-RoPE Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note
2ZU1IWL6 Fast and Simplex: 2-Simplicial Attention in Triton Publish note
GLA Hardware-Efficient Attention for Fast Decoding cover Publish GitHub Repo stars note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note
52A7RO95 Mixture of Experts in Large Language Models cover Publish note
MoSA Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing cover Publish GitHub Repo stars note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note
MoBA MoBA: Mixture of Block Attention for Long-Context LLMs cover Publish GitHub Repo stars note
PanguUltra Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs Publish note
Qwen3 Qwen3 Technical Report cover Publish GitHub Repo stars note
Awesome-Efficient-Arch Speed Always Wins: A Survey on Efficient Architectures for Large Language Models cover Publish GitHub Repo stars note
Step-3 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding Publish note
Super-Experts-Profilling Unveiling Super Experts in Mixture-of-Experts Large Language Models cover Publish GitHub Repo stars note
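Many of the architectures above are Mixture-of-Experts designs built around top-k routing: a small router scores experts per token and only the top-k experts run. The sketch below shows that routing skeleton with a dense per-expert loop for clarity; dimensions, top_k, and the lack of capacity limits or load-balancing losses are illustrative simplifications.

```python
# Minimal sketch of a top-k routed Mixture-of-Experts layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)        # routing probabilities
        weights, experts = gate.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                  # dense loop, for clarity only
            for e in range(len(self.experts)):
                mask = experts[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

y = TopKMoE()(torch.randn(32, 512))
```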

12-Low Rank Decomposition

Meta Title Cover Publish Code Note
LoRA LoRA: Low-rank adaptation of large language models cover Publish GitHub Repo stars
AdaLoRA AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning cover Publish GitHub Repo stars
LoSparse Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation cover Publish GitHub Repo stars
LoRAShear LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery Publish
LoftQ LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models cover Publish GitHub Repo stars note
QLoRA QLoRA: Efficient Finetuning of Quantized LLMs cover Publish GitHub Repo stars
QA-LoRA QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models cover Publish GitHub Repo stars note
L4Q L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ cover Publish note
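The LoRA family above shares one forward rule: keep the pretrained weight frozen and add a low-rank update, y = W x + (alpha / r) · B A x, with only A and B trainable. A minimal sketch is below; the init convention (A small random, B zero) and sizes are the common defaults, stated here as assumptions.

```python
# Minimal sketch of a LoRA adapter wrapped around a frozen linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                         # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: starts as identity map
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(4096, 4096), r=8, alpha=16)
y = layer(torch.randn(2, 4096))
```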

13-KV Cache Optimization/Efficient Attention

Meta Title Cover Publish Code Note
DSA Transformer Acceleration with Dynamic Sparse Attention cover Publish note
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
Flash-Decoding Flash-Decoding for long-context inference Publish note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
ChunkAttention ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition cover Publish GitHub Repo stars note
CachedAttention Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention cover Publish note
m Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption cover Publish GitHub Repo stars note
streaming-llm Efficient Streaming Language Models with Attention Sinks cover Publish GitHub Repo stars note
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
SparQ SparQ Attention: Bandwidth-Efficient LLM Inference Publish note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note
PWGG5HBE A Survey on Large Language Model Acceleration based on KV Cache Management cover Publish GitHub Repo stars note
AdaKV Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference cover Publish GitHub Repo stars note
SharedAttention Beyond KV Caching: Shared Attention for Efficient LLMs cover Publish GitHub Repo stars note
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
FlashMask FlashMask: Efficient and Rich Mask Extension of FlashAttention cover Publish GitHub Repo stars note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note
KVQuant KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization Publish GitHub Repo stars note
MiniKV MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache cover Publish note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
MFA Multi-matrix Factorization Attention Publish note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note
Recycled Attention Recycled Attention: Efficient inference for long-context language models cover Publish GitHub Repo stars note
CLA Reducing Transformer Key-Value Cache Size with Cross-Layer Attention cover Publish GitHub Repo stars note
SageAttention2 SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization Publish GitHub Repo stars note
SageAttention SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration Publish GitHub Repo stars note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
SnapKV SnapKV: LLM Knows What You are Looking for Before Generation cover Publish GitHub Repo stars note
TOVA Transformers are Multi-State RNNs cover Publish GitHub Repo stars note
ZigZagKV ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty cover Publish note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
COMET COMET: Towards Practical W4A4KV4 LLMs Serving cover Publish note
POD-Attention POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference cover Publish GitHub Repo stars note
vAttention vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention cover Publish GitHub Repo stars note
KVSink KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs cover Publish note
CacheBlend CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion cover Publish GitHub Repo stars note
FlexPrefill FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference cover Publish GitHub Repo stars note
ReAttention ReAttention: Training-Free Infinite Context with Finite Attention Scope cover Publish GitHub Repo stars note
AdaSplash AdaSplash: Adaptive Sparse Flash Attention Publish GitHub Repo stars note
CateKV CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration cover Publish note
PoD Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity cover Publish note
HashAttention HashAttention: Semantic Sparsity for Faster Inference cover Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
StarAttention Star Attention: Efficient LLM Inference over Long Sequences cover Publish GitHub Repo stars note
XAttention XAttention: Block Sparse Attention with Antidiagonal Scoring cover Publish GitHub Repo stars note
07NWF4VE Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching cover Publish note
SharePrefill Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing cover Publish note
ACP Adaptive Computation Pruning for the Forgetting Transformer Publish GitHub Repo stars note
AhaKV AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models Publish note
AttentionPredictor AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference cover Publish note
ChunkKV ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference cover Publish note
DBudgetKV DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance cover Publish note
DeltaAttention Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction cover Publish note
DeltaLLM DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference cover Publish note
RaaS Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity cover Publish note
topk-decoding Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs cover Publish GitHub Repo stars note
2ZU1IWL6 Fast and Simplex: 2-Simplicial Attention in Triton Publish note
FastKV FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note
FreqKV FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension cover Publish note
HATA HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference cover Publish GitHub Repo stars note
HCAttention HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs cover Publish note
GLA Hardware-Efficient Attention for Fast Decoding cover Publish GitHub Repo stars note
Adrenaline Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation cover Publish GitHub Repo stars note
209M5GA7 KV Cache Compression for Inference Efficiency in LLMs: A Review Publish note
KVLink KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse cover Publish GitHub Repo stars note
KeepKV KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference cover Publish note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note
LeanK LeanK: Learnable K Cache Channel Pruning for Efficient Decoding cover Publish GitHub Repo stars note
MIRAGE MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving Publish note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note
MoSA Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing cover Publish GitHub Repo stars note
MoBA MoBA: Mixture of Block Attention for Long-Context LLMs cover Publish GitHub Repo stars note
PowerAttention PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention cover Publish note
PSA Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving cover Publish GitHub Repo stars note
QuickSilver QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization Publish note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
0VRXJQ3F Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving cover Publish GitHub Repo stars note
SALE SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling cover Publish GitHub Repo stars note
SageAttention3 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note
SpindleKV SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers cover Publish GitHub Repo stars note
Task-KV Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads cover Publish note
KVCache-Factory Unified KV Cache Compression Methods for Auto-Regressive Models Publish GitHub Repo stars note
kvpress kvpress: LLM KV cache compression made easy Publish GitHub Repo stars note
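A large fraction of the KV-cache entries above are eviction policies: keep a recent window plus the keys that have accumulated the most attention mass, and drop the rest. The sketch below shows that selection step in the spirit of the heavy-hitter / SnapKV-style methods; the budget, window, and score bookkeeping are illustrative assumptions.

```python
# Minimal sketch of attention-score-based KV cache eviction.
import torch

def evict_kv(K, V, attn_history, budget=256, recent=32):
    """K, V: (T, d); attn_history: (T,) accumulated attention each key has received."""
    T = K.shape[0]
    if T <= budget:
        return K, V
    recent_idx = torch.arange(T - recent, T)                        # always keep the recent tail
    older_scores = attn_history[: T - recent]
    heavy_idx = torch.topk(older_scores, budget - recent).indices   # heavy hitters among older keys
    keep = torch.cat([heavy_idx, recent_idx]).sort().values
    return K[keep], V[keep]

K, V = torch.randn(1024, 128), torch.randn(1024, 128)
scores = torch.rand(1024)
K_small, V_small = evict_kv(K, V, scores)
```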

14-Layer Fusion (Reduce IO)

Meta Title Cover Publish Code Note
FlashAttention FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness cover Publish GitHub Repo stars
FlashAttention-2 FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning Publish GitHub Repo stars
GLA Hardware-Efficient Attention for Fast Decoding cover Publish GitHub Repo stars note
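The IO-aware kernels above avoid materializing the full score matrix by processing keys block by block with an online softmax. The sketch below reproduces that accumulation for a single query in plain PyTorch (no tiling in SRAM, no masking); block size and names are illustrative, and the final assert checks equivalence to dense attention.

```python
# Minimal sketch of online-softmax (blockwise) attention accumulation.
import torch

def online_softmax_attention(q, K, V, block=256):
    """q: (d,), K/V: (T, d). Equals softmax(qK^T / sqrt(d)) V, computed block by block."""
    d = q.shape[0]
    m = torch.tensor(float("-inf"))          # running max of scores
    l = torch.tensor(0.0)                    # running softmax normalizer
    acc = torch.zeros_like(q)                # running weighted sum of values
    for s in range(0, K.shape[0], block):
        scores = K[s:s + block] @ q / d ** 0.5
        m_new = torch.maximum(m, scores.max())
        correction = torch.exp(m - m_new)    # rescale previously accumulated partial results
        p = torch.exp(scores - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ V[s:s + block]
        m = m_new
    return acc / l

q, K, V = torch.randn(64), torch.randn(2048, 64), torch.randn(2048, 64)
ref = torch.softmax(K @ q / 64 ** 0.5, dim=0) @ V
assert torch.allclose(online_softmax_attention(q, K, V), ref, atol=1e-4)
```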

15-Efficient Training

Meta Title Cover Publish Code Note
LoRA LoRA: Low-rank adaptation of large language models cover Publish GitHub Repo stars
AdaLoRA AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning cover Publish GitHub Repo stars
MeZO Fine-Tuning Language Models with Just Forward Passes Publish GitHub Repo stars note
LoftQ LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models cover Publish GitHub Repo stars note
m Accelerating Transformer Pre-training with 2:4 Sparsity Publish GitHub Repo stars note
LoRA+ LoRA+: Efficient Low Rank Adaptation of Large Models Publish GitHub Repo stars note
SIFT Sparse is Enough in Fine-tuning Pre-trained Large Language Models Publish GitHub Repo stars note
Sparse-IFT Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency cover Publish GitHub Repo stars note
TinyTrain TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge Publish GitHub Repo stars note
LISA LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning Publish note
m Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark Publish GitHub Repo stars note
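The zeroth-order entries above (MeZO and the follow-up benchmark) estimate gradients from two forward passes at theta + eps*z and theta - eps*z, regenerating the same perturbation z from a shared seed so nothing has to be stored. The sketch below shows that update on a toy quadratic; the loss, step sizes, and seeding scheme are illustrative assumptions rather than the papers' training setups.

```python
# Minimal sketch of a MeZO-style (SPSA) zeroth-order parameter update.
import torch

@torch.no_grad()
def mezo_step(params, loss_fn, eps=1e-3, lr=1e-2, seed=0):
    def perturb(scale):
        torch.manual_seed(seed)                        # regenerate the same z each time
        for p in params:
            p.add_(scale * eps * torch.randn_like(p))
    perturb(+1); loss_plus = loss_fn()                 # L(theta + eps*z)
    perturb(-2); loss_minus = loss_fn()                # L(theta - eps*z)
    perturb(+1)                                        # restore theta
    grad_scale = (loss_plus - loss_minus) / (2 * eps)  # projected-gradient estimate
    torch.manual_seed(seed)
    for p in params:
        p.sub_(lr * grad_scale * torch.randn_like(p))  # step along -grad_scale * z

w = torch.nn.Parameter(torch.randn(16))
for step in range(200):
    mezo_step([w], lambda: (w ** 2).sum(), seed=step)
```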

16-Tool

Meta Title Cover Publish Code Note
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
FT FasterTransformer Publish GitHub Repo stars
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
TorchAO TorchAO: PyTorch-Native Training-to-Serving Model Optimization cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note
attention-gym Attention-Gym: Triton-Based Sparse and Quantization Attention Publish GitHub Repo stars note
KVCache-Factory Unified KV Cache Compression Methods for Auto-Regressive Models Publish GitHub Repo stars note
kvpress kvpress: LLM KV cache compression made easy Publish GitHub Repo stars note