
01-Sparsity (Attention)

Meta Title Cover Publish Code Note
DSA Transformer Acceleration with Dynamic Sparse Attention cover Publish note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
streaming-llm Efficient Streaming Language Models with Attention Sinks cover Publish GitHub Repo stars note
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
PWGG5HBE A Survey on Large Language Model Acceleration based on KV Cache Management cover Publish GitHub Repo stars note
AdaKV Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference cover Publish GitHub Repo stars note
SharedAttention Beyond KV Caching: Shared Attention for Efficient LLMs cover Publish GitHub Repo stars note
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
FlashMask FlashMask: Efficient and Rich Mask Extension of FlashAttention cover Publish GitHub Repo stars note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note
Recycled Attention Recycled Attention: Efficient inference for long-context language models cover Publish GitHub Repo stars note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
SnapKV SnapKV: LLM Knows What You are Looking for Before Generation cover Publish GitHub Repo stars note
TOVA Transformers are Multi-State RNNs cover Publish GitHub Repo stars note
ZigZagKV ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty cover Publish note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
KVSink KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs cover Publish note
FlexPrefill FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference cover Publish GitHub Repo stars note
ReAttention ReAttention: Training-Free Infinite Context with Finite Attention Scope cover Publish GitHub Repo stars note
AdaSplash AdaSplash: Adaptive Sparse Flash Attention Publish GitHub Repo stars note
CateKV CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration cover Publish note
PoD Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity cover Publish note
HashAttention HashAttention: Semantic Sparsity for Faster Inference cover Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
StarAttention Star Attention: Efficient LLM Inference over Long Sequences cover Publish GitHub Repo stars note
XAttention XAttention: Block Sparse Attention with Antidiagonal Scoring cover Publish GitHub Repo stars note
07NWF4VE Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching cover Publish note
SharePrefill Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing cover Publish note
ACP Adaptive Computation Pruning for the Forgetting Transformer Publish GitHub Repo stars note
AhaKV AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models Publish note
AttentionPredictor AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference cover Publish note
ChunkKV ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference cover Publish note
DBudgetKV DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance cover Publish note
DeltaAttention Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction cover Publish note
DeltaLLM DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference cover Publish note
RaaS Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity cover Publish note
topk-decoding Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs cover Publish GitHub Repo stars note
FastKV FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note
FreqKV FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension cover Publish note
HATA HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference cover Publish GitHub Repo stars note
HCAttention HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs cover Publish note
KVLink KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse cover Publish GitHub Repo stars note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note
LeanK LeanK: Learnable K Cache Channel Pruning for Efficient Decoding cover Publish GitHub Repo stars note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note
MoSA Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing cover Publish GitHub Repo stars note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note
MoBA MoBA: Mixture of Block Attention for Long-Context LLMs cover Publish GitHub Repo stars note
PowerAttention PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention cover Publish note
PSA Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving cover Publish GitHub Repo stars note
QuickSilver QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization Publish note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
0VRXJQ3F Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving cover Publish GitHub Repo stars note
SALE SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note
Awesome-Efficient-Arch Speed Always Wins: A Survey on Efficient Architectures for Large Language Models cover Publish GitHub Repo stars note
SpindleKV SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers cover Publish GitHub Repo stars note
Task-KV Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads cover Publish note
sparse-frontier The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs cover Publish GitHub Repo stars note
attention-gym Attention-Gym: Triton-Based Sparse and Quantization Attention Publish GitHub Repo stars note
KVCache-Factory Unified KV Cache Compression Methods for Auto-Regressive Models Publish GitHub Repo stars note
kvpress kvpress: LLM KV cache compression made easy Publish GitHub Repo stars note
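Many of the entries above (e.g. Quest-style query-aware selection and top-k decoding) share one core mechanism: score all cached keys against the current query, keep only a small budget, and run softmax over that subset. A minimal sketch of that idea is below; the shapes, the fixed budget `k`, and the function name are illustrative assumptions, not any particular paper's implementation.

```python
# Minimal sketch of query-aware top-k sparse attention at decode time.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, K, V, k=64):
    """q: (d,), K/V: (T, d). Attend only to the k keys with the highest q·K score."""
    scores = K @ q / K.shape[-1] ** 0.5          # (T,) scaled dot-product scores
    k = min(k, scores.shape[0])
    idx = torch.topk(scores, k).indices          # indices of the k heaviest keys
    probs = F.softmax(scores[idx], dim=-1)       # softmax over the selected subset only
    return probs @ V[idx]                        # (d,) sparse attention output

q = torch.randn(128)
K, V = torch.randn(4096, 128), torch.randn(4096, 128)
out = topk_sparse_attention(q, K, V, k=64)
```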

02-Sparsity (Activation)

Meta Title Cover Publish Code Note
SparseViT SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer cover Publish GitHub Repo stars note
m The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers Publish
LLM in a flash LLM in a flash: Efficient Large Language Model Inference with Limited Memory cover Publish note
PowerInfer PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU Publish GitHub Repo stars note
CATS CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models cover Publish GitHub Repo stars note
SparseInfer SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference Publish note
SCAP Post-Training Statistical Calibration for Higher Activation Sparsity cover Publish GitHub Repo stars note
SAS SAS: Structured Activation Sparsification cover Publish GitHub Repo stars note
ReLU Strikes Back ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models cover Publish GitHub Repo stars
SMAT Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts cover Publish GitHub Repo stars note
CoreInfer CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation cover Publish GitHub Repo stars note
PowerInfer-2 PowerInfer-2: Fast Large Language Model Inference on a Smartphone Publish Website note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note
Q-Sparse Q-Sparse: All Large Language Models can be Fully Sparsely-Activated cover Publish note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
Turbo Sparse Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters Publish Pytorch note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
R-Sparse R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference cover Publish GitHub Repo stars note
TEAL Training-Free Activation Sparsity in Large Language Models cover Publish GitHub Repo stars note
LaRoSA La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation cover Publish note
SparsingLaw Sparsing Law: Towards Large Language Models with Greater Activation Sparsity cover Publish GitHub Repo stars note
AmberPruner Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models Publish note
IFPruning Instruction-Following Pruning for Large Language Models cover Publish note
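A recurring recipe in this section (CATS, TEAL, and related training-free methods) is to zero out low-magnitude hidden activations before the FFN down-projection so the corresponding rows can be skipped. The sketch below shows only that thresholding step; the 90% sparsity target and shapes are illustrative assumptions.

```python
# Minimal sketch of training-free activation sparsity via per-token magnitude thresholding.
import torch

def sparsify_activations(h, sparsity=0.9):
    """h: (tokens, d_ff). Keep only the largest-|h| entries in each token."""
    k = max(1, int(h.shape[-1] * (1 - sparsity)))
    thresh = h.abs().topk(k, dim=-1).values[..., -1:]   # per-token k-th largest magnitude
    return h * (h.abs() >= thresh)                      # zero everything below it

h = torch.randn(4, 11008)                 # post-activation FFN hidden states
h_sparse = sparsify_activations(h)
down_proj = torch.randn(11008, 4096)
y = h_sparse @ down_proj                  # zeroed entries contribute nothing
```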

03-Sparsity (Weight)

Meta Title Cover Publish Code Note
oBERT The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models Publish GitHub Repo stars
SparseGPT SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot Publish GitHub Repo stars
nmSPARSE Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning Publish GitHub Repo stars
VENOM VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores cover Publish GitHub Repo stars note
Wanda A Simple and Effective Pruning Approach for Large Language Models cover Publish GitHub Repo stars note
DSnoT Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs cover Publish GitHub Repo stars note
RIA Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models cover Publish GitHub Repo stars
MaskLLM MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models cover Publish GitHub Repo stars note
SparseLLM SparseLLM: Towards Global Pruning for Pre-trained Language Models Publish GitHub Repo stars note
ADMM-pruning Fast and Effective Weight Update for Pruned Large Language Models Publish GitHub Repo stars note
m Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs cover Publish note
AdaptiveSparseTrainer Pruning Large Language Models with Semi-Structural Adaptive Sparse Training cover Publish GitHub Repo stars note
SDS Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism cover Publish note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note
TorchAO TorchAO: PyTorch-Native Training-to-Serving Model Optimization cover Publish GitHub Repo stars note
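Several weight-sparsity entries above target the hardware-friendly 2:4 (N:M) pattern. The sketch below shows only the masking mechanics with plain magnitude as the saliency score; SparseGPT, Wanda, and the other listed methods use more careful metrics and weight updates.

```python
# Minimal sketch of 2:4 semi-structured magnitude pruning.
import torch

def prune_2_4(W):
    """W: (out, in) with `in` divisible by 4. Keep the 2 largest-|w| weights per group of 4."""
    out_dim, in_dim = W.shape
    groups = W.reshape(out_dim, in_dim // 4, 4)
    keep = groups.abs().topk(2, dim=-1).indices            # 2 survivors per group
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
    return (groups * mask).reshape(out_dim, in_dim)

W = torch.randn(4096, 4096)
W_sparse = prune_2_4(W)
assert (W_sparse == 0).float().mean().item() == 0.5        # exactly 50% zeros
```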

04-Sparsity (Structured)

Meta Title Cover Publish Code Note
FisherPruning A Fast Post-Training Pruning Framework for Transformers cover Publish GitHub Repo stars note
SIMPLE Structured Pruning for Efficient Generative Pre-trained Language Models cover Publish note
m Structural Pruning of Large Language Models via Neural Architecture Search cover Publish GitHub Repo stars
Deja Vu Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time cover Publish GitHub Repo stars
LoSparse Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation cover Publish GitHub Repo stars
ZipLM ZipLM: Inference-Aware Structured Pruning of Language Models cover Publish GitHub Repo stars
Compresso Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models cover Publish GitHub Repo stars note
KCM Gradient-Free Structured Pruning with Unlabeled Data cover Publish
K-pruning Knowledge-preserving Pruning for Pre-trained Language Models without Retraining cover Publish note
LLM-Pruner LLM-Pruner: On the Structural Pruning of Large Language Models cover Publish GitHub Repo stars note
LoRAShear LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery Publish
LLM-shearing Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning cover Publish GitHub Repo stars note
FLAP Fluctuation-based Adaptive Structured Pruning for Large Language Models cover Publish GitHub Repo stars
SliceGPT SliceGPT: Compress Large Language Models by Deleting Rows and Columns cover Publish GitHub Repo stars note
OSSCAR OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization Publish GitHub Repo stars note
SlimGPT SlimGPT: Layer-wise Structured Pruning for Large Language Models cover Publish note
Minitron Compact Language Models via Pruning and Knowledge Distillation cover Publish GitHub Repo stars note
Bonsai Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes Publish GitHub Repo stars
AdaSkip AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference cover Publish GitHub Repo stars note
SlimLLM SlimLLM: Accurate Structured Pruning for Large Language Models Publish note
SpecEE SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting cover Publish GitHub Repo stars note
LinearPatch A Simple Linear Patch Revives Layer-Pruned Large Language Models cover Publish note
FlexiDepth Adaptive Layer-skipping in Pre-trained LLMs cover Publish note
Mosaic Mosaic: Composite Projection Pruning for Resource-efficient LLMs cover Publish note
Cus-Prun Pruning General Large Language Models into Customized Expert Models cover Publish GitHub Repo stars note
SEAP SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models cover Publish GitHub Repo stars note
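Structured pruning removes whole units (heads, channels, layers) so the resulting matrices shrink physically. The sketch below drops FFN intermediate channels ranked by a simple L2-norm score; real structured pruners above (LLM-Pruner, SliceGPT, SlimGPT, ...) use gradient- or reconstruction-based importance, so this only illustrates the mechanics.

```python
# Minimal sketch of structured pruning of FFN intermediate channels.
import torch

def prune_ffn_channels(W_up, W_down, keep_ratio=0.75):
    """W_up: (d_ff, d_model), W_down: (d_model, d_ff). Remove low-importance channels."""
    importance = W_up.norm(dim=1)                          # one score per hidden channel
    k = int(W_up.shape[0] * keep_ratio)
    keep = torch.topk(importance, k).indices.sort().values
    return W_up[keep], W_down[:, keep]                     # physically smaller matrices

W_up, W_down = torch.randn(11008, 4096), torch.randn(4096, 11008)
W_up_s, W_down_s = prune_ffn_channels(W_up, W_down)
```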

05-Sparse/Pruning

Meta Title Cover Publish Code Note
OBD Optimal Brain Damage Publish
OBS Optimal Brain Surgeon and general network pruning Publish
DSD DSD: Dense-Sparse-Dense Training for Deep Neural Networks Publish
L-OBS Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon Publish GitHub Repo stars
ADMM-pruning A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers Publish GitHub Repo stars
m Fast Sparse ConvNets Publish GitHub Repo stars
m Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference Publish
Movement Pruning Movement Pruning: Adaptive Sparsity by Fine-Tuning cover Publish GitHub Repo stars
blocksparse GPU Kernels for Block-Sparse Weights Publish GitHub Repo stars
OpenVINO Post-training deep neural network pruning via layer-wise calibration Publish
SR-STE Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch cover Publish GitHub Repo stars
m Channel Permutations for N:M Sparsity Publish GitHub Repo stars
NMSparse Accelerating Sparse Deep Neural Networks Publish
m Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks Publish
m Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm Publish GitHub Repo stars
TextPruner TextPruner: A Model Pruning Toolkit for Pre-Trained Language Models cover Publish GitHub Repo stars
m Creating Sparse GPT-3 Models with Iterative Pruning Publish
SPDY SPDY: Accurate Pruning with Speedup Guarantees cover Publish GitHub Repo stars note
Sprint Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation Publish
FisherPruning A Fast Post-Training Pruning Framework for Transformers cover Publish GitHub Repo stars note
OBC Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning Publish GitHub Repo stars
Complementary Sparsity Two Sparsities Are Better Than One: Unlocking the Performance Benefits of Sparse-Sparse Networks cover Publish note
DSA Transformer Acceleration with Dynamic Sparse Attention cover Publish note
STA An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers Publish
oBERT The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models Publish GitHub Repo stars
Diffuser Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences cover Publish GitHub Repo stars
GRAIN Gradient-based Intra-attention Pruning on Pre-trained Language Models cover Publish GitHub Repo stars note
SMP Pruning Pre-trained Language Models Without Fine-Tuning cover Publish GitHub Repo stars
PINS Pruning Pre-trained Language Models with Principled Importance and Self-regularization Publish GitHub Repo stars
SIMPLE Structured Pruning for Efficient Generative Pre-trained Language Models cover Publish note
m Structural Pruning of Large Language Models via Neural Architecture Search cover Publish GitHub Repo stars
SparseViT SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer cover Publish GitHub Repo stars note
TorchSparse++ TorchSparse++: Efficient Point Cloud Engine Publish GitHub Repo stars
MVUE Minimum Variance Unbiased N:M Sparsity for the Neural Gradients Publish
m The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers Publish
Deja Vu Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time cover Publish GitHub Repo stars
SparseGPT SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot Publish GitHub Repo stars
LoSparse Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation cover Publish GitHub Repo stars
nmSPARSE Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning Publish GitHub Repo stars
ZipLM ZipLM: Inference-Aware Structured Pruning of Language Models cover Publish GitHub Repo stars
VENOM VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores cover Publish GitHub Repo stars note
SPDF SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models Publish
GBLM-Pruner Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models cover Publish GitHub Repo stars note
Compresso Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models cover Publish GitHub Repo stars note
Adaptively Sparse Attention Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers cover Publish
KCM Gradient-Free Structured Pruning with Unlabeled Data cover Publish
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
K-pruning Knowledge-preserving Pruning for Pre-trained Language Models without Retraining cover Publish note
LLM in a flash LLM in a flash: Efficient Large Language Model Inference with Limited Memory cover Publish note
LLM-Pruner LLM-Pruner: On the Structural Pruning of Large Language Models cover Publish GitHub Repo stars note
LoRAShear LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery Publish
PowerInfer PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU Publish GitHub Repo stars note
GBDT Pruning Large Language Models via Accuracy Predictor cover Publish
LLM-shearing Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning cover Publish GitHub Repo stars note
SquareHead Sparse Fine-tuning for Inference Acceleration of Large Language Models cover Publish GitHub Repo stars
Sparse-IFT Sparse Iso-FLOP Transformations for Maximizing Training Efficiency Publish GitHub Repo stars
SMS Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging cover Publish GitHub Repo stars
m Ten Lessons We Have Learned in the New Sparseland: A Short Handbook for Sparse Neural Network Researchers Publish
Essential Sparsity The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter Publish GitHub Repo stars
Selective Context Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering cover Publish GitHub Repo stars
FLAP Fluctuation-based Adaptive Structured Pruning for Large Language Models cover Publish GitHub Repo stars
CATS CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models cover Publish GitHub Repo stars note
SparseInfer SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference Publish note
SCAP Post-Training Statistical Calibration for Higher Activation Sparsity cover Publish GitHub Repo stars note
Wanda A Simple and Effective Pruning Approach for Large Language Models cover Publish GitHub Repo stars note
LLM-KICK Compressing LLMs: The Truth is Rarely Pure and Never Simple cover Publish GitHub Repo stars note
DSnoT Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs cover Publish GitHub Repo stars note
streaming-llm Efficient Streaming Language Models with Attention Sinks cover Publish GitHub Repo stars note
RIA Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models cover Publish GitHub Repo stars
SAS SAS: Structured Activation Sparsification cover Publish GitHub Repo stars note
SliceGPT SliceGPT: Compress Large Language Models by Deleting Rows and Columns cover Publish GitHub Repo stars note
ReLU Strikes Back ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models cover Publish GitHub Repo stars
m Accelerating Transformer Pre-training with 2:4 Sparsity Publish GitHub Repo stars note
OSSCAR OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization Publish GitHub Repo stars note
OWL Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity cover Publish GitHub Repo stars
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
SPP SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models cover Publish GitHub Repo stars note
SparQ SparQ Attention: Bandwidth-Efficient LLM Inference Publish note
Sparse-IFT Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency cover Publish GitHub Repo stars note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
MaskLLM MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models cover Publish GitHub Repo stars note
SlimGPT SlimGPT: Layer-wise Structured Pruning for Large Language Models cover Publish note
SparseLLM SparseLLM: Towards Global Pruning for Pre-trained Language Models Publish GitHub Repo stars note
ADMM-pruning Fast and Effective Weight Update for Pruned Large Language Models Publish GitHub Repo stars note
Flash-LLM Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity cover Publish GitHub Repo stars note
PWGG5HBE A Survey on Large Language Model Acceleration based on KV Cache Management cover Publish GitHub Repo stars note
AVSS AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis cover Publish note
AdaKV Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference cover Publish GitHub Repo stars note
m Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs cover Publish note
SharedAttention Beyond KV Caching: Shared Attention for Efficient LLMs cover Publish GitHub Repo stars note
Minitron Compact Language Models via Pruning and Knowledge Distillation cover Publish GitHub Repo stars note
CoreInfer CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation cover Publish GitHub Repo stars note
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
m Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment Publish GitHub Repo stars note
Bonsai Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes Publish GitHub Repo stars
FlashMask FlashMask: Efficient and Rich Mask Extension of FlashAttention cover Publish GitHub Repo stars note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
CHESS Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification Publish Pytorch note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note
PowerInfer-2 PowerInfer-2: Fast Large Language Model Inference on a Smartphone Publish Website note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note
Q-Sparse Q-Sparse: All Large Language Models can be Fully Sparsely-Activated cover Publish note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
Recycled Attention Recycled Attention: Efficient inference for long-context language models cover Publish GitHub Repo stars note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
ShadowLLM ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models cover Publish GitHub Repo stars note
SnapKV SnapKV: LLM Knows What You are Looking for Before Generation cover Publish GitHub Repo stars note
TOVA Transformers are Multi-State RNNs cover Publish GitHub Repo stars note
Turbo Sparse Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters Publish Pytorch note
ZigZagKV ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty cover Publish note
AdaSkip AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference cover Publish GitHub Repo stars note
AdaptiveSparseTrainer Pruning Large Language Models with Semi-Structural Adaptive Sparse Training cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
SDS Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism cover Publish note
FlexPrefill FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference cover Publish GitHub Repo stars note
R-Sparse R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference cover Publish GitHub Repo stars note
ReAttention ReAttention: Training-Free Infinite Context with Finite Attention Scope cover Publish GitHub Repo stars note
TEAL Training-Free Activation Sparsity in Large Language Models cover Publish GitHub Repo stars note
AdaSplash AdaSplash: Adaptive Sparse Flash Attention Publish GitHub Repo stars note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note
CateKV CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration cover Publish note
PoD Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity cover Publish note
HashAttention HashAttention: Semantic Sparsity for Faster Inference cover Publish GitHub Repo stars note
LaRoSA La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation cover Publish note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
SlimLLM SlimLLM: Accurate Structured Pruning for Large Language Models Publish note
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
SparsingLaw Sparsing Law: Towards Large Language Models with Greater Activation Sparsity cover Publish GitHub Repo stars note
XAttention XAttention: Block Sparse Attention with Antidiagonal Scoring cover Publish GitHub Repo stars note
SpecEE SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting cover Publish GitHub Repo stars note
LinearPatch A Simple Linear Patch Revives Layer-Pruned Large Language Models cover Publish note
Acc-SpMM Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores cover Publish note
07NWF4VE Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching cover Publish note
SharePrefill Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing cover Publish note
ACP Adaptive Computation Pruning for the Forgetting Transformer Publish GitHub Repo stars note
FlexiDepth Adaptive Layer-skipping in Pre-trained LLMs cover Publish note
AhaKV AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models Publish note
AmberPruner Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models Publish note
AttentionPredictor AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference cover Publish note
ChunkKV ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference cover Publish note
DBudgetKV DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance cover Publish note
DeltaAttention Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction cover Publish note
DeltaLLM DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference cover Publish note
RaaS Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity cover Publish note
topk-decoding Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs cover Publish GitHub Repo stars note
FastKV FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note
HATA HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference cover Publish GitHub Repo stars note
HCAttention HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs cover Publish note
IFPruning Instruction-Following Pruning for Large Language Models cover Publish note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note
LeanK LeanK: Learnable K Cache Channel Pruning for Efficient Decoding cover Publish GitHub Repo stars note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note
MoSA Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing cover Publish GitHub Repo stars note
MoBA MoBA: Mixture of Block Attention for Long-Context LLMs cover Publish GitHub Repo stars note
Mosaic Mosaic: Composite Projection Pruning for Resource-efficient LLMs cover Publish note
PowerAttention PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention cover Publish note
PSA Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving cover Publish GitHub Repo stars note
Cus-Prun Pruning General Large Language Models into Customized Expert Models cover Publish GitHub Repo stars note
QuickSilver QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization Publish note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
0VRXJQ3F Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving cover Publish GitHub Repo stars note
SALE SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling cover Publish GitHub Repo stars note
SEAP SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note
Awesome-Efficient-Arch Speed Always Wins: A Survey on Efficient Architectures for Large Language Models cover Publish GitHub Repo stars note
SpindleKV SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers cover Publish GitHub Repo stars note
Task-KV Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads cover Publish note
Super-Experts-Profilling Unveiling Super Experts in Mixture-of-Experts Large Language Models cover Publish GitHub Repo stars note
attention-gym Attention-Gym: Triton-Based Sparse and Quantization Attention Publish GitHub Repo stars note
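For reference, the second-order saliency that the classical entries above (OBD, OBS, L-OBS, oBERT, and the OBC/SparseGPT line) build on can be stated compactly; the notation below is the standard textbook form, not copied from any one of the listed papers.

```latex
% With loss $L$, weights $w$, and Hessian $H = \partial^2 L / \partial w^2$,
% OBD assumes a diagonal Hessian and scores weight $w_q$ as
\[
  s_q \;=\; \tfrac{1}{2}\, H_{qq}\, w_q^2 ,
\]
% while OBS removes $w_q$ and optimally updates the remaining weights:
\[
  \delta w \;=\; -\,\frac{w_q}{[H^{-1}]_{qq}}\, H^{-1} e_q ,
  \qquad
  s_q \;=\; \frac{w_q^2}{2\,[H^{-1}]_{qq}} .
\]
```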

06-Quantization

Meta Title Cover Publish Code Note
Deep Compression Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding Publish
ActNN ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training Publish GitHub Repo stars
BRECQ BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction Publish GitHub Repo stars
GPFQ A Greedy Algorithm for Quantizing Neural Networks Publish GitHub Repo stars
OBC Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning Publish GitHub Repo stars
ZeroQuant ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers Publish GitHub Repo stars
GPTQ GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers Publish GitHub Repo stars
LoftQ LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models cover Publish GitHub Repo stars note
OmniQuant OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models cover Publish GitHub Repo stars
GPFQv2 Post-training Quantization for Neural Networks with Provable Guarantees Publish GitHub Repo stars
QLoRA QLoRA: Efficient Finetuning of Quantized LLMs cover Publish GitHub Repo stars
QuIP QuIP: Quantization with Incoherence Processing Publish GitHub Repo stars
RPTQ RPTQ: Reorder-based Post-training Quantization for Large Language Models Publish GitHub Repo stars note
SpQR SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression Publish GitHub Repo stars
m Training Transformers with 4-bit Integers Publish GitHub Repo stars
ZeroQuant-V2 ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation Publish GitHub Repo stars
QA-LoRA QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models cover Publish GitHub Repo stars note
FrameQuant FrameQuant: Flexible Low-Bit Quantization for Transformers cover Publish GitHub Repo stars note
SqueezeLLM SqueezeLLM: Dense-and-Sparse Quantization cover Publish GitHub Repo stars note
AWQ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration Publish GitHub Repo stars
KVQuant KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization Publish GitHub Repo stars note
L4Q L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ cover Publish note
MiniKV MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache cover Publish note
Q-Sparse Q-Sparse: All Large Language Models can be Fully Sparsely-Activated cover Publish note
QServe QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Publish Pytorch note
SageAttention2 SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization Publish GitHub Repo stars note
SageAttention SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration Publish GitHub Repo stars note
COMET COMET: Towards Practical W4A4KV4 LLMs Serving cover Publish note
TorchAO TorchAO: PyTorch-Native Training-to-Serving Model Optimization cover Publish GitHub Repo stars note
CCQ CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs Publish note
SageAttention3 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Publish GitHub Repo stars note
attention-gym Attention-Gym: Triton-Based Sparse and Quantization Attention Publish GitHub Repo stars note
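The baseline that most post-training quantization entries above improve on is plain round-to-nearest symmetric quantization with a per-channel scale. The sketch below shows that baseline only, with no calibration or error compensation; shapes and the INT8 target are illustrative assumptions.

```python
# Minimal sketch of symmetric per-output-channel INT8 weight quantization.
import torch

def quantize_int8(W):
    """W: (out, in). Returns int8 weights plus one fp32 scale per output channel."""
    scale = W.abs().amax(dim=1, keepdim=True) / 127.0       # symmetric range per row
    q = torch.clamp(torch.round(W / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

W = torch.randn(4096, 4096)
q, scale = quantize_int8(W)
err = (dequantize(q, scale) - W).abs().mean()               # round-to-nearest error
```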

07-Communication-Computation Overlap

Meta Title Cover Publish Code Note
CoCoNet Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads cover Publish GitHub Repo stars note
XLA Overlap Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models cover Publish note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note
T3 T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives cover Publish note
DistributedGEMM A novel CUTLASS-based implementation of Tensor Parallelism for NVLink-enabled systems cover Publish GitHub Repo stars note
Domino Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping cover Publish GitHub Repo stars note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
Async-TP [Distributed w/ TorchTitan] Introducing Async Tensor Parallelism in PyTorch cover Publish GitHub Repo stars note
NanoFlow NanoFlow: Towards Optimal Large Language Model Serving Throughput cover Publish GitHub Repo stars note
UC0D8DJ6 Characterizing Communication Patterns in Distributed Large Language Model Inference Publish note
1DZIJVBI Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications cover Publish note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
FlashOverlap FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation cover Publish GitHub Repo stars note
HelixParallelism Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding Publish note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note
Seesaw Seesaw: High-throughput LLM Inference via Model Re-sharding cover Publish note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
TokenWeave TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note
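The common thread in this section is hiding collective communication behind independent computation. A minimal sketch of that pattern with an asynchronous all-reduce is below; it assumes the script is launched with torchrun so the default process group can initialize, and the tensor sizes are illustrative. The fused-kernel systems listed above go much further than this handle-based overlap.

```python
# Minimal sketch of communication-computation overlap via an async all-reduce.
import torch
import torch.distributed as dist

def overlapped_step(partial_grad, independent_input, weight):
    handle = dist.all_reduce(partial_grad, async_op=True)   # launch communication
    local_out = independent_input @ weight                   # independent compute overlaps it
    handle.wait()                                            # block only when the result is needed
    return local_out, partial_grad

if __name__ == "__main__":
    dist.init_process_group(backend="gloo")                  # e.g. via `torchrun --nproc_per_node=2 ...`
    g = torch.randn(1024, 1024)
    x, w = torch.randn(64, 1024), torch.randn(1024, 1024)
    out, g_reduced = overlapped_step(g, x, w)
```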

08-Performance Modeling

Meta Title Cover Publish Code Note
Vidur Vidur: A Large-Scale Simulation Framework For LLM Inference cover Publish GitHub Repo stars note
APEX APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving Publish GitHub Repo stars note
AMALI AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs cover Publish note
LIMINAL Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need Publish note
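The simulators and analytical models above refine estimates of the kind sketched here: per-token decode latency is bounded below by the larger of compute time and memory-traffic time (a roofline-style bound). The hardware numbers and the 2-FLOPs-per-parameter rule of thumb below are stand-in assumptions, not measurements.

```python
# Minimal sketch of a roofline-style per-token decode latency estimate.
def decode_latency_s(n_params, kv_bytes, peak_flops, mem_bw_bytes_s):
    flops = 2 * n_params                      # ~2 FLOPs per parameter per generated token
    bytes_moved = 2 * n_params + kv_bytes     # fp16 weights plus KV-cache reads
    return max(flops / peak_flops, bytes_moved / mem_bw_bytes_s)

# e.g. a 7B-parameter model on a GPU with 300 TFLOP/s compute and 2 TB/s HBM bandwidth:
t = decode_latency_s(7e9, kv_bytes=2e9, peak_flops=3e14, mem_bw_bytes_s=2e12)
print(f"~{t * 1e3:.2f} ms/token (memory-bound estimate)")
```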

09-LLM Deployment

Meta Title Cover Publish Code Note
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
m Efficient Guided Generation for Large Language Models cover Publish note
ChunkAttention ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition cover Publish GitHub Repo stars note
CachedAttention Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention cover Publish note
EAGLE EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty cover Publish GitHub Repo stars note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note
YS9YTT55 LLM Inference Serving: Survey of Recent Advances and Opportunities Publish note
XGrammar XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models cover Publish GitHub Repo stars note
POD-Attention POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference cover Publish GitHub Repo stars note
vAttention vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention cover Publish GitHub Repo stars note
HelixParallelism Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding Publish note
Adrenaline Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation cover Publish GitHub Repo stars note
MIRAGE MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving Publish note
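PagedAttention-style serving systems manage the KV cache as a pool of fixed-size blocks plus a per-sequence block table, so memory is allocated on demand and freed without fragmentation. The sketch below shows only that bookkeeping; block size, pool size, and class names are illustrative assumptions rather than vLLM's actual data structures.

```python
# Minimal sketch of block-table bookkeeping for a paged KV cache.
BLOCK_SIZE = 16

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # pool of physical blocks
        self.block_tables = {}                       # seq_id -> [physical block ids]
        self.seq_lens = {}

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:                 # current block full (or none yet)
            table.append(self.free_blocks.pop())     # allocate one more physical block
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % BLOCK_SIZE        # where this token's KV lands

    def free(self, seq_id):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=1024)
for _ in range(40):
    block, offset = cache.append_token(seq_id=0)     # 40 tokens -> 3 blocks of 16
cache.free(0)
```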

10-Survey

Meta Title Cover Publish Code Note
m Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks Publish
m Efficient Methods for Natural Language Processing: A Survey cover Publish
m A Survey on Evaluation of Large Language Models cover Publish
m A Survey on Model Compression for Large Language Models cover Publish
m Ten Lessons We Have Learned in the New Sparseland: A Short Handbook for Sparse Neural Network Researchers Publish
m Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption cover Publish GitHub Repo stars note
LLM-KICK Compressing LLMs: The Truth is Rarely Pure and Never Simple cover Publish GitHub Repo stars note
m A Survey on Efficient Inference for Large Language Models cover Publish note
068ZPAME A Survey on Inference Optimization Techniques for Mixture of Experts Models cover Publish GitHub Repo stars note
PWGG5HBE A Survey on Large Language Model Acceleration based on KV Cache Management cover Publish GitHub Repo stars note
YS9YTT55 LLM Inference Serving: Survey of Recent Advances and Opportunities Publish note
massive-activations Massive Activations in Large Language Models cover Publish GitHub Repo stars note
209M5GA7 KV Cache Compression for Inference Efficiency in LLMs: A Review Publish note
52A7RO95 Mixture of Experts in Large Language Models cover Publish note
0VRXJQ3F Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving cover Publish GitHub Repo stars note
Awesome-Efficient-Arch Speed Always Wins: A Survey on Efficient Architectures for Large Language Models cover Publish GitHub Repo stars note
sparse-frontier The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs cover Publish GitHub Repo stars note

11-Network Structure Design

Meta Title Cover Publish Code Note
CodeGeeX CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X Publish GitHub Repo stars note
068ZPAME A Survey on Inference Optimization Techniques for Mixture of Experts Models cover Publish GitHub Repo stars note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
MoD Mixture-of-Depths: Dynamically allocating compute in transformer-based language models cover Publish note
MFA Multi-matrix Factorization Attention Publish note
ReMoE ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing cover Publish GitHub Repo stars note
CLA Reducing Transformer Key-Value Cache Size with Cross-Layer Attention cover Publish GitHub Repo stars note
AdaSkip AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
FoX Forgetting Transformer: Softmax Attention with a Forget Gate Publish GitHub Repo stars note
RecursiveTransformers Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA cover Publish note
SpecEE SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting cover Publish GitHub Repo stars note
MoE-MLA-RoPE Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note
2ZU1IWL6 Fast and Simplex: 2-Simplicial Attention in Triton Publish note
GLA Hardware-Efficient Attention for Fast Decoding cover Publish GitHub Repo stars note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note
52A7RO95 Mixture of Experts in Large Language Models cover Publish note
MoSA Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing cover Publish GitHub Repo stars note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note
MoBA MoBA: Mixture of Block Attention for Long-Context LLMs cover Publish GitHub Repo stars note
PanguUltra Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs Publish note
Qwen3 Qwen3 Technical Report cover Publish GitHub Repo stars note
Awesome-Efficient-Arch Speed Always Wins: A Survey on Efficient Architectures for Large Language Models cover Publish GitHub Repo stars note
Step-3 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding Publish note
Super-Experts-Profilling Unveiling Super Experts in Mixture-of-Experts Large Language Models cover Publish GitHub Repo stars note
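Many of the architectures above are Mixture-of-Experts designs built around top-k routing: a small router scores experts per token and only the top-k experts run. The sketch below shows that routing skeleton with a dense per-expert loop for clarity; dimensions, top_k, and the lack of capacity limits or load-balancing losses are illustrative simplifications.

```python
# Minimal sketch of a top-k routed Mixture-of-Experts layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)        # routing probabilities
        weights, experts = gate.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                  # dense loop, for clarity only
            for e in range(len(self.experts)):
                mask = experts[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

y = TopKMoE()(torch.randn(32, 512))
```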

12-Low Rank Decomposition

Meta Title Cover Publish Code Note
LoRA LoRA: Low-rank adaptation of large language models cover Publish GitHub Repo stars
AdaLoRA AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning cover Publish GitHub Repo stars
LoSparse Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation cover Publish GitHub Repo stars
LoRAShear LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery Publish
LoftQ LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models cover Publish GitHub Repo stars note
QLoRA QLoRA: Efficient Finetuning of Quantized LLMs cover Publish GitHub Repo stars
QA-LoRA QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models cover Publish GitHub Repo stars note
L4Q L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ cover Publish note
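The LoRA family above shares one forward rule: keep the pretrained weight frozen and add a low-rank update, y = W x + (alpha / r) · B A x, with only A and B trainable. A minimal sketch is below; the init convention (A small random, B zero) and sizes are the common defaults, stated here as assumptions.

```python
# Minimal sketch of a LoRA adapter wrapped around a frozen linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                         # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: starts as identity map
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(4096, 4096), r=8, alpha=16)
y = layer(torch.randn(2, 4096))
```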

13-KV Cache Optimization/Efficient Attention

Meta Title Cover Publish Code Note
DSA Transformer Acceleration with Dynamic Sparse Attention cover Publish note
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
Flash-Decoding Flash-Decoding for long-context inference Publish note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
ChunkAttention ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition cover Publish GitHub Repo stars note
CachedAttention Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention cover Publish note
m Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption cover Publish GitHub Repo stars note
streaming-llm Efficient Streaming Language Models with Attention Sinks cover Publish GitHub Repo stars note
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
SparQ SparQ Attention: Bandwidth-Efficient LLM Inference Publish note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note
PWGG5HBE A Survey on Large Language Model Acceleration based on KV Cache Management cover Publish GitHub Repo stars note
AdaKV Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference cover Publish GitHub Repo stars note
SharedAttention Beyond KV Caching: Shared Attention for Efficient LLMs cover Publish GitHub Repo stars note
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
FlashMask FlashMask: Efficient and Rich Mask Extension of FlashAttention cover Publish GitHub Repo stars note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note
KVQuant KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization Publish GitHub Repo stars note
MiniKV MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache cover Publish note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
MFA Multi-matrix Factorization Attention Publish note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note
Recycled Attention Recycled Attention: Efficient inference for long-context language models cover Publish GitHub Repo stars note
CLA Reducing Transformer Key-Value Cache Size with Cross-Layer Attention cover Publish GitHub Repo stars note
SageAttention2 SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization Publish GitHub Repo stars note
SageAttention SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration Publish GitHub Repo stars note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
SnapKV SnapKV: LLM Knows What You are Looking for Before Generation cover Publish GitHub Repo stars note
TOVA Transformers are Multi-State RNNs cover Publish GitHub Repo stars note
ZigZagKV ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty cover Publish note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
COMET COMET: Towards Practical W4A4KV4 LLMs Serving cover Publish note
POD-Attention POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference cover Publish GitHub Repo stars note
vAttention vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention cover Publish GitHub Repo stars note
KVSink KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs cover Publish note
CacheBlend CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion cover Publish GitHub Repo stars note
FlexPrefill FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference cover Publish GitHub Repo stars note
ReAttention ReAttention: Training-Free Infinite Context with Finite Attention Scope cover Publish GitHub Repo stars note
AdaSplash AdaSplash: Adaptive Sparse Flash Attention Publish GitHub Repo stars note
CateKV CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration cover Publish note
PoD Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity cover Publish note
HashAttention HashAttention: Semantic Sparsity for Faster Inference cover Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
StarAttention Star Attention: Efficient LLM Inference over Long Sequences cover Publish GitHub Repo stars note
XAttention XAttention: Block Sparse Attention with Antidiagonal Scoring cover Publish GitHub Repo stars note
07NWF4VE Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching cover Publish note
SharePrefill Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing cover Publish note
ACP Adaptive Computation Pruning for the Forgetting Transformer Publish GitHub Repo stars note
AhaKV AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models Publish note
AttentionPredictor AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference cover Publish note
ChunkKV ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference cover Publish note
DBudgetKV DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance cover Publish note
DeltaAttention Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction cover Publish note
DeltaLLM DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference cover Publish note
RaaS Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity cover Publish note
topk-decoding Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs cover Publish GitHub Repo stars note
2ZU1IWL6 Fast and Simplex: 2-Simplicial Attention in Triton Publish note
FastKV FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note
FreqKV FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension cover Publish note
HATA HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference cover Publish GitHub Repo stars note
HCAttention HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs cover Publish note
GLA Hardware-Efficient Attention for Fast Decoding cover Publish GitHub Repo stars note
Adrenaline Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation cover Publish GitHub Repo stars note
209M5GA7 KV Cache Compression for Inference Efficiency in LLMs: A Review Publish note
KVLink KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse cover Publish GitHub Repo stars note
KeepKV KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference cover Publish note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note
LeanK LeanK: Learnable K Cache Channel Pruning for Efficient Decoding cover Publish GitHub Repo stars note
MIRAGE MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving Publish note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note
MoSA Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing cover Publish GitHub Repo stars note
MoBA MoBA: Mixture of Block Attention for Long-Context LLMs cover Publish GitHub Repo stars note
PowerAttention PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention cover Publish note
PSA Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving cover Publish GitHub Repo stars note
QuickSilver QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization Publish note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
0VRXJQ3F Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving cover Publish GitHub Repo stars note
SALE SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling cover Publish GitHub Repo stars note
SageAttention3 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note
SpindleKV SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers cover Publish GitHub Repo stars note
Task-KV Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads cover Publish note
KVCache-Factory Unified KV Cache Compression Methods for Auto-Regressive Models Publish GitHub Repo stars note
kvpress kvpress: LLM KV cache compression made easy Publish GitHub Repo stars note
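A large fraction of the KV-cache entries above are eviction policies: keep a recent window plus the keys that have accumulated the most attention mass, and drop the rest. The sketch below shows that selection step in the spirit of the heavy-hitter / SnapKV-style methods; the budget, window, and score bookkeeping are illustrative assumptions.

```python
# Minimal sketch of attention-score-based KV cache eviction.
import torch

def evict_kv(K, V, attn_history, budget=256, recent=32):
    """K, V: (T, d); attn_history: (T,) accumulated attention each key has received."""
    T = K.shape[0]
    if T <= budget:
        return K, V
    recent_idx = torch.arange(T - recent, T)                        # always keep the recent tail
    older_scores = attn_history[: T - recent]
    heavy_idx = torch.topk(older_scores, budget - recent).indices   # heavy hitters among older keys
    keep = torch.cat([heavy_idx, recent_idx]).sort().values
    return K[keep], V[keep]

K, V = torch.randn(1024, 128), torch.randn(1024, 128)
scores = torch.rand(1024)
K_small, V_small = evict_kv(K, V, scores)
```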

14-Layer Fusion (Reduce IO)

Meta Title Cover Publish Code Note
FlashAttention FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness cover Publish GitHub Repo stars
FlashAttention-2 FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning Publish GitHub Repo stars
GLA Hardware-Efficient Attention for Fast Decoding cover Publish GitHub Repo stars note
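The IO-aware kernels above avoid materializing the full score matrix by processing keys block by block with an online softmax. The sketch below reproduces that accumulation for a single query in plain PyTorch (no tiling in SRAM, no masking); block size and names are illustrative, and the final assert checks equivalence to dense attention.

```python
# Minimal sketch of online-softmax (blockwise) attention accumulation.
import torch

def online_softmax_attention(q, K, V, block=256):
    """q: (d,), K/V: (T, d). Equals softmax(qK^T / sqrt(d)) V, computed block by block."""
    d = q.shape[0]
    m = torch.tensor(float("-inf"))          # running max of scores
    l = torch.tensor(0.0)                    # running softmax normalizer
    acc = torch.zeros_like(q)                # running weighted sum of values
    for s in range(0, K.shape[0], block):
        scores = K[s:s + block] @ q / d ** 0.5
        m_new = torch.maximum(m, scores.max())
        correction = torch.exp(m - m_new)    # rescale previously accumulated partial results
        p = torch.exp(scores - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ V[s:s + block]
        m = m_new
    return acc / l

q, K, V = torch.randn(64), torch.randn(2048, 64), torch.randn(2048, 64)
ref = torch.softmax(K @ q / 64 ** 0.5, dim=0) @ V
assert torch.allclose(online_softmax_attention(q, K, V), ref, atol=1e-4)
```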

15-Efficient Training

Meta Title Cover Publish Code Note
LoRA LoRA: Low-rank adaptation of large language models cover Publish GitHub Repo stars
AdaLoRA AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning cover Publish GitHub Repo stars
MeZO Fine-Tuning Language Models with Just Forward Passes Publish GitHub Repo stars note
LoftQ LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models cover Publish GitHub Repo stars note
m Accelerating Transformer Pre-training with 2:4 Sparsity Publish GitHub Repo stars note
LoRA+ LoRA+: Efficient Low Rank Adaptation of Large Models Publish GitHub Repo stars note
SIFT Sparse is Enough in Fine-tuning Pre-trained Large Language Models Publish GitHub Repo stars note
Sparse-IFT Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency cover Publish GitHub Repo stars note
TinyTrain TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge Publish GitHub Repo stars note
LISA LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning Publish note
m Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark Publish GitHub Repo stars note
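The zeroth-order entries above (MeZO and the follow-up benchmark) estimate gradients from two forward passes at theta + eps*z and theta - eps*z, regenerating the same perturbation z from a shared seed so nothing has to be stored. The sketch below shows that update on a toy quadratic; the loss, step sizes, and seeding scheme are illustrative assumptions rather than the papers' training setups.

```python
# Minimal sketch of a MeZO-style (SPSA) zeroth-order parameter update.
import torch

@torch.no_grad()
def mezo_step(params, loss_fn, eps=1e-3, lr=1e-2, seed=0):
    def perturb(scale):
        torch.manual_seed(seed)                        # regenerate the same z each time
        for p in params:
            p.add_(scale * eps * torch.randn_like(p))
    perturb(+1); loss_plus = loss_fn()                 # L(theta + eps*z)
    perturb(-2); loss_minus = loss_fn()                # L(theta - eps*z)
    perturb(+1)                                        # restore theta
    grad_scale = (loss_plus - loss_minus) / (2 * eps)  # projected-gradient estimate
    torch.manual_seed(seed)
    for p in params:
        p.sub_(lr * grad_scale * torch.randn_like(p))  # step along -grad_scale * z

w = torch.nn.Parameter(torch.randn(16))
for step in range(200):
    mezo_step([w], lambda: (w ** 2).sum(), seed=step)
```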

16-Tool

Meta Title Cover Publish Code Note
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
FT FasterTransformer Publish GitHub Repo stars
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
TorchAO TorchAO: PyTorch-Native Training-to-Serving Model Optimization cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note
attention-gym Attention-Gym: Triton-Based Sparse and Quantization Attention Publish GitHub Repo stars note
KVCache-Factory Unified KV Cache Compression Methods for Auto-Regressive Models Publish GitHub Repo stars note
kvpress kvpress: LLM KV cache compression made easy Publish GitHub Repo stars note