| Name | Paper | Notes |
| --- | --- | --- |
| OBD | Optimal Brain Damage | |
| OBS | Optimal Brain Surgeon and general network pruning | |
| DSD | DSD: Dense-Sparse-Dense Training for Deep Neural Networks | |
| L-OBS | Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon | |
| ADMM-pruning | A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers | |
| m | Fast Sparse ConvNets | |
| m | Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference | |
| Movement Pruning | Movement Pruning: Adaptive Sparsity by Fine-Tuning | |
| blocksparse | GPU Kernels for Block-Sparse Weights | |
| OpenVINO | Post-training deep neural network pruning via layer-wise calibration | |
| SR-STE | Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch | see the 2:4 masking sketch below the table |
| m | Channel Permutations for N:M Sparsity | |
| NMSparse | Accelerating Sparse Deep Neural Networks | |
| m | Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks | |
| m | Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm | |
| TextPruner | TextPruner: A Model Pruning Toolkit for Pre-Trained Language Models | |
| m | Creating Sparse GPT-3 Models with Iterative Pruning | |
| SPDY | SPDY: Accurate Pruning with Speedup Guarantees | |
| Sprint | Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation | |
| FisherPruning | A Fast Post-Training Pruning Framework for Transformers | |
| OBC | Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning | |
| Complementary Sparsity | Two Sparsities Are Better Than One: Unlocking the Performance Benefits of Sparse-Sparse Networks | |
| DSA | Transformer Acceleration with Dynamic Sparse Attention | |
| STA | An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers | |
| oBERT | The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models | |
| Diffuser | Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences | |
| GRAIN | Gradient-based Intra-attention Pruning on Pre-trained Language Models | |
| SMP | Pruning Pre-trained Language Models Without Fine-Tuning | |
| PINS | Pruning Pre-trained Language Models with Principled Importance and Self-regularization | |
| SIMPLE | Structured Pruning for Efficient Generative Pre-trained Language Models | |
| m | Structural Pruning of Large Language Models via Neural Architecture Search | |
| SparseViT | SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer | |
| TorchSparse++ | TorchSparse++: Efficient Point Cloud Engine | |
| MVUE | Minimum Variance Unbiased N:M Sparsity for the Neural Gradients | |
| m | The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers | |
| Deja Vu | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | |
| SparseGPT | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | |
| LoSparse | Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation | |
| nmSPARSE | Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning | |
| ZipLM | ZipLM: Inference-Aware Structured Pruning of Language Models | |
| VENOM | VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores | |
| SPDF | SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models | |
| GBLM-Pruner | Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models | |
| Compresso | Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models | |
| Adaptively Sparse Attention | Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers | |
| KCM | Gradient-Free Structured Pruning with Unlabeled Data | |
| H2O | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | see the KV-eviction sketch below the table |
| K-pruning | Knowledge-preserving Pruning for Pre-trained Language Models without Retraining | |
| LLM in a flash | LLM in a flash: Efficient Large Language Model Inference with Limited Memory | |
| LLM-Pruner | LLM-Pruner: On the Structural Pruning of Large Language Models | |
| LoRAShear | LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery | |
| PowerInfer | PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | |
| GBDT | Pruning Large Language Models via Accuracy Predictor | |
| LLM-shearing | Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | |
| SquareHead | Sparse Fine-tuning for Inference Acceleration of Large Language Models | |
| Sparse-IFT | Sparse Iso-FLOP Transformations for Maximizing Training Efficiency | |
| SMS | Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging | |
| m | Ten Lessons We Have Learned in the New Sparseland: A Short Handbook for Sparse Neural Network Researchers | |
| Essential Sparsity | The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter | |
| Selective Context | Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering | |
| FLAP | Fluctuation-based Adaptive Structured Pruning for Large Language Models | |
| CATS | CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models | |
| SparseInfer | SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference | |
| SCAP | Post-Training Statistical Calibration for Higher Activation Sparsity | |
| Wanda | A Simple and Effective Pruning Approach for Large Language Models | see the scoring sketch below the table |
| LLM-KICK | Compressing LLMs: The Truth is Rarely Pure and Never Simple | |
| DSnoT | Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs | |
| streaming-llm | Efficient Streaming Language Models with Attention Sinks | |
| RIA | Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models | |
| SAS | SAS: Structured Activation Sparsification | |
| SliceGPT | SliceGPT: Compress Large Language Models by Deleting Rows and Columns | |
| ReLU Strikes Back | ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models | |
| m | Accelerating Transformer Pre-training with 2:4 Sparsity | |
| OSSCAR | OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization | |
| OWL | Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity | |
| Quest | Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | |
| SPP | SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models | |
| SparQ | SparQ Attention: Bandwidth-Efficient LLM Inference | |
| Sparse-IFT | Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency | |
| MInference | MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention | |
| MaskLLM | MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models | |
| SlimGPT | SlimGPT: Layer-wise Structured Pruning for Large Language Models | |
| SparseLLM | SparseLLM: Towards Global Pruning for Pre-trained Language Models | |
| ADMM-pruning | Fast and Effective Weight Update for Pruned Large Language Models | |
| Flash-LLM | Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | |
| PWGG5HBE | A Survey on Large Language Model Acceleration based on KV Cache Management | |
| AVSS | AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis | |
| AdaKV | Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference | |
| m | Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs | |
| SharedAttention | Beyond KV Caching: Shared Attention for Efficient LLMs | |
| Minitron | Compact Language Models via Pruning and Knowledge Distillation | |
| CoreInfer | CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation | |
| DuoAttention | DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads | |
| m | Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment | |
| Bonsai | Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes | |
| FlashMask | FlashMask: Efficient and Rich Mask Extension of FlashAttention | |
| MoA | MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression | |
| CHESS | Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification | PyTorch |
| DoubleSparsity | Post-Training Sparse Attention with Double Sparsity | |
| PowerInfer-2 | PowerInfer-2: Fast Large Language Model Inference on a Smartphone | Website |
| ProSparse | ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models | |
| Q-Sparse | Q-Sparse: All Large Language Models can be Fully Sparsely-Activated | |
| ReLU2 | ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs | |
| Recycled Attention | Recycled Attention: Efficient inference for long-context language models | |
| SampleAttention | SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention | |
| SeerAttention | SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs | |
| ShadowLLM | ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models | |
| SnapKV | SnapKV: LLM Knows What You are Looking for Before Generation | |
| TOVA | Transformers are Multi-State RNNs | |
| Turbo Sparse | Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters | PyTorch |
| ZigZagKV | ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty | |
| AdaSkip | AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference | |
| AdaptiveSparseTrainer | Pruning Large Language Models with Semi-Structural Adaptive Sparse Training | |
| NSA | Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention | |
| SDS | Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism | |
| FlexPrefill | FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference | |
| R-Sparse | R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference | |
| ReAttention | ReAttention: Training-Free Infinite Context with Finite Attention Scope | |
| TEAL | Training-Free Activation Sparsity in Large Language Models | |
| AdaSplash | AdaSplash: Adaptive Sparse Flash Attention | |
| BaWA | BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation | |
| CateKV | CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration | |
| PoD | Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity | |
| HashAttention | HashAttention: Semantic Sparsity for Faster Inference | |
| LaRoSA | La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation | |
| MMInference | MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention | |
| ShadowKV | ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference | |
| SlimLLM | SlimLLM: Accurate Structured Pruning for Large Language Models | |
| SpargeAttn | SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference | |
| SparsingLaw | Sparsing Law: Towards Large Language Models with Greater Activation Sparsity | |
| XAttention | XAttention: Block Sparse Attention with Antidiagonal Scoring | |
| SpecEE | SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting | |
| LinearPatch | A Simple Linear Patch Revives Layer-Pruned Large Language Models | |
| Acc-SpMM | Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores | |
| 07NWF4VE | Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching | |
| SharePrefill | Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing | |
| ACP | Adaptive Computation Pruning for the Forgetting Transformer | |
| FlexiDepth | Adaptive Layer-skipping in Pre-trained LLMs | |
| AhaKV | AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models | |
| AmberPruner | Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models | |
| AttentionPredictor | AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference | |
| ChunkKV | ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference | |
| DBudgetKV | DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance | |
| DeltaAttention | Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction | |
| DeltaLLM | DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference | |
| RaaS | Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity | |
| topk-decoding | Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs | |
| FastKV | FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation | |
| FlashInfer | FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving | |
| HATA | HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference | |
| HCAttention | HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs | |
| IFPruning | Instruction-Following Pruning for Large Language Models | |
| LServe | LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention | |
| LeanK | LeanK: Learnable K Cache Channel Pruning for Efficient Decoding | |
| MiniCPM4 | MiniCPM4: Ultra-Efficient LLMs on End Devices | |
| MoSA | Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing | |
| MoBA | MoBA: Mixture of Block Attention for Long-Context LLMs | |
| Mosaic | Mosaic: Composite Projection Pruning for Resource-efficient LLMs | |
| PowerAttention | PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention | |
| PSA | Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving | |
| Cus-Prun | Pruning General Large Language Models into Customized Expert Models | |
| QuickSilver | QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization | |
| R-KV | R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration | |
| RadialAttention | Radial Attention: Sparse Attention with Energy Decay for Long Video Generation | |
| ReSA | Rectified Sparse Attention | |
| 0VRXJQ3F | Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving | |
| SALE | SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling | |
| SEAP | SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models | |
| SeerAttention-R | SeerAttention-R: Sparse Attention Adaptation for Long Reasoning | |
| Awesome-Efficient-Arch | Speed Always Wins: A Survey on Efficient Architectures for Large Language Models | |
| SpindleKV | SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers | |
| Task-KV | Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads | |
| Super-Experts-Profilling | Unveiling Super Experts in Mixture-of-Experts Large Language Models | |
| attention-gym | Attention-Gym: Triton-Based Sparse and Quantization Attention | |
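
Several entries in the table (SR-STE, NMSparse, VENOM, MaskLLM, Amber Pruner) build on hardware-friendly N:M sparsity, where at most N weights survive in every aligned group of M. The snippet below is a minimal, illustrative sketch of one-shot 2:4 magnitude masking in PyTorch; it is not the implementation from any of the listed papers, and the function name `nm_prune` is ours.

```python
import torch

def nm_prune(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every aligned group of m
    along the input dimension, zeroing the rest (one-shot, no retraining)."""
    out_features, in_features = weight.shape
    assert in_features % m == 0, "input dim must be divisible by m"
    groups = weight.abs().reshape(out_features, in_features // m, m)
    keep = groups.topk(n, dim=-1).indices                   # n winners per group
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
    return weight * mask.reshape(out_features, in_features)

w = torch.randn(8, 16)
w_24 = nm_prune(w)   # every aligned group of 4 inputs now has exactly 2 non-zeros
```

SR-STE trains through such masks from scratch, while MaskLLM makes the mask itself learnable; the masking primitive stays the same.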
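
Wanda (and relatives listed above such as RIA and GBLM-Pruner) replaces pure magnitude pruning with a score that also reflects how strongly each input channel is activated on a small calibration set. Below is a minimal sketch of that idea, assuming a single linear layer's weight and pre-collected calibration activations; the helper names are ours, not the authors' code.

```python
import torch

def wanda_scores(weight: torch.Tensor, calib_acts: torch.Tensor) -> torch.Tensor:
    """Importance of each weight as |W_ij| * ||X_j||_2, where the norm runs over
    calibration tokens for input channel j (the Wanda-style criterion)."""
    act_norm = calib_acts.float().norm(p=2, dim=0)       # (in_features,)
    return weight.abs() * act_norm.unsqueeze(0)          # (out_features, in_features)

def prune_rowwise(weight: torch.Tensor, scores: torch.Tensor, sparsity: float = 0.5):
    """Zero the lowest-scoring `sparsity` fraction of weights in each output row."""
    k = int(weight.shape[1] * sparsity)
    thresh = scores.kthvalue(k, dim=1, keepdim=True).values
    return torch.where(scores > thresh, weight, torch.zeros_like(weight))

w = torch.randn(256, 512)
x = torch.randn(1024, 512)          # calibration activations feeding this layer
w_sparse = prune_rowwise(w, wanda_scores(w, x), sparsity=0.5)
```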
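
Many of the KV-cache entries above (H2O, SnapKV, Ada-KV, R-KV, and others) score cached tokens and evict the least useful ones under a fixed budget. The sketch below shows the heavy-hitter flavour of that idea for a single attention head: keep a window of recent tokens plus the positions with the largest accumulated attention mass. It is a simplified illustration rather than any specific paper's pipeline, and `accumulated_attn` is an assumed input (the running column-sum of past attention weights).

```python
import torch

def heavy_hitter_keep(accumulated_attn: torch.Tensor, budget: int, recent: int) -> torch.Tensor:
    """Return the cache positions to keep for one head: the `recent` newest tokens
    plus the heaviest hitters by accumulated attention score, `budget` in total."""
    seq_len = accumulated_attn.shape[0]
    if seq_len <= budget:
        return torch.arange(seq_len)
    recent_idx = torch.arange(seq_len - recent, seq_len)
    scores = accumulated_attn.clone()
    scores[recent_idx] = float("-inf")                     # recents are kept regardless
    heavy_idx = scores.topk(budget - recent).indices
    return torch.sort(torch.cat([heavy_idx, recent_idx])).values

acc = torch.rand(4096)              # accumulated attention received by each cached token
keep = heavy_hitter_keep(acc, budget=512, recent=64)   # 512 positions survive eviction
```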