| Tag | Title | Code |
| --- | --- | --- |
| blocksparse | GPU Kernels for Block-Sparse Weights |  |
| NMSparse | Accelerating Sparse Deep Neural Networks |  |
| m | Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks |  |
| oBERT | The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models |  |
| m | A Survey on Evaluation of Large Language Models |  |
| m | A Survey on Model Compression for Large Language Models |  |
| GBLM-Pruner | Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models |  |
| CodeGeeX | CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X |  |
| Compresso | Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models |  |
| Adaptively Sparse Attention | Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers |  |
| m | Efficient Guided Generation for Large Language Models |  |
| MeZO | Fine-Tuning Language Models with Just Forward Passes |  |
| Flash-Decoding | Flash-Decoding for long-context inference |  |
| KCM | Gradient-Free Structured Pruning with Unlabeled Data |  |
| H2O | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models |  |
| K-pruning | Knowledge-preserving Pruning for Pre-trained Language Models without Retraining |  |
| LLM in a flash | LLM in a flash: Efficient Large Language Model Inference with Limited Memory |  |
| LLM-Pruner | LLM-Pruner: On the Structural Pruning of Large Language Models |  |
| LoRAShear | LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery |  |
| LoftQ | LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models |  |
| OmniQuant | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models |  |
| GPFQv2 | Post-training Quantization for Neural Networks with Provable Guarantees |  |
| PowerInfer | PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU |  |
| GBDT | Pruning Large Language Models via Accuracy Predictor |  |
| QLoRA | QLoRA: Efficient Finetuning of Quantized LLMs |  |
| QuIP | QuIP: Quantization with Incoherence Processing |  |
| RPTQ | RPTQ: Reorder-based Post-training Quantization for Large Language Models |  |
| LLM-shearing | Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning |  |
| SpQR | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression |  |
| SquareHead | Sparse Fine-tuning for Inference Acceleration of Large Language Models |  |
| Sparse-IFT | Sparse Iso-FLOP Transformations for Maximizing Training Efficiency |  |
| SMS | Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging |  |
| m | Ten Lessons We Have Learned in the New Sparseland: A Short Handbook for Sparse Neural Network Researchers |  |
| Essential Sparsity | The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter |  |
| m | Training Transformers with 4-bit Integers |  |
| Selective Context | Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering |  |
| ZeroQuant-V2 | ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation |  |
| m | A Survey on Efficient Inference for Large Language Models |  |
| 068ZPAME | A Survey on Inference Optimization Techniques for Mixture of Experts Models |  |
| PWGG5HBE | A Survey on Large Language Model Acceleration based on KV Cache Management |  |
| APEX | APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving |  |
| AVSS | AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis |  |
| AdaKV | Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference |  |
| m | Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs |  |
| SharedAttention | Beyond KV Caching: Shared Attention for Efficient LLMs |  |
| Minitron | Compact Language Models via Pruning and Knowledge Distillation |  |
| CoreInfer | CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation |  |
| DeepSeek-V2 | DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model |  |
| DeepSeek-V3 | DeepSeek-V3 Technical Report |  |
| DeepSeekMoE | DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models |  |
| Domino | Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping |  |
| DuoAttention | DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads |  |
| m | Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment |  |
| Bonsai | Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes |  |
| FLUX | FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion |  |
| FlashMask | FlashMask: Efficient and Rich Mask Extension of FlashAttention |  |
| DistAttention | Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache |  |
| KVQuant | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization |  |
| L4Q | L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ |  |
| LISA | LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning |  |
| YS9YTT55 | LLM Inference Serving: Survey of Recent Advances and Opportunities |  |
| massive-activations | Massive Activations in Large Language Models |  |
| MiniKV | MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache |  |
| MoD | Mixture-of-Depths: Dynamically allocating compute in transformer-based language models |  |
| MoA | MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression |  |
| MFA | Multi-matrix Factorization Attention |  |
| CHESS | Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification | PyTorch |
| DoubleSparsity | Post-Training Sparse Attention with Double Sparsity |  |
| PowerInfer-2 | PowerInfer-2: Fast Large Language Model Inference on a Smartphone | Website |
| ProSparse | ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models |  |
| Q-Sparse | Q-Sparse: All Large Language Models can be Fully Sparsely-Activated |  |
| QServe | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | PyTorch |
| ReLU2 | ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs |  |
| ReMoE | ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing |  |
| Recycled Attention | Recycled Attention: Efficient inference for long-context language models |  |
| CLA | Reducing Transformer Key-Value Cache Size with Cross-Layer Attention |  |
| m | Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark |  |
| SCBench | SCBench: A KV Cache-Centric Analysis of Long-Context Methods |  |
| SageAttention2 | SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization |  |
| SageAttention | SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration |  |
| SampleAttention | SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention |  |
| SeerAttention | SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs |  |
| ShadowLLM | ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models |  |
| SnapKV | SnapKV: LLM Knows What You are Looking for Before Generation |  |
| TOVA | Transformers are Multi-State RNNs |  |
| Turbo Sparse | Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters | PyTorch |
| XGrammar | XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models |  |
| ZigZagKV | ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty |  |
| Async-TP | [Distributed w/ TorchTitan] Introducing Async Tensor Parallelism in PyTorch |  |
| LinearPatch | A Simple Linear Patch Revives Layer-Pruned Large Language Models |  |
| Acc-SpMM | Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores |  |
| 07NWF4VE | Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching |  |
| SharePrefill | Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing |  |
| ACP | Adaptive Computation Pruning for the Forgetting Transformer |  |
| FlexiDepth | Adaptive Layer-skipping in Pre-trained LLMs |  |
| AhaKV | AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models |  |
| AmberPruner | Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models |  |
| AttentionPredictor | AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference |  |
| CCQ | CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs |  |
| UC0D8DJ6 | Characterizing Communication Patterns in Distributed Large Language Model Inference |  |
| 1DZIJVBI | Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications |  |
| ChunkKV | ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference |  |
| CometSeed | Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts |  |
| DBudgetKV | DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance |  |
| DeepSeek-R1 | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |  |
| DeltaAttention | Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction |  |
| DeltaLLM | DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference |  |
| LIMINAL | Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need |  |
| RaaS | Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity |  |
| topk-decoding | Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs |  |
| 2ZU1IWL6 | Fast and Simplex: 2-Simplicial Attention in Triton |  |
| FastKV | FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation |  |
| FlashInfer | FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving |  |
| FlashOverlap | FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation |  |
| FreqKV | FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension |  |
| HATA | HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference |  |
| HCAttention | HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs |  |
| GLA | Hardware-Efficient Attention for Fast Decoding |  |
| HelixParallelism | Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding |  |
| Adrenaline | Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation |  |
| IFPruning | Instruction-Following Pruning for Large Language Models |  |
| 209M5GA7 | KV Cache Compression for Inference Efficiency in LLMs: A Review |  |
| KVLink | KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse |  |
| KeepKV | KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference |  |
| LServe | LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention |  |
| LeanK | LeanK: Learnable K Cache Channel Pruning for Efficient Decoding |  |
| MIRAGE | MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving |  |
| MegaScale-MoE | MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production |  |
| MiniCPM4 | MiniCPM4: Ultra-Efficient LLMs on End Devices |  |
| 52A7RO95 | Mixture of Experts in Large Language Models |  |
| MoSA | Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing |  |
| MoR | Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation |  |
| MoBA | MoBA: Mixture of Block Attention for Long-Context LLMs |  |
| Mosaic | Mosaic: Composite Projection Pruning for Resource-efficient LLMs |  |
| PanguUltra | Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs |  |
| PowerAttention | PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention |  |
| PSA | Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving |  |
| Cus-Prun | Pruning General Large Language Models into Customized Expert Models |  |
| QuickSilver | QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization |  |
| Qwen3 | Qwen3 Technical Report |  |
| R-KV | R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration |  |
| RadialAttention | Radial Attention: Sparse Attention with Energy Decay for Long Video Generation |  |
| ReSA | Rectified Sparse Attention |  |
| 0VRXJQ3F | Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving |  |
| SALE | SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling |  |
| SEAP | SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models |  |
| SageAttention3 | SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training |  |
| SeerAttention-R | SeerAttention-R: Sparse Attention Adaptation for Long Reasoning |  |
| Seesaw | Seesaw: High-throughput LLM Inference via Model Re-sharding |  |
| Awesome-Efficient-Arch | Speed Always Wins: A Survey on Efficient Architectures for Large Language Models |  |
| SpindleKV | SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers |  |
| Step-3 | Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding |  |
| Task-KV | Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads |  |
| sparse-frontier | The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs |  |
| TileLink | TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives |  |
| TokenWeave | TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference |  |
| Triton-distributed | Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler |  |
| Super-Experts-Profilling | Unveiling Super Experts in Mixture-of-Experts Large Language Models |  |
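
Several entries above (NMSparse, Beyond 2:4, Amber Pruner) revolve around N:M structured sparsity, where only N of every M consecutive weights are kept. As a quick orientation, the sketch below shows how a 2:4 magnitude mask can be derived from a weight matrix. It is an illustrative example only, not code from any of the listed papers, and the `nm_sparsify` helper name is chosen here for exposition.

```python
import numpy as np

def nm_sparsify(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude entries in every group of m consecutive
    weights along the last axis and zero out the rest (here: 2:4 sparsity)."""
    assert weights.shape[-1] % m == 0, "last dimension must be divisible by m"
    w = weights.reshape(-1, m)                      # view the weights as groups of m
    keep = np.argsort(-np.abs(w), axis=1)[:, :n]    # indices of the n largest magnitudes
    mask = np.zeros_like(w, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)     # boolean N:M mask
    return np.where(mask, w, 0.0).reshape(weights.shape)

# A 1x8 weight row pruned to 2:4 sparsity: -0.9 and 0.4 survive in the first
# group of four, 0.7 and -0.8 in the second.
w = np.array([[0.1, -0.9, 0.05, 0.4, 0.7, -0.2, 0.3, -0.8]])
print(nm_sparsify(w))
```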