| Tag | Title | Code |
| --- | --- | --- |
| blocksparse | GPU Kernels for Block-Sparse Weights |  |
| NMSparse | Accelerating Sparse Deep Neural Networks |  |
| m | Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks |  |
| oBERT | The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models |  |
| m | A Survey on Evaluation of Large Language Models |  |
| m | A Survey on Model Compression for Large Language Models |  |
| GBLM-Pruner | Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models |  |
| CodeGeeX | CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X |  |
| Compresso | Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models |  |
| Adaptively Sparse Attention | Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers |  |
| m | Efficient Guided Generation for Large Language Models |  |
| MeZO | Fine-Tuning Language Models with Just Forward Passes |  |
| Flash-Decoding | Flash-Decoding for long-context inference |  |
| KCM | Gradient-Free Structured Pruning with Unlabeled Data |  |
| H2O | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models |  |
| K-pruning | Knowledge-preserving Pruning for Pre-trained Language Models without Retraining |  |
| LLM in a flash | LLM in a flash: Efficient Large Language Model Inference with Limited Memory |  |
| LLM-Pruner | LLM-Pruner: On the Structural Pruning of Large Language Models |  |
| LoRAShear | LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery |  |
| LoftQ | LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models |  |
| OmniQuant | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models |  |
| GPFQv2 | Post-training Quantization for Neural Networks with Provable Guarantees |  |
| PowerInfer | PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU |  |
| GBDT | Pruning Large Language Models via Accuracy Predictor |  |
| QLoRA | QLoRA: Efficient Finetuning of Quantized LLMs |  |
| QuIP | QuIP: Quantization with Incoherence Processing |  |
| RPTQ | RPTQ: Reorder-based Post-training Quantization for Large Language Models |  |
| LLM-shearing | Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning |  |
| SpQR | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression |  |
| SquareHead | Sparse Fine-tuning for Inference Acceleration of Large Language Models |  |
| Sparse-IFT | Sparse Iso-FLOP Transformations for Maximizing Training Efficiency |  |
| SMS | Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging |  |
| m | Ten Lessons We Have Learned in the New Sparseland: A Short Handbook for Sparse Neural Network Researchers |  |
| Essential Sparsity | The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter |  |
| m | Training Transformers with 4-bit Integers |  |
| Selective Context | Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering |  |
| ZeroQuant-V2 | ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation |  |
| m | A Survey on Efficient Inference for Large Language Models |  |
| 068ZPAME | A Survey on Inference Optimization Techniques for Mixture of Experts Models |  |
| PWGG5HBE | A Survey on Large Language Model Acceleration based on KV Cache Management |  |
| APEX | APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving |  |
| AVSS | AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis |  |
| AdaKV | Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference |  |
| m | Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs |  |
| SharedAttention | Beyond KV Caching: Shared Attention for Efficient LLMs |  |
| Minitron | Compact Language Models via Pruning and Knowledge Distillation |  |
| CoreInfer | CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation |  |
| DeepSeek-V2 | DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model |  |
| DeepSeek-V3 | DeepSeek-V3 Technical Report |  |
| DeepSeekMoE | DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models |  |
| Domino | Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping |  |
| DuoAttention | DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads |  |
| m | Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment |  |
| Bonsai | Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes |  |
| FLUX | FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion |  |
| FlashMask | FlashMask: Efficient and Rich Mask Extension of FlashAttention |  |
| DistAttention | Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache |  |
| KVQuant | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization |  |
| L4Q | L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ |  |
| LISA | LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning |  |
| YS9YTT55 | LLM Inference Serving: Survey of Recent Advances and Opportunities |  |
| massive-activations | Massive Activations in Large Language Models |  |
| MiniKV | MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache |  |
| MoD | Mixture-of-Depths: Dynamically allocating compute in transformer-based language models |  |
| MoA | MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression |  |
| MFA | Multi-matrix Factorization Attention |  |
| CHESS | Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification | PyTorch |
| DoubleSparsity | Post-Training Sparse Attention with Double Sparsity |  |
| PowerInfer-2 | PowerInfer-2: Fast Large Language Model Inference on a Smartphone | Website |
| ProSparse | ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models |  |
| Q-Sparse | Q-Sparse: All Large Language Models can be Fully Sparsely-Activated |  |
| QServe | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | PyTorch |
| ReLU2 | ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs |  |
| ReMoE | ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing |  |
| Recycled Attention | Recycled Attention: Efficient inference for long-context language models |  |
| CLA | Reducing Transformer Key-Value Cache Size with Cross-Layer Attention |  |
| m | Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark |  |
| SCBench | SCBench: A KV Cache-Centric Analysis of Long-Context Methods |  |
| SageAttention2 | SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization |  |
| SageAttention | SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration |  |
| SampleAttention | SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention |  |
| SeerAttention | SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs |  |
| ShadowLLM | ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models |  |
| SnapKV | SnapKV: LLM Knows What You are Looking for Before Generation |  |
| TOVA | Transformers are Multi-State RNNs |  |
| Turbo Sparse | Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters | PyTorch |
| XGrammar | XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models |  |
| ZigZagKV | ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty |  |
| Async-TP | [Distributed w/ TorchTitan] Introducing Async Tensor Parallelism in PyTorch |  |
| LinearPatch | A Simple Linear Patch Revives Layer-Pruned Large Language Models |  |
| Acc-SpMM | Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores |  |
| 07NWF4VE | Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching |  |
| SharePrefill | Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing |  |
| ACP | Adaptive Computation Pruning for the Forgetting Transformer |  |
| FlexiDepth | Adaptive Layer-skipping in Pre-trained LLMs |  |
| AhaKV | AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models |  |
| AmberPruner | Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models |  |
| AttentionPredictor | AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference |  |
| CCQ | CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs |  |
| UC0D8DJ6 | Characterizing Communication Patterns in Distributed Large Language Model Inference |  |
| 1DZIJVBI | Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications |  |
| ChunkKV | ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference |  |
| CometSeed | Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts |  |
| DBudgetKV | DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance |  |
| DeepSeek-R1 | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |  |
| DeltaAttention | Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction |  |
| DeltaLLM | DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference |  |
| LIMINAL | Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need |  |
| RaaS | Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity |  |
| topk-decoding | Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs |  |
| 2ZU1IWL6 | Fast and Simplex: 2-Simplicial Attention in Triton |  |
| FastKV | FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation |  |
| FlashInfer | FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving |  |
| FlashOverlap | FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation |  |
| FreqKV | FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension |  |
| HATA | HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference |  |
| HCAttention | HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs |  |
| GLA | Hardware-Efficient Attention for Fast Decoding |  |
| HelixParallelism | Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding |  |
| Adrenaline | Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation |  |
| IFPruning | Instruction-Following Pruning for Large Language Models |  |
| 209M5GA7 | KV Cache Compression for Inference Efficiency in LLMs: A Review |  |
| KVLink | KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse |  |
| KeepKV | KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference |  |
| LServe | LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention |  |
| LeanK | LeanK: Learnable K Cache Channel Pruning for Efficient Decoding |  |
| MIRAGE | MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving |  |
| MegaScale-MoE | MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production |  |
| MiniCPM4 | MiniCPM4: Ultra-Efficient LLMs on End Devices |  |
| 52A7RO95 | Mixture of Experts in Large Language Models |  |
| MoSA | Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing |  |
| MoR | Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation |  |
| MoBA | MoBA: Mixture of Block Attention for Long-Context LLMs |  |
| Mosaic | Mosaic: Composite Projection Pruning for Resource-efficient LLMs |  |
| PanguUltra | Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs |  |
| PowerAttention | PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention |  |
| PSA | Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving |  |
| Cus-Prun | Pruning General Large Language Models into Customized Expert Models |  |
| QuickSilver | QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization |  |
| Qwen3 | Qwen3 Technical Report |  |
| R-KV | R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration |  |
| RadialAttention | Radial Attention: Sparse Attention with Energy Decay for Long Video Generation |  |
| ReSA | Rectified Sparse Attention |  |
| 0VRXJQ3F | Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving |  |
| SALE | SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling |  |
| SEAP | SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models |  |
| SageAttention3 | SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training |  |
| SeerAttention-R | SeerAttention-R: Sparse Attention Adaptation for Long Reasoning |  |
| Seesaw | Seesaw: High-throughput LLM Inference via Model Re-sharding |  |
| Awesome-Efficient-Arch | Speed Always Wins: A Survey on Efficient Architectures for Large Language Models |  |
| SpindleKV | SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers |  |
| Step-3 | Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding |  |
| Task-KV | Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads |  |
| sparse-frontier | The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs |  |
| TileLink | TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives |  |
| TokenWeave | TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference |  |
| Triton-distributed | Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler |  |
| Super-Experts-Profilling | Unveiling Super Experts in Mixture-of-Experts Large Language Models |  |
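
Several entries above (NMSparse, Beyond 2:4, Amber Pruner) revolve around N:M structured sparsity, where only N of every M consecutive weights are kept. As a quick orientation, the sketch below shows how a 2:4 magnitude mask can be derived from a weight matrix. It is an illustrative example only, not code from any of the listed papers, and the `nm_sparsify` helper name is chosen here for exposition.

```python
import numpy as np

def nm_sparsify(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude entries in every group of m consecutive
    weights along the last axis and zero out the rest (here: 2:4 sparsity)."""
    assert weights.shape[-1] % m == 0, "last dimension must be divisible by m"
    w = weights.reshape(-1, m)                      # view the weights as groups of m
    keep = np.argsort(-np.abs(w), axis=1)[:, :n]    # indices of the n largest magnitudes
    mask = np.zeros_like(w, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)     # boolean N:M mask
    return np.where(mask, w, 0.0).reshape(weights.shape)

# A 1x8 weight row pruned to 2:4 sparsity: -0.9 and 0.4 survive in the first
# group of four, 0.7 and -0.8 in the second.
w = np.array([[0.1, -0.9, 0.05, 0.4, 0.7, -0.2, 0.3, -0.8]])
print(nm_sparsify(w))
```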