By Institution

institution

AWS AI

Meta	Title	Cover	Publish	Code	Note
MXFP4Train	Training LLMs with MXFP4				note

AWS AI Labs

Meta	Title	Cover	Publish	Code	Note
m	Structural Pruning of Large Language Models via Neural Architecture Search

Adobe Research

Meta	Title	Cover	Publish	Code	Note
QJL	QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead				note

Advanced Micro Devices

Meta	Title	Note
T3	T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives	note
SDS	Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism	note
BaWA	BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation	note
Týr-the-Pruner	Týr-the-Pruner: Unlocking Accurate 50% Structural Pruning for LLMs via Global Sparsity Distribution Optimization	note

Alibaba Cloud

Meta	Title	Cover	Publish	Code	Note
07NWF4VE	Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching				note

Alibaba Group

Meta	Title	Note
SlimGPT	SlimGPT: Layer-wise Structured Pruning for Large Language Models	note
Flash-LLM	Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity	note
DistAttention	Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache	note
CateKV	CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration	note
LaRoSA	La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation	note
Cus-Prun	Pruning General Large Language Models into Customized Expert Models	note

Apple

Meta	Title	Note
LLM in a flash	LLM in a flash: Efficient Large Language Model Inference with Limited Memory	note
LLM-KICK	Compressing LLMs: The Truth is Rarely Pure and Never Simple	note
ReLU Strikes Back	ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
LazyLLM	LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference	note
IFPruning	Instruction-Following Pruning for Large Language Models	note

Argonne National Laboratory

Meta	Title	Cover	Publish	Code	Note
MoPEQ	MoPEQ: Mixture of Mixed Precision Quantized Experts				note
PagedEviction	PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference				note

Baichuan Inc

Meta	Title	Cover	Publish	Code	Note
PQCache	PQCache: Product Quantization-based KVCache for Long Context LLM Inference				note

Beihang University

Meta	Title	Cover	Publish	Code	Note
SMP	Pruning Pre-trained Language Models Without Fine-Tuning
SlimInfer	SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning				note

ByteDance

Meta	Title	Note
FLUX	FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion	note
FlexPrefill	FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference	note
PoD	Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity	note
ShadowKV	ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference	note
KeepKV	KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference	note
52A7RO95	Mixture of Experts in Large Language Models	note
PowerAttention	PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention	note

ByteDance Seed

Meta	Title	Note
CometSeed	Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts	note
MegaScale-MoE	MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production	note
TileLink	TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives	note
Triton-distributed	Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler	note

CMU

Meta	Title	Cover	Publish	Code	Note
massive-activations	Massive Activations in Large Language Models				note

CPII under InnoHK

Meta	Title	Cover	Publish	Code	Note
SPP	SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models				note

Carnegie Mellon University

Meta	Title	Note
GBLM-Pruner	Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models	note
H2O	H $_2$ O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models	note
Wanda	A Simple and Effective Pruning Approach for Large Language Models	note
streaming-llm	Efficient Streaming Language Models with Attention Sinks	note
KIVI	KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache	note
Bonsa	Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes
XGrammar	XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models	note
TidalDecode	TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention	note
ShadowKV	ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference	note
FlashInfer	FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving	note
R-KV	R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration	note

CentML

Meta	Title	Cover	Publish	Code	Note
Seesaw	Seesaw: High-throughput LLM Inference via Model Re-sharding				note

Center for Advanced AI

Meta	Title	Cover	Publish	Code	Note
KVLink	KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse				note

Central South University

Meta	Title	Cover	Publish	Code	Note
07NWF4VE	Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching				note

Cerebras Systems

Meta	Title	Note
SPDF	SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models
Sparse-IFT	Sparse Iso-FLOP Transformations for Maximizing Training Efficiency
m	Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment	note

Chinese Academy of Sciences

Meta	Title	Cover	Publish	Code	Note
CoCoNet	Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads				note

Chinese University of Hong Kong

Meta	Title	Cover	Publish	Code	Note
068ZPAME	A Survey on Inference Optimization Techniques for Mixture of Experts Models				note

Chongqing University

Meta	Title	Cover	Publish	Code	Note
GBDT	Pruning Large Language Models via Accuracy Predictor

City University of Hong Kong

Meta	Title	Cover	Publish	Code	Note
SMAT	Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts				note
CHESS	Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification			Pytorch	note

Cohere

Meta	Title	Cover	Publish	Code	Note
SnapKV	SnapKV: LLM Knows What You are Looking for Before Generation				note
sparse-frontier	The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs				note

Comenius University

Meta	Title	Cover	Publish	Code	Note
ADMM-pruning	Fast and Effective Weight Update for Pruned Large Language Models				note

Computer Network Information Center, Chinese Academy of Sciences

Meta	Title	Cover	Publish	Code	Note
Acc-SpMM	Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores				note

Cornell University

Meta	Title	Note
Movement Pruning	Movement Pruning: Adaptive Sparsity by Fine-Tuning
QuIP	QuIP: Quantization with Incoherence Processing
Recycled Attention	Recycled Attention: Efficient inference for long-context language models	note
ShadowLLM	ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models	note
MXFP4Train	Training LLMs with MXFP4	note

DENSO IT Lab

Meta	Title	Cover	Publish	Code	Note
SAS	SAS: Structured Activation Spasification				note

DeepAuto.ai

Meta	Title	Cover	Publish	Code	Note
DeltaAttention	Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction				note

DeepMind

Meta	Title	Cover	Publish	Code	Note
m	Fast Sparse ConvNets

DeepSeek

Meta	Title	Cover	Publish	Code	Note
DeepEP	DeepEP: an efficient expert-parallel communication library				note

DeepSeek-AI

Meta	Title	Note
DeepSeek-V2	DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model	note
DeepSeek-V3	DeepSeek-V3 Technical Report	note
DeepSeekMoE	DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models	note
NSA	Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention	note
DeepSeek-R1	DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning	note

DeepSpeed

Meta	Title	Cover	Publish	Code	Note
Domino	Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping				note

Delft University of Technology

Meta	Title	Cover	Publish	Code	Note
DeltaLLM	DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference				note

Duke University

Meta	Title	Cover	Publish	Code	Note
CoreInfer	CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation				note

ETH Zurich

Meta	Title	Note
VENOM	VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores	note
SliceGPT	SliceGPT: Compress Large Language Models by Deleting Rows and Columns	note
EvolKV	EvolKV: Evolutionary KV Cache Compression for LLM Inference	note

Eindhoven University of Technology

Meta	Title	Cover	Publish	Code	Note
OWL	Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity

Emory University

Meta	Title	Cover	Publish	Code	Note
SparseLLM	SparseLLM: Towards Global Pruning for Pre-trained Language Models				note

FAIR

Meta	Title	Cover	Publish	Code	Note
TOVA	Transformers are Multi-State RNNs				note

Fairleigh Dickinson University

Meta	Title	Cover	Publish	Code	Note
MIRAGE	MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving				note

Fudan University

Meta	Title	Note
MFA	Multi-matrix Factorization Attention	note
ReAttention	ReAttention: Training-Free Infinite Context with Finite Attention Scope	note
CateKV	CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration	note
PowerAttention	PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention	note

Gaoling School of Artificial Intelligence, Renmin University of China

Meta	Title	Cover	Publish	Code	Note
m	A Survey on Model Compression for Large Language Models

Georgia Institute of Technology

Meta	Title	Note
AdaLoRA	AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
LoftQ	LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models	note
1DZIJVBI	Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications	note

Georgia Tech

Meta	Title	Cover	Publish	Code	Note
GEAR	GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM				note

Google

Meta	Title	Note
Transformer	Attention Is All You Need	note
m	Fast Sparse ConvNets
Sprint	Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation
Dist-Einsum	Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models	note
m	The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers
KCM	Gradient-Free Structured Pruning with Unlabeled Data
ShadowLLM	ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models	note

Google Cloud

Meta	Title	Cover	Publish	Code	Note
MoR	Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation				note

Google DeepMind

Meta	Title	Note
MoD	Mixture-of-Depths: Dynamically allocating compute in transformer-based language models	note
RecursiveTransformers	Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA	note
MoR	Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation	note

Google Research

Meta	Title	Note
FrameQuant	FrameQuant: Flexible Low-Bit Quantization for Transformers	note
OSSCAR	OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization	note
OWL	Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
Bonsa	Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes
RecursiveTransformers	Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA	note
MoR	Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation	note

Graphcore Research

Meta	Title	Cover	Publish	Code	Note
SparQ	SparQ Attention: Bandwidth-Efficient LLM Inference				note

HKUST

Meta	Title	Cover	Publish	Code	Note
Awesome-Efficient-Arch	Speed Always Wins: A Survey on Efficient Architectures for Large Language Models				note

Habana Labs

Meta	Title	Cover	Publish	Code	Note
MVUE	Minimum Variance Unbiased N:M Sparsity for the Neural Gradients

Harbin Institute of Technology

Meta	Title	Note
TextPruner	TextPruner: A Model Pruning Toolkit for Pre-Trained Language Models
GRAIN	Gradient-based Intra-attention Pruning on Pre-trained Language Models	note
ZigZagKV	ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty	note

Harvard University

Meta	Title	Cover	Publish	Code	Note
SMAT	Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts				note

Heriot-Watt University

Meta	Title	Cover	Publish	Code	Note
52A7RO95	Mixture of Experts in Large Language Models				note

Hong Kong University of Science and Technology

Meta	Title	Cover	Publish	Code	Note
PWGG5HBE	A Survey on Large Language Model Acceleration based on KV Cache Management				note
SlimInfer	SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning				note

Houmo AI

Meta	Title	Cover	Publish	Code	Note
RPTQ	RPTQ: Reorder-based Post-training Quantization for Large Language Models				note

Huawei

Meta	Title	Note
CodeGeeX	CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X	note
QA-LoRA	QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models	note
AmberPruner	Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models	note
PanguUltra	Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs	note
RotateKV	RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations	note

Huawei Cloud

Meta	Title	Note
CachedAttention	Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention	note
AdaSkip	AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference	note
RaaS	Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity	note
Adrenaline	Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation	note
PSA	Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving	note

Huawei Noah's Ark Lab

Meta	Title	Note
SIMPLE	Structured Pruning for Efficient Generative Pre-trained Language Models	note
RIA	Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models
m	Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs	note
ReAttention	ReAttention: Training-Free Infinite Context with Finite Attention Scope	note
SlimLLM	SlimLLM: Accurate Structured Pruning for Large Language Models	note
LinearPatch	A Simple Linear Patch Revives Layer-Pruned Large Language Models	note
FreqKV	FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension	note

Huawei Technologies

Meta	Title	Cover	Publish	Code	Note
DSnoT	Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs				note
HATA	HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference				note

Huazhong University of Science and Technology

Meta	Title	Cover	Publish	Code	Note
PWGG5HBE	A Survey on Large Language Model Acceleration based on KV Cache Management				note
SeerAttention-R	SeerAttention-R: Sparse Attention Adaptation for Long Reasoning				note

Hugging Face

Meta	Title	Cover	Publish	Code	Note
Movement Pruning	Movement Pruning: Adaptive Sparsity by Fine-Tuning

IST

Meta	Title	Cover	Publish	Code	Note
m	Efficient Methods for Natural Language Processing: A Survey

IST Austria

Meta	Title	Note
m	Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference
m	Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks
SPDY	SPDY: Accurate Pruning with Speedup Guarantees	note
OBC	Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning
oBERT	The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models
GPTQ	GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
SparseGPT	SparseGPT: Massive Language Models Can be Accurately Pruned in one-shot.
ZipLM	ZipLM: Inference-Aware Structured Pruning of Language Models
SpQR	SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
SquareHead	Sparse Fine-tuning for Inference Acceleration of Large Language Models
m	Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment	note

Illinois Institute of Technology

Meta	Title	Cover	Publish	Code	Note
MoPEQ	MoPEQ: Mixture of Mixed Precision Quantized Experts				note
PagedEviction	PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference				note

Imperial College London

Meta	Title	Cover	Publish	Code	Note
52A7RO95	Mixture of Experts in Large Language Models				note

Indian Institute of Science

Meta	Title	Cover	Publish	Code	Note
vAttention	vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention				note

Infinigence-AI

Meta	Title	Note
MoA	MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression	note
SpecEE	SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting	note
FlashOverlap	FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation	note

Institute for Advanced Algorithms Research

Meta	Title	Cover	Publish	Code	Note
SEAP	SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models				note

Institute of Automation

Meta	Title	Cover	Publish	Code	Note
EvolKV	EvolKV: Evolutionary KV Cache Compression for LLM Inference				note

Institute of Automation, Chinese Academy of Sciences

Meta	Title	Note
FLAP	Fluctuation-based Adaptive Structured Pruning for Large Language Models
LongBench	LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding	note
RIA	Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models
Awesome-Efficient-Arch	Speed Always Wins: A Survey on Efficient Architectures for Large Language Models	note

Institute of Computing Technology

Meta	Title	Cover	Publish	Code	Note
COMET	COMET: Towards Partical W4A4KV4 LLMs Serving				note
BaWA	BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation				note

Institute of Computing Technology, Chinese Academy of Sciences

Meta	Title	Cover	Publish	Code	Note
ProSparse	ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models				note

Institute of Information Engineering, Chinese Academy of Sciences

Meta	Title	Cover	Publish	Code	Note
m	A Survey on Model Compression for Large Language Models

Intel

Meta	Title	Cover	Publish	Code	Note
SCAP	Post-Training Statistical Calibration for Higher Activation Sparsity				note
GEAR	GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM				note

Intel Corporation

Meta	Title	Cover	Publish	Code	Note
OpenVINO	Post-training deep neural network pruning via layer-wise calibration

Intellifusion Inc.

Meta	Title	Cover	Publish	Code	Note
HCAttention	HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs				note

KAIST

Meta	Title	Note
MiKV	No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization	note
DeltaAttention	Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction	note
UC0D8DJ6	Characterizing Communication Patterns in Distributed Large Language Model Inference	note
1DZIJVBI	Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications	note

KAIST AI

Meta	Title	Cover	Publish	Code	Note
RecursiveTransformers	Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA				note
MoR	Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation				note

KAUST

Meta	Title	Cover	Publish	Code	Note
DistGEMM	A novel CUTLASS-based implementation of Tensor Parallelism for NVLink-enabled systems				note
MoSA	Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing				note

KTH Royal Institute of Technology

Meta	Title	Cover	Publish	Code	Note
Awesome-Efficient-Arch	Speed Always Wins: A Survey on Efficient Architectures for Large Language Models				note

Key Laboratory of Multimedia Trusted Perception and Efficient Computing

Meta	Title	Cover	Publish	Code	Note
DSnoT	Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs				note

Kyushu University

Meta	Title	Cover	Publish	Code	Note
SharedAttention	Beyond KV Caching: Shared Attention for Efficient LLMs				note

Lanzhou University

Meta	Title	Cover	Publish	Code	Note
AVSS	AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis				note

Leiden University

Meta	Title	Cover	Publish	Code	Note
DeltaLLM	DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference				note

Lingnan University

Meta	Title	Cover	Publish	Code	Note
WGM	Binary Quantization For LLMs Through Dynamic Grouping				note

MBZUAI

Meta	Title	Cover	Publish	Code	Note
CHESS	Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification			Pytorch	note

MIT

Meta	Title	Code	Note
SparseViT	SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer		note
TorchSparse++	TorchSparse++: Efficient Point Cloud Engine
Quest	Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference		note
AWQ	AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
DuoAttention	DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads		note
QServe	QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving	Pytorch	note
CLA	Reducing Transformer Key-Value Cache Size with Cross-Layer Attention		note
LServer	LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention		note

MIT-IBM Watson AI Lab

Meta	Title	Cover	Publish	Code	Note
CLA	Reducing Transformer Key-Value Cache Size with Cross-Layer Attention				note

MakerMaker AI

Meta	Title	Cover	Publish	Code	Note
FoX	Forgetting Transformer: Softmax Attention with a Forget Gate				note
ACP	Adaptive Computation Pruning for the Forgetting Transformer				note

Massachusetts Institute of Technology

Meta	Title	Note
streaming-llm	Efficient Streaming Language Models with Attention Sinks	note
OSSCAR	OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization	note
TEAL	Training-Free Activation Sparsity in Large Language Models	note
XAttention	XAttention: Block Sparse Attention with Antidiagonal Scoring	note

Megvii Technology

Meta	Title	Cover	Publish	Code	Note
MFA	Multi-matrix Factorization Attention				note

Meituan

Meta	Title	Cover	Publish	Code	Note
Super-Experts-Profilling	Unveiling Super Experts in Mixture-of-Experts Large Language Models				note

Meta AI

Meta	Title	Note
streaming-llm	Efficient Streaming Language Models with Attention Sinks	note
LazyLLM	LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference	note
R-Sparse	R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference	note

Meta AI (FAIR)

Meta	Title	Cover	Publish	Code	Note
H2O	H $_2$ O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models				note

Meta Platforms Inc

Meta	Title	Cover	Publish	Code	Note
TorchAO	TorchAO: PyTorch-Native Training-to-Serving Model Optimization				note

Michigan State University

Meta	Title	Cover	Publish	Code	Note
m	Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark				note

Microsoft

Meta	Title	Note
LoRA	LoRA: Low-rank adaptation of large language models
ZeroQuant	ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers
LoSparse	Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation
Compresso	Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models	note
LoRAShear	LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery
ZeroQuant-V2	ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation
ChunkAttention	ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition	note
SliceGPT	SliceGPT: Compress Large Language Models by Deleting Rows and Columns	note
Splitwise	Splitwise: Efficient generative LLM inference using phase splitting	note
MInference	MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention	note
APEX	APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving	note
Domino	Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping	note
SCBench	SCBench: A KV Cache-Centric Analysis of Long-Context Methods	note
MMInference	MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention	note
R-KV	R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration	note

Microsoft Azure

Meta	Title	Cover	Publish	Code	Note
LoftQ	LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models				note

Microsoft Azure AI

Meta	Title	Cover	Publish	Code	Note
AdaLoRA	AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

Microsoft Research

Meta	Title	Note
CoCoNet	Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads	note
nmSPARSE	Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning
m	A Survey on Evaluation of Large Language Models
EAGLE	EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty	note
Q-Sparse	Q-Sparse: All Large Language Models can be Fully Sparsely-Activated	note
SeerAttention	SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs	note
POD-Attention	POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference	note
vAttention	vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention	note
LeanK	LeanK: Learnable K Cache Channel Pruning for Efficient Decoding	note
ReSA	Rectified Sparse Attention	note
SeerAttention-R	SeerAttention-R: Sparse Attention Adaptation for Long Reasoning	note

Microsoft Research India

Meta	Title	Cover	Publish	Code	Note
Vidur	Vidur: A Large-Scale Simulation Framework For LLM Inference				note
TokenWeave	TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference				note

Mila & Universite de Montreal

Meta	Title	Cover	Publish	Code	Note
FoX	Forgetting Transformer: Softmax Attention with a Forget Gate				note
ACP	Adaptive Computation Pruning for the Forgetting Transformer				note

MiniCPM

Meta	Title	Cover	Publish	Code	Note
MiniCPM4	MiniCPM4: Ultra-Efficient LLMs on End Devices				note

MiniMax

Meta	Title	Cover	Publish	Code	Note
MiniMax-01	MiniMax-01: Scaling Foundation Models with Lightning Attention				note
MiniMax-M1	MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention				note

Ministry of Education of China, Xiamen University

Meta	Title	Cover	Publish	Code	Note
DSnoT	Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs				note

Mohamed bin Zayed University of AI

Meta	Title	Cover	Publish	Code	Note
GBLM-Pruner	Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models				note

Monash University

Meta	Title	Note
ZipCache	ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification	note
MiniCache	MiniCache: KV Cache Compression in Depth Dimension for Large Language Models	note
ZipVL	ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification	note

Moonshot AI

Meta	Title	Cover	Publish	Code	Note
MoBA	MoBA: Mixture of Block Attention for Long-Context LLMs				note

Multimedia Laboratory (MMLab)

Meta	Title	Cover	Publish	Code	Note
SPP	SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models				note

Murdoch University

Meta	Title	Cover	Publish	Code	Note
TOA	Tree of Agents: Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning				note

NAVER Cloud

Meta	Title	Cover	Publish	Code	Note
MiKV	No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization				note

NVIDIA

Meta	Title	Code	Note
m	Channel Permutations for N:M Sparsity
NMSparse	Accelerating Sparse Deep Neural Networks
FT	FasterTransformer
DistGEMM	A novel CUTLASS-based implementation of Tensor Parallelism for NVLink-enabled systems		note
streaming-llm	Efficient Streaming Language Models with Attention Sinks		note
MaskLLM	MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models		note
Minitron	Compact Language Models via Pruning and Knowledge Distillation		note
DuoAttention	DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads		note
QServe	QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving	Pytorch	note
XGrammar	XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models		note
StarAttention	Star Attention: Efficient LLM Inference over Long Sequences		note
XAttention	XAttention: Block Sparse Attention with Antidiagonal Scoring		note
FlashInfer	FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving		note
HelixParallelism	Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding		note
LServer	LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention		note
kvpress	kvpress: LLM KV cache compression made easy		note

NVIDIA Research

Meta	Title	Cover	Publish	Code	Note
LIMINAL	Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need				note

NanKai University

Meta	Title	Cover	Publish	Code	Note
Task-KV	Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads				note

Nanjing University

Meta	Title	Note
STA	An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers
DeepSeekMoE	DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models	note
RaaS	Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity	note
LAVa	LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation	note

Nanyang Technological University

Meta	Title	Note
L-OBS	Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon
SMAT	Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts	note
0VRXJQ3F	Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving	note
Cus-Prun	Pruning General Large Language Models into Customized Expert Models	note

National University of Singapore

Meta	Title	Note
LLM-Pruner	LLM-Pruner: On the Structural Pruning of Large Language Models	note
CachedAttention	Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention	note
MaskLLM	MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models	note
ZipCache	ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification	note
AdaSkip	AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference	note
Cus-Prun	Pruning General Large Language Models into Customized Expert Models	note

Neural Magic

Meta	Title	Note
m	Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference
SPDY	SPDY: Accurate Pruning with Speedup Guarantees	note
OBC	Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning
oBERT	The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models
GPTQ	GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
SparseGPT	SparseGPT: Massive Language Models Can be Accurately Pruned in one-shot.
ZipLM	ZipLM: Inference-Aware Structured Pruning of Language Models
SpQR	SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
SquareHead	Sparse Fine-tuning for Inference Acceleration of Large Language Models
m	Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment	note

New York University

Meta	Title	Cover	Publish	Code	Note
Recycled Attention	Recycled Attention: Efficient inference for long-context language models				note
QJL	QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead				note

Noah’s Ark Lab, Huawei Technologies

Meta	Title	Cover	Publish	Code	Note
AttentionPredictor	AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference				note

Normal Computing

Meta	Title	Cover	Publish	Code	Note
m	Efficient Guided Generation for Large Language Models				note

North China Electric Power University

Meta	Title	Cover	Publish	Code	Note
COMET	COMET: Towards Partical W4A4KV4 LLMs Serving				note

Northeastern University

Meta	Title	Cover	Publish	Code	Note
ADMM-pruning	A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers

Northwestern University

Meta	Title	Cover	Publish	Code	Note
SR-STE	Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch

Numenta

Meta	Title	Cover	Publish	Code	Note
Complementary Sparsity	Two Sparsities Are Better Than One: Unlocking the Performance Benefits of Sparse-Sparse Networks				note

OPPO Research Institute

Meta	Title	Cover	Publish	Code	Note
SharePrefill	Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing				note

Ohio State University

Meta	Title	Cover	Publish	Code	Note
CoCoNet	Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads				note
CoreInfer	CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation				note

OpenAI

Meta	Title	Cover	Publish	Code	Note
blocksparse	GPU Kernels for Block-Sparse Weights

OpenGVLab

Meta	Title	Cover	Publish	Code	Note
OmniQuant	OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

OpenNLPLab

Meta	Title	Cover	Publish	Code	Note
LightningAttention	Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention				note
LightningAttention-2	Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models				note

OpenTeams Inc

Meta	Title	Cover	Publish	Code	Note
TorchAO	TorchAO: PyTorch-Native Training-to-Serving Model Optimization				note

Oxford University

Meta	Title	Cover	Publish	Code	Note
CATS	CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models				note

Peking University

Meta	Title	Note
Centauri	Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning	note
EAGLE	EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty	note
m	A Survey on Efficient Inference for Large Language Models	note
FLUX	FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion	note
DistAttention	Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache	note
SampleAttention	SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention	note
NSA	Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention	note
FlexPrefill	FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference	note
PQCache	PQCache: Product Quantization-based KVCache for Long Context LLM Inference	note
RaaS	Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity	note
HATA	HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference	note
KeepKV	KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference	note
MegaScale-MoE	MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production	note
SALE	SALE : Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling	note
SeerAttention-R	SeerAttention-R: Sparse Attention Adaptation for Long Reasoning	note
Awesome-Efficient-Arch	Speed Always Wins: A Survey on Efficient Architectures for Large Language Models	note

Perplexity AI

Meta	Title	Cover	Publish	Code	Note
FlashInfer	FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving				note

Princeton University

Meta	Title	Note
AdaLoRA	AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
MeZO	Fine-Tuning Language Models with Just Forward Passes	note
LLM-shearing	Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning	note
SnapKV	SnapKV: LLM Knows What You are Looking for Before Generation	note
TEAL	Training-Free Activation Sparsity in Large Language Models	note
GLA	Hardware-Efficient Attention for Fast Decoding	note

Purdue University

Meta	Title	Cover	Publish	Code	Note
PQCache	PQCache: Product Quantization-based KVCache for Long Context LLM Inference				note
52A7RO95	Mixture of Experts in Large Language Models				note

PyTorch

Meta	Title	Cover	Publish	Code	Note
Async-TP	[Distributed w/ TorchTitan] Introducing Async Tensor Parallelism in PyTorch				note

Qwen Team

Meta	Title	Cover	Publish	Code	Note
Qwen3	Qwen3 Technical Report				note

RMIT University

Meta	Title	Cover	Publish	Code	Note
TOA	Tree of Agents: Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning				note

RWTH Aachen University

Meta	Title	Cover	Publish	Code	Note
FasterVGGT	Faster VGGT with Block-Sparse Global Attention				note

Renmin University of China

Meta	Title	Cover	Publish	Code	Note
Acc-SpMM	Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores				note
SEAP	SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models				note

Rice University

Meta	Title	Cover	Publish	Code	Note
Deja Vu	Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
KIVI	KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache				note

RiseAI-Sys

Meta	Title	Cover	Publish	Code	Note
attention-gym	Attention-Gym: Triton-Based Sparse and Quantization Attention				note

SJTU

Meta	Title	Cover	Publish	Code	Note
XAttention	XAttention: Block Sparse Attention with Antidiagonal Scoring				note
LServer	LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention				note

Salesforce AI Research

Meta	Title	Cover	Publish	Code	Note
SPP	SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models				note

Salesforce Research

Meta	Title	Cover	Publish	Code	Note
topk-decoding	Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs				note

Samsung

Meta	Title	Cover	Publish	Code	Note
FisherPruning	A Fast Post-Training Pruning Framework for Transformers				note

Samsung AI Center

Meta	Title	Cover	Publish	Code	Note
TinyTrain	TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge				note

Santa Clara University

Meta	Title	Cover	Publish	Code	Note
SparseInfer	SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference				note

School of Cyber Security, University of Chinese Academy of Sciences

Meta	Title	Cover	Publish	Code	Note
m	A Survey on Model Compression for Large Language Models

SenseTime

Meta	Title	Cover	Publish	Code	Note
0VRXJQ3F	Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving				note

SenseTime Research

Meta	Title	Cover	Publish	Code	Note
BRECQ	BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction
SR-STE	Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch

Seoul National University

Meta	Title	Note
K-pruning	Knowledge-preserving Pruning for Pre-trained Language Models without Retraining	note
L4Q	L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ	note
FastKV	FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation	note
RetroAttention	Retrospective Sparse Attention for Efficient Long-Context Generation	note

Shanghai AI Lab

Meta	Title	Cover	Publish	Code	Note
Centauri	Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning				note

Shanghai AI Laboratory

Meta	Title	Note
PrefixQuant	PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization	note
ZipVL	ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification	note
0VRXJQ3F	Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving	note
Awesome-Efficient-Arch	Speed Always Wins: A Survey on Efficient Architectures for Large Language Models	note

Shanghai Artificial Intelligence Laboratory

Meta	Title	Cover	Publish	Code	Note
Turbo Sparse	Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters			Pytorch	note

Shanghai Artificial Intelligence Laboratorys

Meta	Title	Cover	Publish	Code	Note
SPP	SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models				note

Shanghai Jiao Tong University

Meta	Title	Code	Note
PINS	Pruning Pre-trained Language Models with Principled Importance and Self-regularization
PowerInfer	PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU		note
m	Keep the Cost Down: A Review on Methods to Optimize LLM' s KV-Cache Consumption		note
Quest	Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference		note
SIFT	Sparse is Enough in Fine-tuning Pre-trained Large Language Models		note
SGLang	SGLang: Efficient Execution of Structured Language Model Programs		note
068ZPAME	A Survey on Inference Optimization Techniques for Mixture of Experts Models		note
DistAttention	Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache		note
MoA	MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression		note
DoubleSparsity	Post-Training Sparse Attention with Double Sparsity		note
PowerInfer-2	PowerInfer-2: Fast Large Language Model Inference on a Smartphone	Website	note
ReLU2	ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs		note
Turbo Sparse	Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters	Pytorch	note
XGrammar	XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models		note
AdaSkip	AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference		note
CateKV	CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration		note
PoD	Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity		note
SpecEE	SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting		note
CometSeed	Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts		note
FreqKV	FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension		note
FlashOverlap	FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation		note

Shanghai Jiaotong University

Meta	Title	Note
CachedAttention	Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention	note
m	A Survey on Efficient Inference for Large Language Models	note
SharePrefill	Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing	note

ShanghaiTech University

Meta	Title	Cover	Publish	Code	Note
COMET	COMET: Towards Partical W4A4KV4 LLMs Serving				note

Shenzhen Institutes of Advanced Technology(SIAT), Chinese Academy of Science(CAS)

Meta	Title	Cover	Publish	Code	Note
AMALI	AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs				note

Singapore University of Technology and Design

Meta	Title	Cover	Publish	Code	Note
Cus-Prun	Pruning General Large Language Models into Customized Expert Models				note

Sogang University

Meta	Title	Cover	Publish	Code	Note
SparseInfer	SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference				note

Soochow University

Meta	Title	Cover	Publish	Code	Note
Awesome-Efficient-Arch	Speed Always Wins: A Survey on Efficient Architectures for Large Language Models				note

Southwest University

Meta	Title	Cover	Publish	Code	Note
TOA	Tree of Agents: Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning				note

Stanford

Meta	Title	Cover	Publish	Code	Note
DoubleSparsity	Post-Training Sparse Attention with Double Sparsity				note

Stanford University

Meta	Title	Note
Deep Compression	Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
DSD	DSD: Dense-Sparse-Dense Training for Deep Neural Networks
FlashAttention	FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Deja Vu	Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
PagedAttention	Efficient Memory Management for Large Language Model Serving with PagedAttention	note
Flash-Decoding	Flash-Decoding for long-context inference	note
H2O	H $_2$ O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models	note
CATS	CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models	note
FlashAttention-2	FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
SGLang	SGLang: Efficient Execution of Structured Language Model Programs	note
MoA	MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression	note
MoSA	Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing	note
Seesaw	Seesaw: High-throughput LLM Inference via Model Re-sharding	note

StepFun

Meta	Title	Cover	Publish	Code	Note
MFA	Multi-matrix Factorization Attention				note
Step-3	Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding				note

Stepfun

Meta	Title	Cover	Publish	Code	Note
LAVa	LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation				note

Stevens Institute of Technology

Meta	Title	Cover	Publish	Code	Note
KIVI	KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache				note

Sun Yat-sen University

Meta	Title	Cover	Publish	Code	Note
Adrenaline	Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation				note

Sungkyunkwan University

Meta	Title	Cover	Publish	Code	Note
L4Q	L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ				note

Synthesia

Meta	Title	Cover	Publish	Code	Note
SparQ	SparQ Attention: Bandwidth-Efficient LLM Inference				note

Tencent AI Lab

Meta	Title	Cover	Publish	Code	Note
RPTQ	RPTQ: Reorder-based Post-training Quantization for Large Language Models				note

Tencent Machine Learning Platform

Meta	Title	Cover	Publish	Code	Note
ProSparse	ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models				note

Tencent Youtu Lab

Meta	Title	Cover	Publish	Code	Note
DSnoT	Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs				note

Texas A&M University

Meta	Title	Cover	Publish	Code	Note
KIVI	KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache				note

The Chinese University of Hong Kong

Meta	Title	Note
Centauri	Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning	note
SPP	SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models	note
PWGG5HBE	A Survey on Large Language Model Acceleration based on KV Cache Management	note
SampleAttention	SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention	note
PSA	Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving	note
Awesome-Efficient-Arch	Speed Always Wins: A Survey on Efficient Architectures for Large Language Models	note

The Hebrew University of Jerusalem

Meta	Title	Cover	Publish	Code	Note
TOVA	Transformers are Multi-State RNNs				note

The Hebrew University of Jerusalem, Israel

Meta	Title	Cover	Publish	Code	Note
m	Efficient Methods for Natural Language Processing: A Survey

The Hong Kong Polytechnic University

Meta	Title	Cover	Publish	Code	Note
PWGG5HBE	A Survey on Large Language Model Acceleration based on KV Cache Management				note
WGM	Binary Quantization For LLMs Through Dynamic Grouping				note

The Hong Kong University of Science and Technology

Meta	Title	Note
LISA	LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning	note
WGM	Binary Quantization For LLMs Through Dynamic Grouping	note
ChunkKV	ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference	note
FSA	Flash Sparse Attention: An Alternative Efficient Implementation of Native Sparse Attention Kernel	note

The Ohio State University

Meta	Title	Cover	Publish	Code	Note
UC0D8DJ6	Characterizing Communication Patterns in Distributed Large Language Model Inference				note

The University of Adelaide

Meta	Title	Cover	Publish	Code	Note
ZipVL	ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification				note

The University of Hong Kong

Meta	Title	Note
SIMPLE	Structured Pruning for Efficient Generative Pre-trained Language Models	note
OmniQuant	OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
PrefixQuant	PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization	note
FlexPrefill	FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference	note
PowerAttention	PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention	note
ReSA	Rectified Sparse Attention	note
SeerAttention-R	SeerAttention-R: Sparse Attention Adaptation for Long Reasoning	note

The University of North Carolina

Meta	Title	Cover	Publish	Code	Note
m	Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark				note

The University of Texas at Austin

Meta	Title	Code	Note
MIRAGE	MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving	note	None
MIRAGE	MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving	note	None
m	Ten Lessons We Have Learned in the New Sparseland: A Short Handbook for Sparse Neural Network Researchers
Recycled Attention	Recycled Attention: Efficient inference for long-context language models		note
R-Sparse	R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference		note

Together AI

Meta	Title	Cover	Publish	Code	Note
TEAL	Training-Free Activation Sparsity in Large Language Models				note

Tongji University

Meta	Title	Cover	Publish	Code	Note
GBDT	Pruning Large Language Models via Accuracy Predictor
PrefixQuant	PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization				note

Tsinghua University

Meta	Title	Code	Note
Deep Compression	Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
nmSPARSE	Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning
CodeGeeX	CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X		note
m	Training Transformers with 4-bit Integers
InfiniteBench	$\infty$ Bench: Extending Long Context Evaluation Beyond 100K Tokens		note
LongBench	LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding		note
RIA	Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models
m	Accelerating Transformer Pre-training with 2:4 Sparsity		note
AWQ	AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
m	A Survey on Efficient Inference for Large Language Models		note
m	Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs		note
DeepSeekMoE	DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models		note
MoA	MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression		note
MFA	Multi-matrix Factorization Attention		note
ProSparse	ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models		note
ReLU2	ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs		note
ReMoE	ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing		note
SageAttention2	SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization		note
SageAttention	SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration		note
SampleAttention	SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention		note
Turbo Sparse	Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters	Pytorch	note
AdaptiveSparseTrainer	Pruning Large Language Models with Semi-Structural Adaptive Sparse Training		note
BlockFFN	BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity		note
KVSink	KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs		note
SpargeAttn	SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference		note
SparsingLaw	Sparsing Law: Towards Large Language Models with Greater Activation Sparsity		note
XAttention	XAttention: Block Sparse Attention with Antidiagonal Scoring		note
MoBA	MoBA: Mixture of Block Attention for Long-Context LLMs		note
SageAttention3	SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training		note
NanoFlow	NanoFlow: Towards Optimal Large Language Model Serving Throughput		note
LinearPatch	A Simple Linear Patch Revives Layer-Pruned Large Language Models		note
DReSS	DReSS: Data-driven Regularized Structured Streamlining for Large Language Models		note
GLM-4.5	GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models		note
KeepKV	KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference		note
LeanK	LeanK: Learnable K Cache Channel Pruning for Efficient Decoding		note
ReSA	Rectified Sparse Attention		note
RotateKV	RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations		note
SeerAttention-R	SeerAttention-R: Sparse Attention Adaptation for Long Reasoning		note
Super-Experts-Profilling	Unveiling Super Experts in Mixture-of-Experts Large Language Models		note
FlashOverlap	FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation		note

UC Berkeley

Meta	Title	Note
ActNN	ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training
FisherPruning	A Fast Post-Training Pruning Framework for Transformers	note
PagedAttention	Efficient Memory Management for Large Language Model Serving with PagedAttention	note
LoRA+	LoRA+: Efficient Low Rank Adaptation of Large Models	note
SqueezeLLM	SqueezeLLM: Dense-and-Sparse Quantization	note
KVQuant	KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization	note
SGLang	SGLang: Efficient Execution of Structured Language Model Programs	note
DoubleSparsity	Post-Training Sparse Attention with Double Sparsity	note
HashAttention	HashAttention: Semantic Sparsity for Faster Inference	note

UC Santa Barbara

Meta	Title	Note
FlexiDepth	Adaptive Layer-skipping in Pre-trained LLMs	note
IFPruning	Instruction-Following Pruning for Large Language Models	note
KVLink	KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse	note

UCSD

Meta	Title	Cover	Publish	Code	Note
GPFQ	A Greedy Algorithm for Quantizing Neural Networks
GPFQv2	Post-training Quantization for Neural Networks with Provable Guarantees

Univeristy of Sydney

Meta	Title	Cover	Publish	Code	Note
Flash-LLM	Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity				note

Universidade da Coruña

Meta	Title	Cover	Publish	Code	Note
VENOM	VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores				note

Universidade de Lisboa

Meta	Title	Cover	Publish	Code	Note
AdaSplash	AdaSplash: Adaptive Sparse Flash Attention				note

University College London

Meta	Title	Cover	Publish	Code	Note
CATS	CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models				note

University of Basel

Meta	Title	Cover	Publish	Code	Note
Adaptively Sparse Attention	Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers

University of California

Meta	Title	Cover	Publish	Code	Note
DSA	Transformer Acceleration with Dynamic Sparse Attention				note

University of California, Berkeley

Meta	Title	Note
H2O	H $_2$ O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models	note
APEX	APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving	note
XGrammar	XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models	note
SpargeAttn	SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference	note

University of California, Riverside

Meta	Title	Cover	Publish	Code	Note
CoCoNet	Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads				note

University of California, San Diego

Meta	Title	Cover	Publish	Code	Note
H2O	H $_2$ O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models				note

University of Cambridge, United Kingdom

Meta	Title	Cover	Publish	Code	Note
TinyTrain	TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge				note

University of Chinese Academy of Sciences

Meta	Title	Note
Q-Sparse	Q-Sparse: All Large Language Models can be Fully Sparsely-Activated	note
COMET	COMET: Towards Partical W4A4KV4 LLMs Serving	note
EvolKV	EvolKV: Evolutionary KV Cache Compression for LLM Inference	note

University of Connecticut

Meta	Title	Cover	Publish	Code	Note
m	Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm

University of Edinburgh

Meta	Title	Cover	Publish	Code	Note
sparse-frontier	The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs				note

University of Electronic Science and Technology of China

Meta	Title	Cover	Publish	Code	Note
BRECQ	BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction

University of Hong Kong

Meta	Title	Cover	Publish	Code	Note
SeerAttention	SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs				note

University of Illinois Urbana-Champaign

Meta	Title	Cover	Publish	Code	Note
LISA	LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning				note
SnapKV	SnapKV: LLM Knows What You are Looking for Before Generation				note

University of Illinois at Urbana-Champaign

Meta	Title	Cover	Publish	Code	Note
MiniKV	MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache				note

University of Macau

Meta	Title	Cover	Publish	Code	Note
Awesome-Efficient-Arch	Speed Always Wins: A Survey on Efficient Architectures for Large Language Models				note

University of Maryland

Meta	Title	Cover	Publish	Code	Note
GEAR	GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM				note
topk-decoding	Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs				note

University of Massachusetts Amherst

Meta	Title	Cover	Publish	Code	Note
CoCoNet	Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads				note

University of Oxford

Meta	Title	Cover	Publish	Code	Note
SMAT	Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts				note

University of Science and Technology

Meta	Title	Cover	Publish	Code	Note
AMALI	AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs				note

University of Science and Technology of China

Meta	Title	Note
AdaKV	Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference	note
AttentionPredictor	AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference	note
HATA	HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference	note

University of Seoul

Meta	Title	Cover	Publish	Code	Note
SparseInfer	SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference				note

University of Southern California

Meta	Title	Cover	Publish	Code	Note
APEX	APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving				note

University of St Andrews

Meta	Title	Cover	Publish	Code	Note
Mosaic	Mosaic: Composite Projection Pruning for Resource-efficient LLMs				note

University of Surrey

Meta	Title	Note
MInference	MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention	note
SCBench	SCBench: A KV Cache-Centric Analysis of Long-Context Methods	note
MMInference	MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention	note
R-KV	R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration	note

University of Surrey, UK

Meta	Title	Cover	Publish	Code	Note
Selective Context	Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering

University of Technology Sydney

Meta	Title	Cover	Publish	Code	Note
TOA	Tree of Agents: Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning				note

University of Texas at Austin

Meta	Title	Note
H2O	H $_2$ O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models	note
Essential Sparsity	The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter
LLM-KICK	Compressing LLMs: The Truth is Rarely Pure and Never Simple	note
OWL	Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity

University of Toronto

Meta	Title	Cover	Publish	Code	Note
Seesaw	Seesaw: High-throughput LLM Inference via Model Re-sharding				note

University of Washington

Meta	Title	Note
QLoRA	QLoRA: Efficient Finetuning of Quantized LLMs
OWL	Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
Splitwise	Splitwise: Efficient generative LLM inference using phase splitting	note
SeerAttention	SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs	note
NSA	Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention	note
POD-Attention	POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference	note
NanoFlow	NanoFlow: Towards Optimal Large Language Model Serving Throughput	note
FlashInfer	FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving	note

University of Waterloo

Meta	Title	Cover	Publish	Code	Note
EAGLE	EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty				note

University of Wisconsin

Meta	Title	Cover	Publish	Code	Note
R-KV	R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration				note

University of Wisconsin-Madison

Meta	Title	Cover	Publish	Code	Note
T3	T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives				note
FrameQuant	FrameQuant: Flexible Low-Bit Quantization for Transformers				note

VITA Group

Meta	Title	Cover	Publish	Code	Note
m	Ten Lessons We Have Learned in the New Sparseland: A Short Handbook for Sparse Neural Network Researchers
Essential Sparsity	The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter

Vector Institute

Meta	Title	Cover	Publish	Code	Note
EAGLE	EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty				note
Seesaw	Seesaw: High-throughput LLM Inference via Model Re-sharding				note

Vizuara AI Labs

Meta	Title	Cover	Publish	Code	Note
MoE-MLA-RoPE	Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models				note

Vokram Group

Meta	Title	Cover	Publish	Code	Note
52A7RO95	Mixture of Experts in Large Language Models				note

WeChat AI

Meta	Title	Cover	Publish	Code	Note
DBudgetKV	DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance				note

Wuhan University

Meta	Title	Code	Note
m	Keep the Cost Down: A Review on Methods to Optimize LLM' s KV-Cache Consumption		note
SIFT	Sparse is Enough in Fine-tuning Pre-trained Large Language Models		note
CHESS	Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification	Pytorch	note
SpindleKV	SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers		note

Xi'an Jiaotong University

Meta	Title	Cover	Publish	Code	Note
KVmix	KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache				note

Xiamen University

Meta	Title	Cover	Publish	Code	Note
Compresso	Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models				note

Xiaohongshu

Meta	Title	Cover	Publish	Code	Note
ZigZagKV	ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty				note

Xiaomi

Meta	Title	Cover	Publish	Code	Note
SpindleKV	SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers				note

Yale University

Meta	Title	Cover	Publish	Code	Note
Diffuser	Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences

Zhe Jiang University

Meta	Title	Cover	Publish	Code	Note
Deja Vu	Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time

Zhejiang University

Meta	Title	Note
ZipCache	ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification	note
MiniCache	MiniCache: KV Cache Compression in Depth Dimension for Large Language Models	note
ZipVL	ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification	note
MoBA	MoBA: Mixture of Block Attention for Long-Context LLMs	note
SharePrefill	Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing	note

Zhipu AI

Meta	Title	Cover	Publish	Code	Note
GLM-4.5	GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models				note

Zhipu.AI

Meta	Title	Note
CodeGeeX	CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X	note
LongBench	LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding	note
SampleAttention	SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention	note

Zhongguancun Laboratory

Meta	Title	Cover	Publish	Code	Note
SMP	Pruning Pre-trained Language Models Without Fine-Tuning

baidu

Meta	Title	Cover	Publish	Code	Note
FlashMask	FlashMask: Efficient and Rich Mask Extension of FlashAttention				note

iFLYTEK Research

Meta	Title	Cover	Publish	Code	Note
GRAIN	Gradient-based Intra-attention Pruning on Pre-trained Language Models				note

inst1

Meta	Title	Note
Sparse-IFT	Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency	note
YS9YTT55	LLM Inference Serving: Survey of Recent Advances and Opportunities	note
CacheBlend	CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion	note
AhaKV	AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models	note
CCQ	CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs	note
209M5GA7	KV Cache Compression for Inference Efficiency in LLMs: A Review	note
QuickSilver	QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization	note
RadialAttention	Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation	note

inst2

Meta	Title	Note
Flash-Decoding	Flash-Decoding for long-context inference	note
Sparse-IFT	Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency	note
YS9YTT55	LLM Inference Serving: Survey of Recent Advances and Opportunities	note
CacheBlend	CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion	note
AhaKV	AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models	note
CCQ	CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs	note
209M5GA7	KV Cache Compression for Inference Efficiency in LLMs: A Review	note
QuickSilver	QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization	note
RadialAttention	Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation	note