Author

Aaron Courville

Meta Title Cover Publish Code Note
FoX Forgetting Transformer: Softmax Attention with a Forget Gate Publish GitHub Repo stars note
ACP Adaptive Computation Pruning for the Forgetting Transformer Publish GitHub Repo stars note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note

Abhay Gupta

Meta Title Cover Publish Code Note
Sparse-IFT Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency cover Publish GitHub Repo stars note
m Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment Publish GitHub Repo stars note

Adam Fisch

Meta Title Cover Publish Code Note
RecursiveTransformers Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA cover Publish note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note

Aixin Liu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ajay Jaiswal

Meta Title Cover Publish Code Note
Essential Sparsity The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter Publish GitHub Repo stars
LLM-KICK Compressing LLMs: The Truth is Rarely Pure and Never Simple cover Publish GitHub Repo stars note

Amir Gholami

Meta Title Cover Publish Code Note
FisherPruning A Fast Post-Training Pruning Framework for Transformers cover Publish GitHub Repo stars note
SqueezeLLM SqueezeLLM: Dense-and-Sparse Quantization cover Publish GitHub Repo stars note
KVQuant KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization Publish GitHub Repo stars note

Amir H. Abdi

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note

André F. T. Martins

Meta Title Cover Publish Code Note
m Efficient Methods for Natural Language Processing: A Survey cover Publish
AdaSplash AdaSplash: Adaptive Sparse Flash Attention Publish GitHub Repo stars note

Aojun Zhou

Meta Title Cover Publish Code Note
SR-STE Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch cover Publish GitHub Repo stars
STA An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers Publish
SPP SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models cover Publish GitHub Repo stars note

Arvind Krishnamurthy

Meta Title Cover Publish Code Note
NanoFlow NanoFlow: Towards Optimal Large Language Model Serving Throughput cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note

Ashish Panwar

Meta Title Cover Publish Code Note
Vidur Vidur: A Large-Scale Simulation Framework For LLM Inference cover Publish GitHub Repo stars note
POD-Attention POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference cover Publish GitHub Repo stars note
vAttention vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention cover Publish GitHub Repo stars note

Bairu Hou

Meta Title Cover Publish Code Note
IFPruning Instruction-Following Pruning for Large Language Models cover Publish note
KVLink KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse cover Publish GitHub Repo stars note

Baris Kasikci

Meta Title Cover Publish Code Note
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
NanoFlow NanoFlow: Towards Optimal Large Language Model Serving Throughput cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note

Bei Feng

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Beidi Chen

Meta Title Cover Publish Code Note
Deja Vu Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time cover Publish GitHub Repo stars
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
streaming-llm Efficient Streaming Language Models with Attention Sinks cover Publish GitHub Repo stars note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note

Bin Gao

Meta Title Cover Publish Code Note
CachedAttention Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention cover Publish note
AdaSkip AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference cover Publish GitHub Repo stars note

Bin Lin

Meta Title Cover Publish Code Note
nmSPARSE Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning Publish GitHub Repo stars
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note

Bin Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
PanguUltra Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs Publish note
Step-3 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding Publish note

Bing Xue

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Bingxuan Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Bochao Wu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Chang Chen

Meta Title Cover Publish Code Note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note

Chang Gao

Meta Title Cover Publish Code Note
m Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs cover Publish note
DeltaLLM DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference cover Publish note
Qwen3 Qwen3 Technical Report cover Publish GitHub Repo stars note

Chao Yang

Meta Title Cover Publish Code Note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note

Chaojun Xiao

Meta Title Cover Publish Code Note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
SparsingLaw Sparsing Law: Towards Large Language Models with Greater Activation Sparsity cover Publish GitHub Repo stars note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Chen Chen

Meta Title Cover Publish Code Note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note
AttentionPredictor AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference cover Publish note

Chen Zhang

Meta Title Cover Publish Code Note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note
ZigZagKV ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty cover Publish note

Chengda Lu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Chenggang Zhao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Chengqi Deng

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Chengquan Jiang

Meta Title Cover Publish Code Note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note

Chengruidong Zhang

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note
LeanK LeanK: Learnable K Cache Channel Pruning for Efficient Decoding cover Publish GitHub Repo stars note

Chenyang Song

Meta Title Cover Publish Code Note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
SparsingLaw Sparsing Law: Towards Large Language Models with Greater Activation Sparsity cover Publish GitHub Repo stars note

Chenyu Zhang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Chong Ruan

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Christos Kozyrakis

Meta Title Cover Publish Code Note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note
LIMINAL Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need Publish note

Chuang Gan

Meta Title Cover Publish Code Note
AWQ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration Publish GitHub Repo stars
QServe QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Publish Pytorch note

Clark Barrett

Meta Title Cover Publish Code Note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note

Cody Hao Yu

Meta Title Cover Publish Code Note
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note

Coleman Hooper

Meta Title Cover Publish Code Note
SqueezeLLM SqueezeLLM: Dense-and-Sparse Quantization cover Publish GitHub Repo stars note
KVQuant KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization Publish GitHub Repo stars note

Damai Dai

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Dan Alistarh

Meta Title Cover Publish Code Note
m Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference Publish
m Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks Publish
SPDY SPDY: Accurate Pruning with Speedup Guarantees cover Publish GitHub Repo stars note
OBC Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning Publish GitHub Repo stars
oBERT The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models Publish GitHub Repo stars
GPTQ GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers Publish GitHub Repo stars
SparseGPT SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot Publish GitHub Repo stars
ZipLM ZipLM: Inference-Aware Structured Pruning of Language Models cover Publish GitHub Repo stars
SpQR SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression Publish GitHub Repo stars
SquareHead Sparse Fine-tuning for Inference Acceleration of Large Language Models cover Publish GitHub Repo stars
m Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment Publish GitHub Repo stars note

Daxin Jiang

Meta Title Cover Publish Code Note
MFA Multi-matrix Factorization Attention Publish note
Step-3 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding Publish note

Daya Guo

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

DeepSeek-AI

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Dejian Yang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Deli Chen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Dianhai Yu

Meta Title Cover Publish Code Note
FlashMask FlashMask: Efficient and Rich Mask Extension of FlashAttention cover Publish GitHub Repo stars note
CCQ CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs Publish note

Dong Li

Meta Title Cover Publish Code Note
SDS Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism cover Publish note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note
PanguUltra Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs Publish note

Dongjie Ji

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Dongsheng Li

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note

Dongyang Wang

Meta Title Cover Publish Code Note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Eldar Kurtic

Meta Title Cover Publish Code Note
oBERT The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models Publish GitHub Repo stars
ZipLM ZipLM: Inference-Aware Structured Pruning of Language Models cover Publish GitHub Repo stars
SquareHead Sparse Fine-tuning for Inference Acceleration of Large Language Models cover Publish GitHub Repo stars
m Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment Publish GitHub Repo stars note

Elias Frantar

Meta Title Cover Publish Code Note
SPDY SPDY: Accurate Pruning with Speedup Guarantees cover Publish GitHub Repo stars note
OBC Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning Publish GitHub Repo stars
GPTQ GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers Publish GitHub Repo stars
SparseGPT SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot Publish GitHub Repo stars
ZipLM ZipLM: Inference-Aware Structured Pruning of Language Models cover Publish GitHub Repo stars
SquareHead Sparse Fine-tuning for Inference Acceleration of Large Language Models cover Publish GitHub Repo stars

Erhang Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Fan Yang

Meta Title Cover Publish Code Note
nmSPARSE Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning Publish GitHub Repo stars
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Fangyun Lin

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Fei Huang

Meta Title Cover Publish Code Note
CateKV CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration cover Publish note
Qwen3 Qwen3 Technical Report cover Publish GitHub Repo stars note

Fucong Dai

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Fuli Luo

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Furu Wei

Meta Title Cover Publish Code Note
Q-Sparse Q-Sparse: All Large Language Models can be Fully Sparsely-Activated cover Publish note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note

Genghan Zhang

Meta Title Cover Publish Code Note
CATS CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models cover Publish GitHub Repo stars note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note

Gongfan Fang

Meta Title Cover Publish Code Note
LLM-Pruner LLM-Pruner: On the Structural Pruning of Large Language Models cover Publish GitHub Repo stars note
MaskLLM MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models cover Publish GitHub Repo stars note

Guanchen Li

Meta Title Cover Publish Code Note
SDS Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism cover Publish note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note

Guangbo Hao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Guangxuan Xiao

Meta Title Cover Publish Code Note
streaming-llm Efficient Streaming Language Models with Attention Sinks cover Publish GitHub Repo stars note
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
QServe QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Publish Pytorch note
XAttention XAttention: Block Sparse Attention with Antidiagonal Scoring cover Publish GitHub Repo stars note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note

Guanting Chen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Guohao Dai

Meta Title Cover Publish Code Note
m A Survey on Efficient Inference for Large Language Models cover Publish note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
SpecEE SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting cover Publish GitHub Repo stars note
FlashOverlap FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation cover Publish GitHub Repo stars note

Guowei Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

H. Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Hai Zhao

Meta Title Cover Publish Code Note
m Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption cover Publish GitHub Repo stars note
SIFT Sparse is Enough in Fine-tuning Pre-trained Large Language Models Publish GitHub Repo stars note

Haibin Lin

Meta Title Cover Publish Code Note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Haibo Chen

Meta Title Cover Publish Code Note
PowerInfer PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU Publish GitHub Repo stars note
PowerInfer-2 PowerInfer-2: Fast Large Language Model Inference on a Smartphone Publish Website note
Turbo Sparse Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters Publish Pytorch note

Haifeng Wang

Meta Title Cover Publish Code Note
FlashMask FlashMask: Efficient and Rich Mask Extension of FlashAttention cover Publish GitHub Repo stars note
CCQ CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs Publish note

Han Bao

Meta Title Cover Publish Code Note
m Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs cover Publish note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Hanshi Sun

Meta Title Cover Publish Code Note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note

Hanwei Xu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Hao Zhang

Meta Title Cover Publish Code Note
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
Super-Experts-Profilling Unveiling Super Experts in Mixture-of-Experts Large Language Models cover Publish GitHub Repo stars note

Haocheng Wang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Haocheng Xi

Meta Title Cover Publish Code Note
m Training Transformers with 4-bit Integers Publish GitHub Repo stars
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note

Haofeng Huang

Meta Title Cover Publish Code Note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
SageAttention2 SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization Publish GitHub Repo stars note
SageAttention SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration Publish GitHub Repo stars note
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
XAttention XAttention: Block Sparse Attention with Antidiagonal Scoring cover Publish GitHub Repo stars note
SageAttention3 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Publish GitHub Repo stars note

Haoli Bai

Meta Title Cover Publish Code Note
RIA Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models cover Publish GitHub Repo stars
LinearPatch A Simple Linear Patch Revives Layer-Pruned Large Language Models cover Publish note
FreqKV FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension cover Publish note

Haotian Tang

Meta Title Cover Publish Code Note
TorchSparse++ TorchSparse++: Efficient Point Cloud Engine Publish GitHub Repo stars
AWQ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration Publish GitHub Repo stars
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
QServe QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Publish Pytorch note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note

Haotong Xie

Meta Title Cover Publish Code Note
PowerInfer PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU Publish GitHub Repo stars note
Turbo Sparse Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters Publish Pytorch note

Haowei Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Hayden Kwok-Hay So

Meta Title Cover Publish Code Note
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Heung-Yeung Shum

Meta Title Cover Publish Code Note
MFA Multi-matrix Factorization Attention Publish note
Step-3 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding Publish note

Honghui Ding

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Hongsheng Li

Meta Title Cover Publish Code Note
SR-STE Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch cover Publish GitHub Repo stars
SPP SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models cover Publish GitHub Repo stars note

Hrayr Harutyunyan

Meta Title Cover Publish Code Note
RecursiveTransformers Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA cover Publish note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note

Huajian Xin

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Huazuo Gao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Hui Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Hui Qu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Huiqiang Jiang

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note
LeanK LeanK: Learnable K Cache Channel Pruning for Efficient Decoding cover Publish GitHub Repo stars note

Iman Mirzadeh

Meta Title Cover Publish Code Note
LLM in a flash LLM in a flash: Efficient Large Language Model Inference with Limited Memory cover Publish note
ReLU Strikes Back ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models cover Publish GitHub Repo stars

Ion Stoica

Meta Title Cover Publish Code Note
ActNN ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training Publish GitHub Repo stars
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note
HashAttention HashAttention: Semantic Sparsity for Faster Inference cover Publish GitHub Repo stars note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note

J. L. Cai

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

J. Zico Kolter

Meta Title Cover Publish Code Note
Wanda A Simple and Effective Pruning Approach for Large Language Models cover Publish GitHub Repo stars note
massive-activations Massive Activations in Large Language Models cover Publish GitHub Repo stars note

Jan Kautz

Meta Title Cover Publish Code Note
MaskLLM MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models cover Publish GitHub Repo stars note
Minitron Compact Language Models via Pruning and Knowledge Distillation cover Publish GitHub Repo stars note

Jason D. Lee

Meta Title Cover Publish Code Note
MeZO Fine-Tuning Language Models with Just Forward Passes Publish GitHub Repo stars note
m Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark Publish GitHub Repo stars note

Jayashree Mohan

Meta Title Cover Publish Code Note
Vidur Vidur: A Large-Scale Simulation Framework For LLM Inference cover Publish GitHub Repo stars note
POD-Attention POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference cover Publish GitHub Repo stars note
vAttention vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention cover Publish GitHub Repo stars note

Jeff Pool

Meta Title Cover Publish Code Note
m Channel Permutations for N:M Sparsity Publish GitHub Repo stars
MaskLLM MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models cover Publish GitHub Repo stars note

Jia Wei

Meta Title Cover Publish Code Note
SageAttention2 SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization Publish GitHub Repo stars note
SageAttention SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration Publish GitHub Repo stars note
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
SageAttention3 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Publish GitHub Repo stars note

Jiaming Tang

Meta Title Cover Publish Code Note
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
AWQ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration Publish GitHub Repo stars
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note

Jiaming Xu

Meta Title Cover Publish Code Note
m A Survey on Efficient Inference for Large Language Models cover Publish note
SpecEE SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting cover Publish GitHub Repo stars note

Jian Liang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Jianfei Chen

Meta Title Cover Publish Code Note
ActNN ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training Publish GitHub Repo stars
m Training Transformers with 4-bit Integers Publish GitHub Repo stars
m Accelerating Transformer Pre-training with 2:4 Sparsity Publish GitHub Repo stars note
m Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs cover Publish note
ReMoE ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing cover Publish GitHub Repo stars note
SageAttention2 SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization Publish GitHub Repo stars note
SageAttention SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration Publish GitHub Repo stars note
AdaptiveSparseTrainer Pruning Large Language Models with Semi-Structural Adaptive Sparse Training cover Publish GitHub Repo stars note
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
SageAttention3 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Publish GitHub Repo stars note

Jianfeng Gao

Meta Title Cover Publish Code Note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note

Jiangfei Duan

Meta Title Cover Publish Code Note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note

Jianxi Ye

Meta Title Cover Publish Code Note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Jianyong Wang

Meta Title Cover Publish Code Note
LeanK LeanK: Learnable K Cache Channel Pruning for Efficient Decoding cover Publish GitHub Repo stars note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note

Jianzhong Guo

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Jiaqi Ni

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Jiashi Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Jiawei Wang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Jie Zhou

Meta Title Cover Publish Code Note
DBudgetKV DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance cover Publish note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Jin Chen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Jin Fang

Meta Title Cover Publish Code Note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Jingchang Chen

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Jingyang Yuan

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Jintao Zhang

Meta Title Cover Publish Code Note
SageAttention2 SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization Publish GitHub Repo stars note
SageAttention SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration Publish GitHub Repo stars note
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
SageAttention3 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Publish GitHub Repo stars note

Joseph E. Gonzalez

Meta Title Cover Publish Code Note
ActNN ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training Publish GitHub Repo stars
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note
HashAttention HashAttention: Semantic Sparsity for Faster Inference cover Publish GitHub Repo stars note

Jun Zhu

Meta Title Cover Publish Code Note
m Training Transformers with 4-bit Integers Publish GitHub Repo stars
m Accelerating Transformer Pre-training with 2:4 Sparsity Publish GitHub Repo stars note
ReMoE ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing cover Publish GitHub Repo stars note
SageAttention2 SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization Publish GitHub Repo stars note
SageAttention SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration Publish GitHub Repo stars note
AdaptiveSparseTrainer Pruning Large Language Models with Semi-Structural Adaptive Sparse Training cover Publish GitHub Repo stars note
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
SageAttention3 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Publish GitHub Repo stars note

Junjie Qiu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Junlong Li

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Junxian Guo

Meta Title Cover Publish Code Note
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
XAttention XAttention: Block Sparse Attention with Antidiagonal Scoring cover Publish GitHub Repo stars note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note

Junxiao Song

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Junyang Lin

Meta Title Cover Publish Code Note
CateKV CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration cover Publish note
Qwen3 Qwen3 Technical Report cover Publish GitHub Repo stars note

Kai Dong

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Kai Hu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Kaige Gao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Kan Zhu

Meta Title Cover Publish Code Note
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
NanoFlow NanoFlow: Towards Optimal Large Language Model Serving Throughput cover Publish GitHub Repo stars note

Kang Guan

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Kang Zhao

Meta Title Cover Publish Code Note
m Accelerating Transformer Pre-training with 2:4 Sparsity Publish GitHub Repo stars note
m Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs cover Publish note
LinearPatch A Simple Linear Patch Revives Layer-Pruned Large Language Models cover Publish note

Ke Hong

Meta Title Cover Publish Code Note
m A Survey on Efficient Inference for Large Language Models cover Publish note
FlashOverlap FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation cover Publish GitHub Repo stars note

Kehong Yuan

Meta Title Cover Publish Code Note
KVSink KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs cover Publish note
Super-Experts-Profilling Unveiling Super Experts in Mixture-of-Experts Large Language Models cover Publish GitHub Repo stars note

Kexin Huang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Kuai Yu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Kurt Keutzer

Meta Title Cover Publish Code Note
FisherPruning A Fast Post-Training Pruning Framework for Transformers cover Publish GitHub Repo stars note
SqueezeLLM SqueezeLLM: Dense-and-Sparse Quantization cover Publish GitHub Repo stars note
KVQuant KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization Publish GitHub Repo stars note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note

Lean Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Lecong Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Lefei Zhang

Meta Title Cover Publish Code Note
SIFT Sparse is Enough in Fine-tuning Pre-trained Large Language Models Publish GitHub Repo stars note
SpindleKV SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers cover Publish GitHub Repo stars note

Lei Chen

Meta Title Cover Publish Code Note
m A Survey on Large Language Model Acceleration based on KV Cache Management cover Publish GitHub Repo stars note
AttentionPredictor AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference cover Publish note

Lei Xu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Leyi Xia

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Li Dong

Meta Title Cover Publish Code Note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Li-Wen Chang

Meta Title Cover Publish Code Note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Lian Liu

Meta Title Cover Publish Code Note
COMET COMET: Towards Practical W4A4KV4 LLMs Serving cover Publish note
SDS Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism cover Publish note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note

Liang Zhao

Meta Title Cover Publish Code Note
SparseLLM SparseLLM: Towards Global Pruning for Pre-trained Language Models Publish GitHub Repo stars note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Lianmin Zheng

Meta Title Cover Publish Code Note
ActNN ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training Publish GitHub Repo stars
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note

Lili Qiu

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note
LeanK LeanK: Learnable K Cache Channel Pruning for Efficient Decoding cover Publish GitHub Repo stars note

Litong Wang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Liyue Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Lu Hou

Meta Title Cover Publish Code Note
RIA Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models cover Publish GitHub Repo stars
LinearPatch A Simple Linear Patch Revives Layer-Pruned Large Language Models cover Publish note

Mao Yang

Meta Title Cover Publish Code Note
Compresso Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models cover Publish GitHub Repo stars note
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Maosong Sun

Meta Title Cover Publish Code Note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note
ReLU2 ReLU² Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
SparsingLaw Sparsing Law: Towards Large Language Models with Greater Activation Sparsity cover Publish GitHub Repo stars note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Marcos Treviso

Meta Title Cover Publish Code Note
m Efficient Methods for Natural Language Processing: A Survey cover Publish
AdaSplash AdaSplash: Adaptive Sparse Flash Attention Publish GitHub Repo stars note

Mark Kurtz

Meta Title Cover Publish Code Note
m Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference Publish
m Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment Publish GitHub Repo stars note

Mehrdad Farajtabar

Meta Title Cover Publish Code Note
LLM in a flash LLM in a flash: Efficient Large Language Model Inference with Limited Memory cover Publish note
ReLU Strikes Back ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models cover Publish GitHub Repo stars

Meng Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Mengdi Wang

Meta Title Cover Publish Code Note
COMET COMET: Towards Practical W4A4KV4 LLMs Serving cover Publish note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note

Miaojun Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Michael Goin

Meta Title Cover Publish Code Note
SquareHead Sparse Fine-tuning for Inference Acceleration of Large Language Models cover Publish GitHub Repo stars
m Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment Publish GitHub Repo stars note

Michael Hassid

Meta Title Cover Publish Code Note
m Efficient Methods for Natural Language Processing: A Survey cover Publish
TOVA Transformers are Multi-State RNNs cover Publish GitHub Repo stars note

Michael W. Mahoney

Meta Title Cover Publish Code Note
ActNN ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training Publish GitHub Repo stars
FisherPruning A Fast Post-Training Pruning Framework for Transformers cover Publish GitHub Repo stars note
SqueezeLLM SqueezeLLM: Dense-and-Sparse Quantization cover Publish GitHub Repo stars note
KVQuant KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization Publish GitHub Repo stars note

Mingchuan Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Minghua Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Minghui Tang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Mingjie Sun

Meta Title Cover Publish Code Note
GBLM-Pruner Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models cover Publish GitHub Repo stars note
Wanda A Simple and Effective Pruning Approach for Large Language Models cover Publish GitHub Repo stars note
massive-activations Massive Activations in Large Language Models cover Publish GitHub Repo stars note

Mingming Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Minmin Sun

Meta Title Cover Publish Code Note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note
CateKV CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration cover Publish note

Ning Tian

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ningxin Zheng

Meta Title Cover Publish Code Note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Nipun Kwatra

Meta Title Cover Publish Code Note
Vidur Vidur: A Large-Scale Simulation Framework For LLM Inference cover Publish GitHub Repo stars note
TokenWeave TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference cover Publish GitHub Repo stars note

Panpan Huang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Pavlo Molchanov

Meta Title Cover Publish Code Note
MaskLLM MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models cover Publish GitHub Repo stars note
Minitron Compact Language Models via Pruning and Knowledge Distillation cover Publish GitHub Repo stars note

Peiyi Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Peng Sun

Meta Title Cover Publish Code Note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note
0VRXJQ3F Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving cover Publish GitHub Repo stars note

Peng Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Pengcheng He

Meta Title Cover Publish Code Note
AdaLoRA AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning cover Publish GitHub Repo stars
LoftQ LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models cover Publish GitHub Repo stars note

Pengfei Zuo

Meta Title Cover Publish Code Note
CachedAttention Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention cover Publish note
AdaSkip AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference cover Publish GitHub Repo stars note
Adrenaline Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation cover Publish GitHub Repo stars note
PSA Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving cover Publish GitHub Repo stars note

Pengle Zhang

Meta Title Cover Publish Code Note
SageAttention2 SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization Publish GitHub Repo stars note
SageAttention SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration Publish GitHub Repo stars note
SageAttention3 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Publish GitHub Repo stars note

Qi Hou

Meta Title Cover Publish Code Note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Qianchao Zhu

Meta Title Cover Publish Code Note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note

Qiancheng Wang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Qianhui Wu

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note

Qihao Zhu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Qinyu Chen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note
DeltaLLM DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference cover Publish note

Qiushi Du

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

R. J. Chen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

R. L. Jin

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ramachandran Ramjee

Meta Title Cover Publish Code Note
Vidur Vidur: A Large-Scale Simulation Framework For LLM Inference cover Publish GitHub Repo stars note
POD-Attention POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference cover Publish GitHub Repo stars note
vAttention vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention cover Publish GitHub Repo stars note
TokenWeave TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference cover Publish GitHub Repo stars note

Ramya Prabhu

Meta Title Cover Publish Code Note
POD-Attention POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference cover Publish GitHub Repo stars note
vAttention vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention cover Publish GitHub Repo stars note

Rayan Saab

Meta Title Cover Publish Code Note
GPFQ A Greedy Algorithm for Quantizing Neural Networks Publish GitHub Repo stars
GPFQv2 Post-training Quantization for Neural Networks with Provable Guarantees Publish GitHub Repo stars

Roy Schwartz

Meta Title Cover Publish Code Note
m Efficient Methods for Natural Language Processing: A Survey cover Publish
TOVA Transformers are Multi-State RNNs cover Publish GitHub Repo stars note

Ruihang Lai

Meta Title Cover Publish Code Note
XGrammar XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note

Ruiqi Ge

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ruisong Zhang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ruizhe Pan

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Runji Wang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Runxin Xu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ruoyu Zhang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ruyi Chen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

S. S. Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Sangmin Bae

Meta Title Cover Publish Code Note
RecursiveTransformers Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA cover Publish note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note

Saurav Muralidharan

Meta Title Cover Publish Code Note
MaskLLM MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models cover Publish GitHub Repo stars note
Minitron Compact Language Models via Pruning and Knowledge Distillation cover Publish GitHub Repo stars note

Sean Lie

Meta Title Cover Publish Code Note
Sparse-IFT Sparse Iso-FLOP Transformations for Maximizing Training Efficiency Publish GitHub Repo stars
Sparse-IFT Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency cover Publish GitHub Repo stars note
m Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment Publish GitHub Repo stars note

Sehoon Kim

Meta Title Cover Publish Code Note
FisherPruning A Fast Post-Training Pruning Framework for Transformers cover Publish GitHub Repo stars note
SqueezeLLM SqueezeLLM: Dense-and-Sparse Quantization cover Publish GitHub Repo stars note
KVQuant KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization Publish GitHub Repo stars note

Shang Yang

Meta Title Cover Publish Code Note
AWQ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration Publish GitHub Repo stars
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
QServe QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Publish Pytorch note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note

Shanghao Lu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shangyan Zhou

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shanhuang Chen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shaoqing Wu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shengen Yan

Meta Title Cover Publish Code Note
m A Survey on Efficient Inference for Large Language Models cover Publish note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note

Shengfeng Ye

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shijie Cao

Meta Title Cover Publish Code Note
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Shirong Ma

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shiwei Liu

Meta Title Cover Publish Code Note
m Ten Lessons We Have Learned in the New Sparseland: A Short Handbook for Sparse Neural Network Researchers Publish
Essential Sparsity The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter Publish GitHub Repo stars
DSnoT Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs cover Publish GitHub Repo stars note
OWL Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity cover Publish GitHub Repo stars

Shiyao Li

Meta Title Cover Publish Code Note
m A Survey on Efficient Inference for Large Language Models cover Publish note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note

Shiyu Chang

Meta Title Cover Publish Code Note
IFPruning Instruction-Following Pruning for Large Language Models cover Publish note
KVLink KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse cover Publish GitHub Repo stars note

Shiyu Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shreyas Saxena

Meta Title Cover Publish Code Note
SPDF SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models Publish
Sparse-IFT Sparse Iso-FLOP Transformations for Maximizing Training Efficiency Publish GitHub Repo stars
Sparse-IFT Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency cover Publish GitHub Repo stars note

Shuang Zhou

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shuiping Yu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shunfeng Zhou

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shuo Yang

Meta Title Cover Publish Code Note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note
HashAttention HashAttention: Semantic Sparsity for Faster Inference cover Publish GitHub Repo stars note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note

Shuting Pan

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Size Zheng

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Song Han

Meta Title Cover Publish Code Note
Deep Compression Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding Publish
DSD DSD: Dense-Sparse-Dense Training for Deep Neural Networks Publish
SparseViT SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer cover Publish GitHub Repo stars note
TorchSparse++ TorchSparse++: Efficient Point Cloud Engine Publish GitHub Repo stars
streaming-llm Efficient Streaming Language Models with Attention Sinks cover Publish GitHub Repo stars note
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
AWQ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration Publish GitHub Repo stars
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
QServe QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Publish Pytorch note
XAttention XAttention: Block Sparse Attention with Antidiagonal Scoring cover Publish GitHub Repo stars note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note

Stephanie Wang

Meta Title Cover Publish Code Note
NanoFlow NanoFlow: Towards Optimal Large Language Model Serving Throughput cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note

Surin Ahn

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note

T. Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Tal Schuster

Meta Title Cover Publish Code Note
RecursiveTransformers Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA cover Publish note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note

Tao Xie

Meta Title Cover Publish Code Note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note
RaaS Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity cover Publish note

Tao Yuan

Meta Title Cover Publish Code Note
m Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs cover Publish note
LinearPatch A Simple Linear Patch Revives Layer-Pruned Large Language Models cover Publish note

Tao Yun

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Tian Pei

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Tianle Cai

Meta Title Cover Publish Code Note
SnapKV SnapKV: LLM Knows What You are Looking for Before Generation cover Publish GitHub Repo stars note
TEAL Training-Free Activation Sparsity in Large Language Models cover Publish GitHub Repo stars note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note

Tianlong Chen

Meta Title Cover Publish Code Note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
m Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark Publish GitHub Repo stars note

Tianqi Chen

Meta Title Cover Publish Code Note
XGrammar XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note

Tianqi Wu

Meta Title Cover Publish Code Note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
FlashOverlap FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation cover Publish GitHub Repo stars note

Tianyu Fu

Meta Title Cover Publish Code Note
m A Survey on Efficient Inference for Large Language Models cover Publish note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note

Tianyu Gao

Meta Title Cover Publish Code Note
MeZO Fine-Tuning Language Models with Just Forward Passes Publish GitHub Repo stars note
LLM-shearing Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning cover Publish GitHub Repo stars note

Tianyu Sun

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Tianzhu Ye

Meta Title Cover Publish Code Note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Tim Dettmers

Meta Title Cover Publish Code Note
QLoRA QLoRA: Efficient Finetuning of Quantized LLMs cover Publish GitHub Repo stars
SpQR SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression Publish GitHub Repo stars

Ting Cao

Meta Title Cover Publish Code Note
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Tong Yang

Meta Title Cover Publish Code Note
HATA HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference cover Publish GitHub Repo stars note
KeepKV KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference cover Publish note

Torsten Hoefler

Meta Title Cover Publish Code Note
m Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks Publish
VENOM VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores cover Publish GitHub Repo stars note
SliceGPT SliceGPT: Compress Large Language Models by Deleting Rows and Columns cover Publish GitHub Repo stars note

Tri Dao

Meta Title Cover Publish Code Note
FlashAttention FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness cover Publish GitHub Repo stars
Flash-Decoding Flash-Decoding for long-context inference Publish note
FlashAttention-2 FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning Publish GitHub Repo stars
GLA Hardware-Efficient Attention for Fast Decoding cover Publish GitHub Repo stars note

Tuo Zhao

Meta Title Cover Publish Code Note
AdaLoRA AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning cover Publish GitHub Repo stars
LoSparse Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation cover Publish GitHub Repo stars
LoftQ LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models cover Publish GitHub Repo stars note

Vithursan Thangarasa

Meta Title Cover Publish Code Note
SPDF SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models Publish
Sparse-IFT Sparse Iso-FLOP Transformations for Maximizing Training Efficiency Publish GitHub Repo stars
Sparse-IFT Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency cover Publish GitHub Repo stars note

W. L. Xiao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Wangding Zeng

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Wanjia Zhao

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Wei An

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Wei Lin

Meta Title Cover Publish Code Note
Flash-LLM Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity cover Publish GitHub Repo stars note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note

Wei Wang

Meta Title Cover Publish Code Note
BRECQ BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction Publish GitHub Repo stars
PowerAttention PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention cover Publish note

Weilin Zhao

Meta Title Cover Publish Code Note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Weiyu Huang

Meta Title Cover Publish Code Note
m Accelerating Transformer Pre-training with 2:4 Sparsity Publish GitHub Repo stars note
AdaptiveSparseTrainer Pruning Large Language Models with Semi-Structural Adaptive Sparse Training cover Publish GitHub Repo stars note

Weizhu Chen

Meta Title Cover Publish Code Note
LoRA LoRA: Low-rank adaptation of large language models cover Publish GitHub Repo stars
AdaLoRA AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning cover Publish GitHub Repo stars
LoftQ LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models cover Publish GitHub Repo stars note

Wen Liu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Wenfeng Liang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Wenjun Gao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Wenlei Bao

Meta Title Cover Publish Code Note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Wenqin Yu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Wentao Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Woosuk Kwon

Meta Title Cover Publish Code Note
FisherPruning A Fast Post-Training Pruning Framework for Transformers cover Publish GitHub Repo stars note
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
APEX APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving Publish GitHub Repo stars note

Wulong Liu

Meta Title Cover Publish Code Note
AttentionPredictor AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference cover Publish note
PanguUltra Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs Publish note

X. Q. Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiafei Qiu

Meta Title Cover Publish Code Note
Flash-LLM Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity cover Publish GitHub Repo stars note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note

Xiandong Zhao

Meta Title Cover Publish Code Note
SDS Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism cover Publish note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note

Xiang Liu

Meta Title Cover Publish Code Note
LISA LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning Publish note
ChunkKV ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference cover Publish note

Xiangyu Zhang

Meta Title Cover Publish Code Note
MFA Multi-matrix Factorization Attention Publish note
Step-3 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding Publish note

Xiangyue Jin

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xianzhi Yu

Meta Title Cover Publish Code Note
LinearPatch A Simple Linear Patch Revives Layer-Pruned Large Language Models cover Publish note
AttentionPredictor AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference cover Publish note

Xianzu Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiao Bi

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiaodong Liu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiaohan Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiaojin Shen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiaokang Chen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiaokang Zhang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiaosha Chen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiaotao Nie

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiaowei Li

Meta Title Cover Publish Code Note
COMET COMET: Towards Practical W4A4KV4 LLMs Serving cover Publish note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note

Xiaowen Sun

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiaoxiang Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xin Chen

Meta Title Cover Publish Code Note
QA-LoRA QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models cover Publish GitHub Repo stars note
LaRoSA La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation cover Publish note

Xin Cheng

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xin Jin

Meta Title Cover Publish Code Note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note

Xin Liu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note
KeepKV KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference cover Publish note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Xin Xie

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xinchao Wang

Meta Title Cover Publish Code Note
LLM-Pruner LLM-Pruner: On the Structural Pruning of Large Language Models cover Publish GitHub Repo stars note
MaskLLM MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models cover Publish GitHub Repo stars note

Xingchao Liu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xingcheng Zhang

Meta Title Cover Publish Code Note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note

Xingkai Yu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xinnan Song

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xinxia Shan

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xinyi Zhou

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xinyu Yang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xinyu Zhou

Meta Title Cover Publish Code Note
MoBA MoBA: Mixture of Block Attention for Long-Context LLMs cover Publish GitHub Repo stars note
0VRXJQ3F Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving cover Publish GitHub Repo stars note

Xinyuan Li

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiuhong Li

Meta Title Cover Publish Code Note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note
m A Survey on Efficient Inference for Large Language Models cover Publish note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note
FlashOverlap FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation cover Publish GitHub Repo stars note

Xu Han

Meta Title Cover Publish Code Note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
SparsingLaw Sparsing Law: Towards Large Language Models with Greater Activation Sparsity cover Publish GitHub Repo stars note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Xu Owen He

Meta Title Cover Publish Code Note
FoX Forgetting Transformer: Softmax Attention with a Forget Gate Publish GitHub Repo stars note
ACP Adaptive Computation Pruning for the Forgetting Transformer Publish GitHub Repo stars note

Xuecheng Su

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xuefei Ning

Meta Title Cover Publish Code Note
m A Survey on Efficient Inference for Large Language Models cover Publish note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note

Xuegui Zheng

Meta Title Cover Publish Code Note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Xufang Luo

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note

Xuheng Lin

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Y. K. Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Y. Q. Wang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Y. Wu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note

Y. X. Wei

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Y. X. Zhu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yang Li

Meta Title Cover Publish Code Note
CodeGeeX CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X Publish GitHub Repo stars note
ChunkAttention ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition cover Publish GitHub Repo stars note

Yang Zhang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yanhong Xu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish note

Yankai Lin

Meta Title Cover Publish Code Note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Yann Le Cun

Meta Title Cover Publish Code Note
OBD Optimal Brain Damage Publish

Yanping Huang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yao Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yao Zhao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yaofeng Sun

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yaohui Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yaohui Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yehui Tang

Meta Title Cover Publish Code Note
SlimLLM SlimLLM: Accurate Structured Pruning for Large Language Models Publish note
PanguUltra Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs Publish note

Yi Yu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yi Zheng

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yichao Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yifan Shi

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yikai Zhang

Meta Title Cover Publish Code Note
PowerAttention PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention cover Publish note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note

Yiliang Xiong

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yilong Zhao

Meta Title Cover Publish Code Note
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
XGrammar XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models cover Publish GitHub Repo stars note
NanoFlow NanoFlow: Towards Optimal Large Language Model Serving Throughput cover Publish GitHub Repo stars note

Ying He

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ying Sheng

Meta Title Cover Publish Code Note
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note

Ying Tang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yingfa Chen

Meta Title Cover Publish Code Note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
SparsingLaw Sparsing Law: Towards Large Language Models with Greater Activation Sparsity cover Publish GitHub Repo stars note

Yinhe Han

Meta Title Cover Publish Code Note
COMET COMET: Towards Practical W4A4KV4 LLMs Serving cover Publish note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note

Yishi Piao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yisong Wang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yiwu Yao

Meta Title Cover Publish Code Note
DSnoT Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs cover Publish GitHub Repo stars note
AmberPruner Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models Publish note

Yixiao Li

Meta Title Cover Publish Code Note
LoSparse Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation cover Publish GitHub Repo stars
LoftQ LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models cover Publish GitHub Repo stars note

Yixin Dong

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
XGrammar XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models cover Publish GitHub Repo stars note

Yixin Song

Meta Title Cover Publish Code Note
PowerInfer PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU Publish GitHub Repo stars note
PowerInfer-2 PowerInfer-2: Fast Large Language Model Inference on a Smartphone Publish Website note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
Turbo Sparse Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters Publish Pytorch note

Yixuan Tan

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yiyang Ma

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yiyuan Liu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yiyuan Ma

Meta Title Cover Publish Code Note
FlexPrefill FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference cover Publish GitHub Repo stars note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note

Yizhao Gao

Meta Title Cover Publish Code Note
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Yong Li

Meta Title Cover Publish Code Note
Flash-LLM Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity cover Publish GitHub Repo stars note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note
CateKV CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration cover Publish note

Yongqiang Guo

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yu Cheng

Meta Title Cover Publish Code Note
AdaLoRA AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning cover Publish GitHub Repo stars
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note
Awesome-Efficient-Arch Speed Always Wins: A Survey on Efficient Architectures for Large Language Models cover Publish GitHub Repo stars note

Yu Wang

Meta Title Cover Publish Code Note
m A Survey on Efficient Inference for Large Language Models cover Publish note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
FlashOverlap FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation cover Publish GitHub Repo stars note

Yu Wu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yuan Ou

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yuandong Tian

Meta Title Cover Publish Code Note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
streaming-llm Efficient Streaming Language Models with Attention Sinks cover Publish GitHub Repo stars note
R-Sparse R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference cover Publish GitHub Repo stars note

Yuchen Zhu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yucheng Li

Meta Title Cover Publish Code Note
Selective Context Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering cover Publish GitHub Repo stars
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note

Yuduan Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yue Gong

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yuezhou Hu

Meta Title Cover Publish Code Note
m Accelerating Transformer Pre-training with 2:4 Sparsity Publish GitHub Repo stars note
AdaptiveSparseTrainer Pruning Large Language Models with Semi-Structural Adaptive Sparse Training cover Publish GitHub Repo stars note

Yuheng Zou

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yuhui Xu

Meta Title Cover Publish Code Note
QA-LoRA QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models cover Publish GitHub Repo stars note
SPP SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models cover Publish GitHub Repo stars note

Yujia He

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yujun Lin

Meta Title Cover Publish Code Note
QServe QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Publish Pytorch note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note
RadialAttention Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation Publish note

Yukun Zha

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yulhwa Kim

Meta Title Cover Publish Code Note
L4Q L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ cover Publish note
FastKV FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation cover Publish GitHub Repo stars note

Yunfan Xiong

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yunhe Wang

Meta Title Cover Publish Code Note
SlimLLM SlimLLM: Accurate Structured Pruning for Large Language Models Publish note
PanguUltra Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs Publish note

Yunxian Ma

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yuqing Xia

Meta Title Cover Publish Code Note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Yuqing Yang

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note
LeanK LeanK: Learnable K Cache Channel Pruning for Efficient Decoding cover Publish GitHub Repo stars note

Yutao Sun

Meta Title Cover Publish Code Note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Yuting Yan

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yuxiang Luo

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yuxiang You

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yuxin Wu

Meta Title Cover Publish Code Note
AVSS AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis cover Publish note
MoBA MoBA: Mixture of Block Attention for Long-Context LLMs cover Publish GitHub Repo stars note

Yuxiong He

Meta Title Cover Publish Code Note
ZeroQuant ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers Publish GitHub Repo stars
ZeroQuant-V2 ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation Publish GitHub Repo stars

Yuxuan Li

Meta Title Cover Publish Code Note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Yuxuan Liu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yuyang Zhou

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Z. F. Wu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Z. Z. Ren

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zefan Cai

Meta Title Cover Publish Code Note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note
KVCache-Factory Unified KV Cache Compression Methods for Auto-Regressive Models Publish GitHub Repo stars note

Zehui Ren

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zeyu Mi

Meta Title Cover Publish Code Note
PowerInfer PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU Publish GitHub Repo stars note
PowerInfer-2 PowerInfer-2: Fast Large Language Model Inference on a Smartphone Publish Website note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
Turbo Sparse Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters Publish Pytorch note

Zhangli Sha

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhangyang Wang

Meta Title Cover Publish Code Note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
m Ten Lessons We Have Learned in the New Sparseland: A Short Handbook for Sparse Neural Network Researchers Publish
Essential Sparsity The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter Publish GitHub Repo stars
LLM-KICK Compressing LLMs: The Truth is Rarely Pure and Never Simple cover Publish GitHub Repo stars note
OWL Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity cover Publish GitHub Repo stars
m Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark Publish GitHub Repo stars note
R-Sparse R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference cover Publish GitHub Repo stars note

Zhe Fu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhean Xu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhen Dong

Meta Title Cover Publish Code Note
SqueezeLLM SqueezeLLM: Dense-and-Sparse Quantization cover Publish GitHub Repo stars note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note

Zhen Huang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhen Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhenda Xie

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhengyan Zhang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
Turbo Sparse Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters Publish Pytorch note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhenyu Zhang

Meta Title Cover Publish Code Note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
OWL Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity cover Publish GitHub Repo stars
R-Sparse R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference cover Publish GitHub Repo stars note

Zhewei Yao

Meta Title Cover Publish Code Note
ActNN ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training Publish GitHub Repo stars
ZeroQuant ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers Publish GitHub Repo stars
ZeroQuant-V2 ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation Publish GitHub Repo stars

Zhewen Hao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhibin Gou

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhicheng Ma

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhigang Yan

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhihang Yuan

Meta Title Cover Publish Code Note
RPTQ RPTQ: Reorder-based Post-training Quantization for Large Language Models Publish GitHub Repo stars note
m A Survey on Efficient Inference for Large Language Models cover Publish note

Zhihong Shao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhilin Yang

Meta Title Cover Publish Code Note
CodeGeeX CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X Publish GitHub Repo stars note
MoBA MoBA: Mixture of Block Attention for Long-Context LLMs cover Publish GitHub Repo stars note

Zhipeng Xu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhixuan Lin

Meta Title Cover Publish Code Note
FoX Forgetting Transformer: Softmax Attention with a Forget Gate Publish GitHub Repo stars note
ACP Adaptive Computation Pruning for the Forgetting Transformer Publish GitHub Repo stars note

Zhiyu Wu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhiyuan Liu

Meta Title Cover Publish Code Note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
SparsingLaw Sparsing Law: Towards Large Language Models with Greater Activation Sparsity cover Publish GitHub Repo stars note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Zhongyu Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhou Yu

Meta Title Cover Publish Code Note
CachedAttention Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention cover Publish note
Adrenaline Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation cover Publish GitHub Repo stars note

Zhuang Liu

Meta Title Cover Publish Code Note
Wanda A Simple and Effective Pruning Approach for Large Language Models cover Publish GitHub Repo stars note
massive-activations Massive Activations in Large Language Models cover Publish GitHub Repo stars note

Zhuomin He

Meta Title Cover Publish Code Note
CachedAttention Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention cover Publish note
AdaSkip AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference cover Publish GitHub Repo stars note

Zhuoshu Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zihan Wang

Meta Title Cover Publish Code Note
CodeGeeX CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X Publish GitHub Repo stars note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
KeepKV KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference cover Publish note

Zihao Ye

Meta Title Cover Publish Code Note
NanoFlow NanoFlow: Towards Optimal Large Language Model Serving Throughput cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note

Ziheng Jiang

Meta Title Cover Publish Code Note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Zihui Gu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zijia Zhu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zijun Liu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zili Wang

Meta Title Cover Publish Code Note
MFA Multi-matrix Factorization Attention Publish note
Step-3 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding Publish note

Zilin Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ziqing Yang

Meta Title Cover Publish Code Note
TextPruner TextPruner: A Model Pruning Toolkit for Pre-Trained Language Models cover Publish GitHub Repo stars
GRAIN Gradient-based Intra-attention Pruning on Pre-trained Language Models cover Publish GitHub Repo stars note

Ziwei Ji

Meta Title Cover Publish Code Note
RecursiveTransformers Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA cover Publish note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note

Ziwei Xie

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zixiao Huang

Meta Title Cover Publish Code Note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
FlashOverlap FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation cover Publish GitHub Repo stars note

Zixuan Zhou

Meta Title Cover Publish Code Note
m A Survey on Efficient Inference for Large Language Models cover Publish note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Ziyang Song

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ziyi Gao

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zizheng Pan

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zuchao Li

Meta Title Cover Publish Code Note
m Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption cover Publish GitHub Repo stars note
SIFT Sparse is Enough in Fine-tuning Pre-trained Large Language Models Publish GitHub Repo stars note
SpindleKV SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers cover Publish GitHub Repo stars note

Zunhai Su

Meta Title Cover Publish Code Note
KVSink KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs cover Publish note
Super-Experts-Profilling Unveiling Super Experts in Mixture-of-Experts Large Language Models cover Publish GitHub Repo stars note