Author

Aaron Courville

Meta Title Cover Publish Code Note
FoX Forgetting Transformer: Softmax Attention with a Forget Gate Publish GitHub Repo stars note
ACP Adaptive Computation Pruning for the Forgetting Transformer Publish GitHub Repo stars note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note

Abhay Gupta

Meta Title Cover Publish Code Note
Sparse-IFT Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency cover Publish GitHub Repo stars note
m Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment Publish GitHub Repo stars note

Adam Fisch

Meta Title Cover Publish Code Note
RecursiveTransformers Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA cover Publish note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note

Aixin Liu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ajay Jaiswal

Meta Title Cover Publish Code Note
Essential Sparsity The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter Publish GitHub Repo stars
LLM-KICK Compressing LLMs: The Truth is Rarely Pure and Never Simple cover Publish GitHub Repo stars note

Amir Gholami

Meta Title Cover Publish Code Note
FisherPruning A Fast Post-Training Pruning Framework for Transformers cover Publish GitHub Repo stars note
SqueezeLLM SqueezeLLM: Dense-and-Sparse Quantization cover Publish GitHub Repo stars note
KVQuant KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization Publish GitHub Repo stars note

Amir H. Abdi

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note

André F. T. Martins

Meta Title Cover Publish Code Note
m Efficient Methods for Natural Language Processing: A Survey cover Publish
AdaSplash AdaSplash: Adaptive Sparse Flash Attention Publish GitHub Repo stars note

Aojun Zhou

Meta Title Cover Publish Code Note
SR-STE Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch cover Publish GitHub Repo stars
STA An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers Publish
SPP SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models cover Publish GitHub Repo stars note

Arvind Krishnamurthy

Meta Title Cover Publish Code Note
NanoFlow NanoFlow: Towards Optimal Large Language Model Serving Throughput cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note

Ashish Panwar

Meta Title Cover Publish Code Note
Vidur Vidur: A Large-Scale Simulation Framework For LLM Inference cover Publish GitHub Repo stars note
POD-Attention POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference cover Publish GitHub Repo stars note
vAttention vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention cover Publish GitHub Repo stars note

Bairu Hou

Meta Title Cover Publish Code Note
IFPruning Instruction-Following Pruning for Large Language Models cover Publish note
KVLink KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse cover Publish GitHub Repo stars note

Baris Kasikci

Meta Title Cover Publish Code Note
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
NanoFlow NanoFlow: Towards Optimal Large Language Model Serving Throughput cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note

Bei Feng

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Beidi Chen

Meta Title Cover Publish Code Note
Deja Vu Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time cover Publish GitHub Repo stars
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
streaming-llm Efficient Streaming Language Models with Attention Sinks cover Publish GitHub Repo stars note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note

Bin Gao

Meta Title Cover Publish Code Note
CachedAttention Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention cover Publish note
AdaSkip AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference cover Publish GitHub Repo stars note

Bin Lin

Meta Title Cover Publish Code Note
nmSPARSE Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning Publish GitHub Repo stars
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note

Bin Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
PanguUltra Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs Publish note
Step-3 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding Publish note

Bing Xue

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Bingxuan Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Bochao Wu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Chang Chen

Meta Title Cover Publish Code Note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note

Chang Gao

Meta Title Cover Publish Code Note
m Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs cover Publish note
DeltaLLM DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference cover Publish note
Qwen3 Qwen3 Technical Report cover Publish GitHub Repo stars note

Chao Yang

Meta Title Cover Publish Code Note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note

Chaojun Xiao

Meta Title Cover Publish Code Note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
SparsingLaw Sparsing Law: Towards Large Language Models with Greater Activation Sparsity cover Publish GitHub Repo stars note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Chen Chen

Meta Title Cover Publish Code Note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note
AttentionPredictor AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference cover Publish note

Chen Zhang

Meta Title Cover Publish Code Note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note
ZigZagKV ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty cover Publish note

Chengda Lu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Chenggang Zhao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Chengqi Deng

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Chengquan Jiang

Meta Title Cover Publish Code Note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note

Chengruidong Zhang

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note
LeanK LeanK: Learnable K Cache Channel Pruning for Efficient Decoding cover Publish GitHub Repo stars note

Chenyang Song

Meta Title Cover Publish Code Note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
SparsingLaw Sparsing Law: Towards Large Language Models with Greater Activation Sparsity cover Publish GitHub Repo stars note

Chenyu Zhang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Chong Ruan

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Christos Kozyrakis

Meta Title Cover Publish Code Note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note
LIMINAL Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need Publish note

Chuang Gan

Meta Title Cover Publish Code Note
AWQ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration Publish GitHub Repo stars
QServe QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Publish Pytorch note

Clark Barrett

Meta Title Cover Publish Code Note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note

Cody Hao Yu

Meta Title Cover Publish Code Note
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note

Coleman Hooper

Meta Title Cover Publish Code Note
SqueezeLLM SqueezeLLM: Dense-and-Sparse Quantization cover Publish GitHub Repo stars note
KVQuant KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization Publish GitHub Repo stars note

Damai Dai

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Dan Alistarh

Meta Title Cover Publish Code Note
m Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference Publish
m Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks Publish
SPDY SPDY: Accurate Pruning with Speedup Guarantees cover Publish GitHub Repo stars note
OBC Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning Publish GitHub Repo stars
oBERT The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models Publish GitHub Repo stars
GPTQ GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers Publish GitHub Repo stars
SparseGPT SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot Publish GitHub Repo stars
ZipLM ZipLM: Inference-Aware Structured Pruning of Language Models cover Publish GitHub Repo stars
SpQR SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression Publish GitHub Repo stars
SquareHead Sparse Fine-tuning for Inference Acceleration of Large Language Models cover Publish GitHub Repo stars
m Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment Publish GitHub Repo stars note

Daxin Jiang

Meta Title Cover Publish Code Note
MFA Multi-matrix Factorization Attention Publish note
Step-3 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding Publish note

Daya Guo

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

DeepSeek-AI

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Dejian Yang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Deli Chen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Dianhai Yu

Meta Title Cover Publish Code Note
FlashMask FlashMask: Efficient and Rich Mask Extension of FlashAttention cover Publish GitHub Repo stars note
CCQ CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs Publish note

Dong Li

Meta Title Cover Publish Code Note
SDS Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism cover Publish note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note
PanguUltra Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs Publish note

Dongjie Ji

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Dongsheng Li

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note

Dongyang Wang

Meta Title Cover Publish Code Note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Eldar Kurtic

Meta Title Cover Publish Code Note
oBERT The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models Publish GitHub Repo stars
ZipLM ZipLM: Inference-Aware Structured Pruning of Language Models cover Publish GitHub Repo stars
SquareHead Sparse Fine-tuning for Inference Acceleration of Large Language Models cover Publish GitHub Repo stars
m Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment Publish GitHub Repo stars note

Elias Frantar

Meta Title Cover Publish Code Note
SPDY SPDY: Accurate Pruning with Speedup Guarantees cover Publish GitHub Repo stars note
OBC Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning Publish GitHub Repo stars
GPTQ GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers Publish GitHub Repo stars
SparseGPT SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot Publish GitHub Repo stars
ZipLM ZipLM: Inference-Aware Structured Pruning of Language Models cover Publish GitHub Repo stars
SquareHead Sparse Fine-tuning for Inference Acceleration of Large Language Models cover Publish GitHub Repo stars

Erhang Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Fan Yang

Meta Title Cover Publish Code Note
nmSPARSE Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning Publish GitHub Repo stars
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Fangyun Lin

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Fei Huang

Meta Title Cover Publish Code Note
CateKV CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration cover Publish note
Qwen3 Qwen3 Technical Report cover Publish GitHub Repo stars note

Fucong Dai

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Fuli Luo

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Furu Wei

Meta Title Cover Publish Code Note
Q-Sparse Q-Sparse: All Large Language Models can be Fully Sparsely-Activated cover Publish note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note

Genghan Zhang

Meta Title Cover Publish Code Note
CATS CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models cover Publish GitHub Repo stars note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note

Gongfan Fang

Meta Title Cover Publish Code Note
LLM-Pruner LLM-Pruner: On the Structural Pruning of Large Language Models cover Publish GitHub Repo stars note
MaskLLM MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models cover Publish GitHub Repo stars note

Guanchen Li

Meta Title Cover Publish Code Note
SDS Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism cover Publish note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note

Guangbo Hao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Guangxuan Xiao

Meta Title Cover Publish Code Note
streaming-llm Efficient Streaming Language Models with Attention Sinks cover Publish GitHub Repo stars note
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
QServe QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Publish Pytorch note
XAttention XAttention: Block Sparse Attention with Antidiagonal Scoring cover Publish GitHub Repo stars note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note

Guanting Chen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Guohao Dai

Meta Title Cover Publish Code Note
m A Survey on Efficient Inference for Large Language Models cover Publish note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
SpecEE SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting cover Publish GitHub Repo stars note
FlashOverlap FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation cover Publish GitHub Repo stars note

Guowei Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

H. Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Hai Zhao

Meta Title Cover Publish Code Note
m Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption cover Publish GitHub Repo stars note
SIFT Sparse is Enough in Fine-tuning Pre-trained Large Language Models Publish GitHub Repo stars note

Haibin Lin

Meta Title Cover Publish Code Note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Haibo Chen

Meta Title Cover Publish Code Note
PowerInfer PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU Publish GitHub Repo stars note
PowerInfer-2 PowerInfer-2: Fast Large Language Model Inference on a Smartphone Publish Website note
Turbo Sparse Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters Publish Pytorch note

Haifeng Wang

Meta Title Cover Publish Code Note
FlashMask FlashMask: Efficient and Rich Mask Extension of FlashAttention cover Publish GitHub Repo stars note
CCQ CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs Publish note

Han Bao

Meta Title Cover Publish Code Note
m Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs cover Publish note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Hanshi Sun

Meta Title Cover Publish Code Note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note

Hanwei Xu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Hao Zhang

Meta Title Cover Publish Code Note
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
Super-Experts-Profilling Unveiling Super Experts in Mixture-of-Experts Large Language Models cover Publish GitHub Repo stars note

Haocheng Wang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Haocheng Xi

Meta Title Cover Publish Code Note
m Training Transformers with 4-bit Integers Publish GitHub Repo stars
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note

Haofeng Huang

Meta Title Cover Publish Code Note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
SageAttention2 SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization Publish GitHub Repo stars note
SageAttention SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration Publish GitHub Repo stars note
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
XAttention XAttention: Block Sparse Attention with Antidiagonal Scoring cover Publish GitHub Repo stars note
SageAttention3 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Publish GitHub Repo stars note

Haoli Bai

Meta Title Cover Publish Code Note
RIA Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models cover Publish GitHub Repo stars
LinearPatch A Simple Linear Patch Revives Layer-Pruned Large Language Models cover Publish note
FreqKV FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension cover Publish note

Haotian Tang

Meta Title Cover Publish Code Note
TorchSparse++ TorchSparse++: Efficient Point Cloud Engine Publish GitHub Repo stars
AWQ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration Publish GitHub Repo stars
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
QServe QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Publish Pytorch note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note

Haotong Xie

Meta Title Cover Publish Code Note
PowerInfer PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU Publish GitHub Repo stars note
Turbo Sparse Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters Publish Pytorch note

Haowei Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Hayden Kwok-Hay So

Meta Title Cover Publish Code Note
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Heung-Yeung Shum

Meta Title Cover Publish Code Note
MFA Multi-matrix Factorization Attention Publish note
Step-3 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding Publish note

Honghui Ding

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Hongsheng Li

Meta Title Cover Publish Code Note
SR-STE Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch cover Publish GitHub Repo stars
SPP SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models cover Publish GitHub Repo stars note

Hrayr Harutyunyan

Meta Title Cover Publish Code Note
RecursiveTransformers Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA cover Publish note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note

Huajian Xin

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Huazuo Gao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Hui Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Hui Qu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Huiqiang Jiang

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note
LeanK LeanK: Learnable K Cache Channel Pruning for Efficient Decoding cover Publish GitHub Repo stars note

Iman Mirzadeh

Meta Title Cover Publish Code Note
LLM in a flash LLM in a flash: Efficient Large Language Model Inference with Limited Memory cover Publish note
ReLU Strikes Back ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models cover Publish GitHub Repo stars

Ion Stoica

Meta Title Cover Publish Code Note
ActNN ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training Publish GitHub Repo stars
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note
HashAttention HashAttention: Semantic Sparsity for Faster Inference cover Publish GitHub Repo stars note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note

J. L. Cai

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

J. Zico Kolter

Meta Title Cover Publish Code Note
Wanda A Simple and Effective Pruning Approach for Large Language Models cover Publish GitHub Repo stars note
massive-activations Massive Activations in Large Language Models cover Publish GitHub Repo stars note

Jan Kautz

Meta Title Cover Publish Code Note
MaskLLM MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models cover Publish GitHub Repo stars note
Minitron Compact Language Models via Pruning and Knowledge Distillation cover Publish GitHub Repo stars note

Jason D. Lee

Meta Title Cover Publish Code Note
MeZO Fine-Tuning Language Models with Just Forward Passes Publish GitHub Repo stars note
m Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark Publish GitHub Repo stars note

Jayashree Mohan

Meta Title Cover Publish Code Note
Vidur Vidur: A Large-Scale Simulation Framework For LLM Inference cover Publish GitHub Repo stars note
POD-Attention POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference cover Publish GitHub Repo stars note
vAttention vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention cover Publish GitHub Repo stars note

Jeff Pool

Meta Title Cover Publish Code Note
m Channel Permutations for N:M Sparsity Publish GitHub Repo stars
MaskLLM MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models cover Publish GitHub Repo stars note

Jia Wei

Meta Title Cover Publish Code Note
SageAttention2 SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization Publish GitHub Repo stars note
SageAttention SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration Publish GitHub Repo stars note
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
SageAttention3 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Publish GitHub Repo stars note

Jiaming Tang

Meta Title Cover Publish Code Note
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
AWQ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration Publish GitHub Repo stars
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note

Jiaming Xu

Meta Title Cover Publish Code Note
m A Survey on Efficient Inference for Large Language Models cover Publish note
SpecEE SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting cover Publish GitHub Repo stars note

Jian Liang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Jianfei Chen

Meta Title Cover Publish Code Note
ActNN ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training Publish GitHub Repo stars
m Training Transformers with 4-bit Integers Publish GitHub Repo stars
m Accelerating Transformer Pre-training with 2:4 Sparsity Publish GitHub Repo stars note
m Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs cover Publish note
ReMoE ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing cover Publish GitHub Repo stars note
SageAttention2 SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization Publish GitHub Repo stars note
SageAttention SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration Publish GitHub Repo stars note
AdaptiveSparseTrainer Pruning Large Language Models with Semi-Structural Adaptive Sparse Training cover Publish GitHub Repo stars note
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
SageAttention3 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Publish GitHub Repo stars note

Jianfeng Gao

Meta Title Cover Publish Code Note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note

Jiangfei Duan

Meta Title Cover Publish Code Note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note

Jianxi Ye

Meta Title Cover Publish Code Note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Jianyong Wang

Meta Title Cover Publish Code Note
LeanK LeanK: Learnable K Cache Channel Pruning for Efficient Decoding cover Publish GitHub Repo stars note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note

Jianzhong Guo

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Jiaqi Ni

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Jiashi Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Jiawei Wang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Jie Zhou

Meta Title Cover Publish Code Note
DBudgetKV DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance cover Publish note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Jin Chen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Jin Fang

Meta Title Cover Publish Code Note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Jingchang Chen

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Jingyang Yuan

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Jintao Zhang

Meta Title Cover Publish Code Note
SageAttention2 SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization Publish GitHub Repo stars note
SageAttention SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration Publish GitHub Repo stars note
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
SageAttention3 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Publish GitHub Repo stars note

Joseph E. Gonzalez

Meta Title Cover Publish Code Note
ActNN ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training Publish GitHub Repo stars
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note
HashAttention HashAttention: Semantic Sparsity for Faster Inference cover Publish GitHub Repo stars note

Jun Zhu

Meta Title Cover Publish Code Note
m Training Transformers with 4-bit Integers Publish GitHub Repo stars
m Accelerating Transformer Pre-training with 2:4 Sparsity Publish GitHub Repo stars note
ReMoE ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing cover Publish GitHub Repo stars note
SageAttention2 SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization Publish GitHub Repo stars note
SageAttention SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration Publish GitHub Repo stars note
AdaptiveSparseTrainer Pruning Large Language Models with Semi-Structural Adaptive Sparse Training cover Publish GitHub Repo stars note
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
SageAttention3 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Publish GitHub Repo stars note

Junjie Qiu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Junlong Li

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Junxian Guo

Meta Title Cover Publish Code Note
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
XAttention XAttention: Block Sparse Attention with Antidiagonal Scoring cover Publish GitHub Repo stars note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note

Junxiao Song

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Junyang Lin

Meta Title Cover Publish Code Note
CateKV CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration cover Publish note
Qwen3 Qwen3 Technical Report cover Publish GitHub Repo stars note

Kai Dong

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Kai Hu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Kaige Gao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Kan Zhu

Meta Title Cover Publish Code Note
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
NanoFlow NanoFlow: Towards Optimal Large Language Model Serving Throughput cover Publish GitHub Repo stars note

Kang Guan

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Kang Zhao

Meta Title Cover Publish Code Note
m Accelerating Transformer Pre-training with 2:4 Sparsity Publish GitHub Repo stars note
m Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs cover Publish note
LinearPatch A Simple Linear Patch Revives Layer-Pruned Large Language Models cover Publish note

Ke Hong

Meta Title Cover Publish Code Note
m A Survey on Efficient Inference for Large Language Models cover Publish note
FlashOverlap FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation cover Publish GitHub Repo stars note

Kehong Yuan

Meta Title Cover Publish Code Note
KVSink KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs cover Publish note
Super-Experts-Profilling Unveiling Super Experts in Mixture-of-Experts Large Language Models cover Publish GitHub Repo stars note

Kexin Huang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Kuai Yu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Kurt Keutzer

Meta Title Cover Publish Code Note
FisherPruning A Fast Post-Training Pruning Framework for Transformers cover Publish GitHub Repo stars note
SqueezeLLM SqueezeLLM: Dense-and-Sparse Quantization cover Publish GitHub Repo stars note
KVQuant KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization Publish GitHub Repo stars note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note

Lean Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Lecong Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Lefei Zhang

Meta Title Cover Publish Code Note
SIFT Sparse is Enough in Fine-tuning Pre-trained Large Language Models Publish GitHub Repo stars note
SpindleKV SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers cover Publish GitHub Repo stars note

Lei Chen

Meta Title Cover Publish Code Note
m A Survey on Large Language Model Acceleration based on KV Cache Management cover Publish GitHub Repo stars note
AttentionPredictor AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference cover Publish note

Lei Xu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Leyi Xia

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Li Dong

Meta Title Cover Publish Code Note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Li-Wen Chang

Meta Title Cover Publish Code Note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Lian Liu

Meta Title Cover Publish Code Note
COMET COMET: Towards Practical W4A4KV4 LLMs Serving cover Publish note
SDS Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism cover Publish note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note

Liang Zhao

Meta Title Cover Publish Code Note
SparseLLM SparseLLM: Towards Global Pruning for Pre-trained Language Models Publish GitHub Repo stars note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Lianmin Zheng

Meta Title Cover Publish Code Note
ActNN ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training Publish GitHub Repo stars
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note

Lili Qiu

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note
LeanK LeanK: Learnable K Cache Channel Pruning for Efficient Decoding cover Publish GitHub Repo stars note

Litong Wang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Liyue Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Lu Hou

Meta Title Cover Publish Code Note
RIA Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models cover Publish GitHub Repo stars
LinearPatch A Simple Linear Patch Revives Layer-Pruned Large Language Models cover Publish note

Mao Yang

Meta Title Cover Publish Code Note
Compresso Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models cover Publish GitHub Repo stars note
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Maosong Sun

Meta Title Cover Publish Code Note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note
ReLU2 ReLU² Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
SparsingLaw Sparsing Law: Towards Large Language Models with Greater Activation Sparsity cover Publish GitHub Repo stars note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Marcos Treviso

Meta Title Cover Publish Code Note
m Efficient Methods for Natural Language Processing: A Survey cover Publish
AdaSplash AdaSplash: Adaptive Sparse Flash Attention Publish GitHub Repo stars note

Mark Kurtz

Meta Title Cover Publish Code Note
m Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference Publish
m Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment Publish GitHub Repo stars note

Mehrdad Farajtabar

Meta Title Cover Publish Code Note
LLM in a flash LLM in a flash: Efficient Large Language Model Inference with Limited Memory cover Publish note
ReLU Strikes Back ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models cover Publish GitHub Repo stars

Meng Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Mengdi Wang

Meta Title Cover Publish Code Note
COMET COMET: Towards Practical W4A4KV4 LLMs Serving cover Publish note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note

Miaojun Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Michael Goin

Meta Title Cover Publish Code Note
SquareHead Sparse Fine-tuning for Inference Acceleration of Large Language Models cover Publish GitHub Repo stars
m Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment Publish GitHub Repo stars note

Michael Hassid

Meta Title Cover Publish Code Note
m Efficient Methods for Natural Language Processing: A Survey cover Publish
TOVA Transformers are Multi-State RNNs cover Publish GitHub Repo stars note

Michael W. Mahoney

Meta Title Cover Publish Code Note
ActNN ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training Publish GitHub Repo stars
FisherPruning A Fast Post-Training Pruning Framework for Transformers cover Publish GitHub Repo stars note
SqueezeLLM SqueezeLLM: Dense-and-Sparse Quantization cover Publish GitHub Repo stars note
KVQuant KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization Publish GitHub Repo stars note

Mingchuan Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Minghua Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Minghui Tang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Mingjie Sun

Meta Title Cover Publish Code Note
GBLM-Pruner Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models cover Publish GitHub Repo stars note
Wanda A Simple and Effective Pruning Approach for Large Language Models cover Publish GitHub Repo stars note
massive-activations Massive Activations in Large Language Models cover Publish GitHub Repo stars note

Mingming Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Minmin Sun

Meta Title Cover Publish Code Note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note
CateKV CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration cover Publish note

Ning Tian

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ningxin Zheng

Meta Title Cover Publish Code Note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Nipun Kwatra

Meta Title Cover Publish Code Note
Vidur Vidur: A Large-Scale Simulation Framework For LLM Inference cover Publish GitHub Repo stars note
TokenWeave TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference cover Publish GitHub Repo stars note

Panpan Huang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Pavlo Molchanov

Meta Title Cover Publish Code Note
MaskLLM MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models cover Publish GitHub Repo stars note
Minitron Compact Language Models via Pruning and Knowledge Distillation cover Publish GitHub Repo stars note

Peiyi Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Peng Sun

Meta Title Cover Publish Code Note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note
0VRXJQ3F Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving cover Publish GitHub Repo stars note

Peng Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Pengcheng He

Meta Title Cover Publish Code Note
AdaLoRA AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning cover Publish GitHub Repo stars
LoftQ LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models cover Publish GitHub Repo stars note

Pengfei Zuo

Meta Title Cover Publish Code Note
CachedAttention Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention cover Publish note
AdaSkip AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference cover Publish GitHub Repo stars note
Adrenaline Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation cover Publish GitHub Repo stars note
PSA Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving cover Publish GitHub Repo stars note

Pengle Zhang

Meta Title Cover Publish Code Note
SageAttention2 SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization Publish GitHub Repo stars note
SageAttention SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration Publish GitHub Repo stars note
SageAttention3 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Publish GitHub Repo stars note

Qi Hou

Meta Title Cover Publish Code Note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Qianchao Zhu

Meta Title Cover Publish Code Note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note

Qiancheng Wang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Qianhui Wu

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note

Qihao Zhu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Qinyu Chen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note
DeltaLLM DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference cover Publish note

Qiushi Du

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

R. J. Chen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

R. L. Jin

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ramachandran Ramjee

Meta Title Cover Publish Code Note
Vidur Vidur: A Large-Scale Simulation Framework For LLM Inference cover Publish GitHub Repo stars note
POD-Attention POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference cover Publish GitHub Repo stars note
vAttention vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention cover Publish GitHub Repo stars note
TokenWeave TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference cover Publish GitHub Repo stars note

Ramya Prabhu

Meta Title Cover Publish Code Note
POD-Attention POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference cover Publish GitHub Repo stars note
vAttention vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention cover Publish GitHub Repo stars note

Rayan Saab

Meta Title Cover Publish Code Note
GPFQ A Greedy Algorithm for Quantizing Neural Networks Publish GitHub Repo stars
GPFQv2 Post-training Quantization for Neural Networks with Provable Guarantees Publish GitHub Repo stars

Roy Schwartz

Meta Title Cover Publish Code Note
m Efficient Methods for Natural Language Processing: A Survey cover Publish
TOVA Transformers are Multi-State RNNs cover Publish GitHub Repo stars note

Ruihang Lai

Meta Title Cover Publish Code Note
XGrammar XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note

Ruiqi Ge

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ruisong Zhang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ruizhe Pan

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Runji Wang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Runxin Xu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ruoyu Zhang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ruyi Chen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

S. S. Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Sangmin Bae

Meta Title Cover Publish Code Note
RecursiveTransformers Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA cover Publish note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note

Saurav Muralidharan

Meta Title Cover Publish Code Note
MaskLLM MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models cover Publish GitHub Repo stars note
Minitron Compact Language Models via Pruning and Knowledge Distillation cover Publish GitHub Repo stars note

Sean Lie

Meta Title Cover Publish Code Note
Sparse-IFT Sparse Iso-FLOP Transformations for Maximizing Training Efficiency Publish GitHub Repo stars
Sparse-IFT Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency cover Publish GitHub Repo stars note
m Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment Publish GitHub Repo stars note

Sehoon Kim

Meta Title Cover Publish Code Note
FisherPruning A Fast Post-Training Pruning Framework for Transformers cover Publish GitHub Repo stars note
SqueezeLLM SqueezeLLM: Dense-and-Sparse Quantization cover Publish GitHub Repo stars note
KVQuant KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization Publish GitHub Repo stars note

Shang Yang

Meta Title Cover Publish Code Note
AWQ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration Publish GitHub Repo stars
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
QServe QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Publish Pytorch note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note

Shanghao Lu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shangyan Zhou

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shanhuang Chen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shaoqing Wu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shengen Yan

Meta Title Cover Publish Code Note
m A Survey on Efficient Inference for Large Language Models cover Publish note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note

Shengfeng Ye

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shijie Cao

Meta Title Cover Publish Code Note
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Shirong Ma

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shiwei Liu

Meta Title Cover Publish Code Note
m Ten Lessons We Have Learned in the New Sparseland: A Short Handbook for Sparse Neural Network Researchers Publish
Essential Sparsity The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter Publish GitHub Repo stars
DSnoT Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs cover Publish GitHub Repo stars note
OWL Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity cover Publish GitHub Repo stars

Shiyao Li

Meta Title Cover Publish Code Note
m A Survey on Efficient Inference for Large Language Models cover Publish note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note

Shiyu Chang

Meta Title Cover Publish Code Note
IFPruning Instruction-Following Pruning for Large Language Models cover Publish note
KVLink KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse cover Publish GitHub Repo stars note

Shiyu Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shreyas Saxena

Meta Title Cover Publish Code Note
SPDF SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models Publish
Sparse-IFT Sparse Iso-FLOP Transformations for Maximizing Training Efficiency Publish GitHub Repo stars
Sparse-IFT Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency cover Publish GitHub Repo stars note

Shuang Zhou

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shuiping Yu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shunfeng Zhou

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Shuo Yang

Meta Title Cover Publish Code Note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note
HashAttention HashAttention: Semantic Sparsity for Faster Inference cover Publish GitHub Repo stars note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note

Shuting Pan

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Size Zheng

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Song Han

Meta Title Cover Publish Code Note
Deep Compression Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding Publish
DSD DSD: Dense-Sparse-Dense Training for Deep Neural Networks Publish
SparseViT SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer cover Publish GitHub Repo stars note
TorchSparse++ TorchSparse++: Efficient Point Cloud Engine Publish GitHub Repo stars
streaming-llm Efficient Streaming Language Models with Attention Sinks cover Publish GitHub Repo stars note
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
AWQ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration Publish GitHub Repo stars
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
QServe QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Publish Pytorch note
XAttention XAttention: Block Sparse Attention with Antidiagonal Scoring cover Publish GitHub Repo stars note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note

Stephanie Wang

Meta Title Cover Publish Code Note
NanoFlow NanoFlow: Towards Optimal Large Language Model Serving Throughput cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note

Surin Ahn

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note

T. Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Tal Schuster

Meta Title Cover Publish Code Note
RecursiveTransformers Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA cover Publish note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note

Tao Xie

Meta Title Cover Publish Code Note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note
RaaS Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity cover Publish note

Tao Yuan

Meta Title Cover Publish Code Note
m Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs cover Publish note
LinearPatch A Simple Linear Patch Revives Layer-Pruned Large Language Models cover Publish note

Tao Yun

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Tian Pei

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Tianle Cai

Meta Title Cover Publish Code Note
SnapKV SnapKV: LLM Knows What You are Looking for Before Generation cover Publish GitHub Repo stars note
TEAL Training-Free Activation Sparsity in Large Language Models cover Publish GitHub Repo stars note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note

Tianlong Chen

Meta Title Cover Publish Code Note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
m Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark Publish GitHub Repo stars note

Tianqi Chen

Meta Title Cover Publish Code Note
XGrammar XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note

Tianqi Wu

Meta Title Cover Publish Code Note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
FlashOverlap FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation cover Publish GitHub Repo stars note

Tianyu Fu

Meta Title Cover Publish Code Note
m A Survey on Efficient Inference for Large Language Models cover Publish note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note

Tianyu Gao

Meta Title Cover Publish Code Note
MeZO Fine-Tuning Language Models with Just Forward Passes Publish GitHub Repo stars note
LLM-shearing Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning cover Publish GitHub Repo stars note

Tianyu Sun

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Tianzhu Ye

Meta Title Cover Publish Code Note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Tim Dettmers

Meta Title Cover Publish Code Note
QLoRA QLoRA: Efficient Finetuning of Quantized LLMs cover Publish GitHub Repo stars
SpQR SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression Publish GitHub Repo stars

Ting Cao

Meta Title Cover Publish Code Note
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Tong Yang

Meta Title Cover Publish Code Note
HATA HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference cover Publish GitHub Repo stars note
KeepKV KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference cover Publish note

Torsten Hoefler

Meta Title Cover Publish Code Note
m Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks Publish
VENOM VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores cover Publish GitHub Repo stars note
SliceGPT SliceGPT: Compress Large Language Models by Deleting Rows and Columns cover Publish GitHub Repo stars note

Tri Dao

Meta Title Cover Publish Code Note
FlashAttention FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness cover Publish GitHub Repo stars
Flash-Decoding Flash-Decoding for long-context inference Publish note
FlashAttention-2 FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning Publish GitHub Repo stars
GLA Hardware-Efficient Attention for Fast Decoding cover Publish GitHub Repo stars note

Tuo Zhao

Meta Title Cover Publish Code Note
AdaLoRA AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning cover Publish GitHub Repo stars
LoSparse Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation cover Publish GitHub Repo stars
LoftQ LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models cover Publish GitHub Repo stars note

Vithursan Thangarasa

Meta Title Cover Publish Code Note
SPDF SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models Publish
Sparse-IFT Sparse Iso-FLOP Transformations for Maximizing Training Efficiency Publish GitHub Repo stars
Sparse-IFT Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency cover Publish GitHub Repo stars note

W. L. Xiao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Wangding Zeng

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Wanjia Zhao

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Wei An

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Wei Lin

Meta Title Cover Publish Code Note
Flash-LLM Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity cover Publish GitHub Repo stars note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note

Wei Wang

Meta Title Cover Publish Code Note
BRECQ BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction Publish GitHub Repo stars
PowerAttention PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention cover Publish note

Weilin Zhao

Meta Title Cover Publish Code Note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Weiyu Huang

Meta Title Cover Publish Code Note
m Accelerating Transformer Pre-training with 2:4 Sparsity Publish GitHub Repo stars note
AdaptiveSparseTrainer Pruning Large Language Models with Semi-Structural Adaptive Sparse Training cover Publish GitHub Repo stars note

Weizhu Chen

Meta Title Cover Publish Code Note
LoRA LoRA: Low-rank adaptation of large language models cover Publish GitHub Repo stars
AdaLoRA AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning cover Publish GitHub Repo stars
LoftQ LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models cover Publish GitHub Repo stars note

Wen Liu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Wenfeng Liang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Wenjun Gao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Wenlei Bao

Meta Title Cover Publish Code Note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Wenqin Yu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Wentao Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Woosuk Kwon

Meta Title Cover Publish Code Note
FisherPruning A Fast Post-Training Pruning Framework for Transformers cover Publish GitHub Repo stars note
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
APEX APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving Publish GitHub Repo stars note

Wulong Liu

Meta Title Cover Publish Code Note
AttentionPredictor AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference cover Publish note
PanguUltra Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs Publish note

X. Q. Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiafei Qiu

Meta Title Cover Publish Code Note
Flash-LLM Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity cover Publish GitHub Repo stars note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note

Xiandong Zhao

Meta Title Cover Publish Code Note
SDS Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism cover Publish note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note

Xiang Liu

Meta Title Cover Publish Code Note
LISA LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning Publish note
ChunkKV ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference cover Publish note

Xiangyu Zhang

Meta Title Cover Publish Code Note
MFA Multi-matrix Factorization Attention Publish note
Step-3 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding Publish note

Xiangyue Jin

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xianzhi Yu

Meta Title Cover Publish Code Note
LinearPatch A Simple Linear Patch Revives Layer-Pruned Large Language Models cover Publish note
AttentionPredictor AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference cover Publish note

Xianzu Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiao Bi

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiaodong Liu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiaohan Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiaojin Shen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiaokang Chen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiaokang Zhang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiaosha Chen

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiaotao Nie

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiaowei Li

Meta Title Cover Publish Code Note
COMET COMET: Towards Practical W4A4KV4 LLMs Serving cover Publish note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note

Xiaowen Sun

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiaoxiang Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xin Chen

Meta Title Cover Publish Code Note
QA-LoRA QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models cover Publish GitHub Repo stars note
LaRoSA La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation cover Publish note

Xin Cheng

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xin Jin

Meta Title Cover Publish Code Note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note

Xin Liu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note
KeepKV KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference cover Publish note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Xin Xie

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xinchao Wang

Meta Title Cover Publish Code Note
LLM-Pruner LLM-Pruner: On the Structural Pruning of Large Language Models cover Publish GitHub Repo stars note
MaskLLM MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models cover Publish GitHub Repo stars note

Xingchao Liu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xingcheng Zhang

Meta Title Cover Publish Code Note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note

Xingkai Yu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xinnan Song

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xinxia Shan

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xinyi Zhou

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xinyu Yang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xinyu Zhou

Meta Title Cover Publish Code Note
MoBA MoBA: Mixture of Block Attention for Long-Context LLMs cover Publish GitHub Repo stars note
0VRXJQ3F Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving cover Publish GitHub Repo stars note

Xinyuan Li

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xiuhong Li

Meta Title Cover Publish Code Note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note
m A Survey on Efficient Inference for Large Language Models cover Publish note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note
FlashOverlap FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation cover Publish GitHub Repo stars note

Xu Han

Meta Title Cover Publish Code Note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
SparsingLaw Sparsing Law: Towards Large Language Models with Greater Activation Sparsity cover Publish GitHub Repo stars note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Xu Owen He

Meta Title Cover Publish Code Note
FoX Forgetting Transformer: Softmax Attention with a Forget Gate Publish GitHub Repo stars note
ACP Adaptive Computation Pruning for the Forgetting Transformer Publish GitHub Repo stars note

Xuecheng Su

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Xuefei Ning

Meta Title Cover Publish Code Note
m A Survey on Efficient Inference for Large Language Models cover Publish note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note

Xuegui Zheng

Meta Title Cover Publish Code Note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Xufang Luo

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note

Xuheng Lin

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Y. K. Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Y. Q. Wang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Y. Wu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note

Y. X. Wei

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Y. X. Zhu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yang Li

Meta Title Cover Publish Code Note
CodeGeeX CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X Publish GitHub Repo stars note
ChunkAttention ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition cover Publish GitHub Repo stars note

Yang Zhang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yanhong Xu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish note

Yankai Lin

Meta Title Cover Publish Code Note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Yann Le Cun

Meta Title Cover Publish Code Note
OBD Optimal Brain Damage Publish

Yanping Huang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yao Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yao Zhao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yaofeng Sun

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yaohui Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yaohui Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yehui Tang

Meta Title Cover Publish Code Note
SlimLLM SlimLLM: Accurate Structured Pruning for Large Language Models Publish note
PanguUltra Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs Publish note

Yi Yu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yi Zheng

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yichao Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yifan Shi

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yikai Zhang

Meta Title Cover Publish Code Note
PowerAttention PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention cover Publish note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note

Yiliang Xiong

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yilong Zhao

Meta Title Cover Publish Code Note
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
XGrammar XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models cover Publish GitHub Repo stars note
NanoFlow NanoFlow: Towards Optimal Large Language Model Serving Throughput cover Publish GitHub Repo stars note

Ying He

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ying Sheng

Meta Title Cover Publish Code Note
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note

Ying Tang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yingfa Chen

Meta Title Cover Publish Code Note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
SparsingLaw Sparsing Law: Towards Large Language Models with Greater Activation Sparsity cover Publish GitHub Repo stars note

Yinhe Han

Meta Title Cover Publish Code Note
COMET COMET: Towards Practical W4A4KV4 LLMs Serving cover Publish note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note

Yishi Piao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yisong Wang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yiwu Yao

Meta Title Cover Publish Code Note
DSnoT Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs cover Publish GitHub Repo stars note
AmberPruner Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models Publish note

Yixiao Li

Meta Title Cover Publish Code Note
LoSparse Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation cover Publish GitHub Repo stars
LoftQ LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models cover Publish GitHub Repo stars note

Yixin Dong

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
XGrammar XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models cover Publish GitHub Repo stars note

Yixin Song

Meta Title Cover Publish Code Note
PowerInfer PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU Publish GitHub Repo stars note
PowerInfer-2 PowerInfer-2: Fast Large Language Model Inference on a Smartphone Publish Website note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
Turbo Sparse Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters Publish Pytorch note

Yixuan Tan

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yiyang Ma

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yiyuan Liu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yiyuan Ma

Meta Title Cover Publish Code Note
FlexPrefill FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference cover Publish GitHub Repo stars note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note

Yizhao Gao

Meta Title Cover Publish Code Note
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Yong Li

Meta Title Cover Publish Code Note
Flash-LLM Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity cover Publish GitHub Repo stars note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note
CateKV CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration cover Publish note

Yongqiang Guo

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yu Cheng

Meta Title Cover Publish Code Note
AdaLoRA AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning cover Publish GitHub Repo stars
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note
Awesome-Efficient-Arch Speed Always Wins: A Survey on Efficient Architectures for Large Language Models cover Publish GitHub Repo stars note

Yu Wang

Meta Title Cover Publish Code Note
m A Survey on Efficient Inference for Large Language Models cover Publish note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
FlashOverlap FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation cover Publish GitHub Repo stars note

Yu Wu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yuan Ou

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yuandong Tian

Meta Title Cover Publish Code Note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
streaming-llm Efficient Streaming Language Models with Attention Sinks cover Publish GitHub Repo stars note
R-Sparse R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference cover Publish GitHub Repo stars note

Yuchen Zhu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yucheng Li

Meta Title Cover Publish Code Note
Selective Context Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering cover Publish GitHub Repo stars
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note

Yuduan Wang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yue Gong

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yuezhou Hu

Meta Title Cover Publish Code Note
m Accelerating Transformer Pre-training with 2:4 Sparsity Publish GitHub Repo stars note
AdaptiveSparseTrainer Pruning Large Language Models with Semi-Structural Adaptive Sparse Training cover Publish GitHub Repo stars note

Yuheng Zou

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yuhui Xu

Meta Title Cover Publish Code Note
QA-LoRA QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models cover Publish GitHub Repo stars note
SPP SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models cover Publish GitHub Repo stars note

Yujia He

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yujun Lin

Meta Title Cover Publish Code Note
QServe QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Publish Pytorch note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note
RadialAttention Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation Publish note

Yukun Zha

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yulhwa Kim

Meta Title Cover Publish Code Note
L4Q L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ cover Publish note
FastKV FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation cover Publish GitHub Repo stars note

Yunfan Xiong

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yunhe Wang

Meta Title Cover Publish Code Note
SlimLLM SlimLLM: Accurate Structured Pruning for Large Language Models Publish note
PanguUltra Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs Publish note

Yunxian Ma

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yuqing Xia

Meta Title Cover Publish Code Note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Yuqing Yang

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note
LeanK LeanK: Learnable K Cache Channel Pruning for Efficient Decoding cover Publish GitHub Repo stars note

Yutao Sun

Meta Title Cover Publish Code Note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Yuting Yan

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yuxiang Luo

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yuxiang You

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yuxin Wu

Meta Title Cover Publish Code Note
AVSS AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis cover Publish note
MoBA MoBA: Mixture of Block Attention for Long-Context LLMs cover Publish GitHub Repo stars note

Yuxiong He

Meta Title Cover Publish Code Note
ZeroQuant ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers Publish GitHub Repo stars
ZeroQuant-V2 ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation Publish GitHub Repo stars

Yuxuan Li

Meta Title Cover Publish Code Note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Yuxuan Liu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Yuyang Zhou

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Z. F. Wu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Z. Z. Ren

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zefan Cai

Meta Title Cover Publish Code Note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note
KVCache-Factory Unified KV Cache Compression Methods for Auto-Regressive Models Publish GitHub Repo stars note

Zehui Ren

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zeyu Mi

Meta Title Cover Publish Code Note
PowerInfer PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU Publish GitHub Repo stars note
PowerInfer-2 PowerInfer-2: Fast Large Language Model Inference on a Smartphone Publish Website note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
Turbo Sparse Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters Publish Pytorch note

Zhangli Sha

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhangyang Wang

Meta Title Cover Publish Code Note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
m Ten Lessons We Have Learned in the New Sparseland: A Short Handbook for Sparse Neural Network Researchers Publish
Essential Sparsity The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter Publish GitHub Repo stars
LLM-KICK Compressing LLMs: The Truth is Rarely Pure and Never Simple cover Publish GitHub Repo stars note
OWL Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity cover Publish GitHub Repo stars
m Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark Publish GitHub Repo stars note
R-Sparse R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference cover Publish GitHub Repo stars note

Zhe Fu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhean Xu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhen Dong

Meta Title Cover Publish Code Note
SqueezeLLM SqueezeLLM: Dense-and-Sparse Quantization cover Publish GitHub Repo stars note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note

Zhen Huang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhen Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhenda Xie

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhengyan Zhang

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
Turbo Sparse Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters Publish Pytorch note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhenyu Zhang

Meta Title Cover Publish Code Note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
OWL Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity cover Publish GitHub Repo stars
R-Sparse R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference cover Publish GitHub Repo stars note

Zhewei Yao

Meta Title Cover Publish Code Note
ActNN ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training Publish GitHub Repo stars
ZeroQuant ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers Publish GitHub Repo stars
ZeroQuant-V2 ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation Publish GitHub Repo stars

Zhewen Hao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhibin Gou

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhicheng Ma

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhigang Yan

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhihang Yuan

Meta Title Cover Publish Code Note
RPTQ RPTQ: Reorder-based Post-training Quantization for Large Language Models Publish GitHub Repo stars note
m A Survey on Efficient Inference for Large Language Models cover Publish note

Zhihong Shao

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhilin Yang

Meta Title Cover Publish Code Note
CodeGeeX CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X Publish GitHub Repo stars note
MoBA MoBA: Mixture of Block Attention for Long-Context LLMs cover Publish GitHub Repo stars note

Zhipeng Xu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhixuan Lin

Meta Title Cover Publish Code Note
FoX Forgetting Transformer: Softmax Attention with a Forget Gate Publish GitHub Repo stars note
ACP Adaptive Computation Pruning for the Forgetting Transformer Publish GitHub Repo stars note

Zhiyu Wu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhiyuan Liu

Meta Title Cover Publish Code Note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
SparsingLaw Sparsing Law: Towards Large Language Models with Greater Activation Sparsity cover Publish GitHub Repo stars note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Zhongyu Zhang

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zhou Yu

Meta Title Cover Publish Code Note
CachedAttention Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention cover Publish note
Adrenaline Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation cover Publish GitHub Repo stars note

Zhuang Liu

Meta Title Cover Publish Code Note
Wanda A Simple and Effective Pruning Approach for Large Language Models cover Publish GitHub Repo stars note
massive-activations Massive Activations in Large Language Models cover Publish GitHub Repo stars note

Zhuomin He

Meta Title Cover Publish Code Note
CachedAttention Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention cover Publish note
AdaSkip AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference cover Publish GitHub Repo stars note

Zhuoshu Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zihan Wang

Meta Title Cover Publish Code Note
CodeGeeX CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X Publish GitHub Repo stars note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
KeepKV KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference cover Publish note

Zihao Ye

Meta Title Cover Publish Code Note
NanoFlow NanoFlow: Towards Optimal Large Language Model Serving Throughput cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note

Ziheng Jiang

Meta Title Cover Publish Code Note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

Zihui Gu

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zijia Zhu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zijun Liu

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zili Wang

Meta Title Cover Publish Code Note
MFA Multi-matrix Factorization Attention Publish note
Step-3 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding Publish note

Zilin Li

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ziqing Yang

Meta Title Cover Publish Code Note
TextPruner TextPruner: A Model Pruning Toolkit for Pre-Trained Language Models cover Publish GitHub Repo stars
GRAIN Gradient-based Intra-attention Pruning on Pre-trained Language Models cover Publish GitHub Repo stars note

Ziwei Ji

Meta Title Cover Publish Code Note
RecursiveTransformers Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA cover Publish note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note

Ziwei Xie

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zixiao Huang

Meta Title Cover Publish Code Note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
FlashOverlap FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation cover Publish GitHub Repo stars note

Zixuan Zhou

Meta Title Cover Publish Code Note
m A Survey on Efficient Inference for Large Language Models cover Publish note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Ziyang Song

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Ziyi Gao

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zizheng Pan

Meta Title Cover Publish Code Note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

Zuchao Li

Meta Title Cover Publish Code Note
m Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption cover Publish GitHub Repo stars note
SIFT Sparse is Enough in Fine-tuning Pre-trained Large Language Models Publish GitHub Repo stars note
SpindleKV SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers cover Publish GitHub Repo stars note

Zunhai Su

Meta Title Cover Publish Code Note
KVSink KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs cover Publish note
Super-Experts-Profilling Unveiling Super Experts in Mixture-of-Experts Large Language Models cover Publish GitHub Repo stars note