Institution

AWS AI Labs

Meta Title Cover Publish Code Note
m Structural Pruning of Large Language Models via Neural Architecture Search cover Publish GitHub Repo stars

Advanced Micro Devices

Meta Title Cover Publish Code Note
T3 T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives cover Publish note
SDS Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism cover Publish note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note

Alibaba Cloud

Meta Title Cover Publish Code Note
07NWF4VE Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching cover Publish note

Alibaba Group

Meta Title Cover Publish Code Note
SlimGPT SlimGPT: Layer-wise Structured Pruning for Large Language Models cover Publish note
Flash-LLM Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity cover Publish GitHub Repo stars note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note
CateKV CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration cover Publish note
LaRoSA La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation cover Publish note
Cus-Prun Pruning General Large Language Models into Customized Expert Models cover Publish GitHub Repo stars note

Apple

Meta Title Cover Publish Code Note
LLM in a flash LLM in a flash: Efficient Large Language Model Inference with Limited Memory cover Publish note
LLM-KICK Compressing LLMs: The Truth is Rarely Pure and Never Simple cover Publish GitHub Repo stars note
ReLU Strikes Back ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models cover Publish GitHub Repo stars
IFPruning Instruction-Following Pruning for Large Language Models cover Publish note

Beihang University

Meta Title Cover Publish Code Note
SMP Pruning Pre-trained Language Models Without Fine-Tuning cover Publish GitHub Repo stars

ByteDance

Meta Title Cover Publish Code Note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
FlexPrefill FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference cover Publish GitHub Repo stars note
PoD Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity cover Publish note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
KeepKV KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference cover Publish note
52A7RO95 Mixture of Experts in Large Language Models cover Publish note
PowerAttention PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention cover Publish note

ByteDance Seed

Meta Title Cover Publish Code Note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note
TileLink TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives cover Publish GitHub Repo stars note
Triton-distributed Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler cover Publish GitHub Repo stars note

CMU

Meta Title Cover Publish Code Note
massive-activations Massive Activations in Large Language Models cover Publish GitHub Repo stars note

CPII under InnoHK

Meta Title Cover Publish Code Note
SPP SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models cover Publish GitHub Repo stars note

Carnegie Mellon University

Meta Title Cover Publish Code Note
GBLM-Pruner Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models cover Publish GitHub Repo stars note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
Wanda A Simple and Effective Pruning Approach for Large Language Models cover Publish GitHub Repo stars note
streaming-llm Efficient Streaming Language Models with Attention Sinks cover Publish GitHub Repo stars note
Bonsai Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes Publish GitHub Repo stars
XGrammar XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models cover Publish GitHub Repo stars note
ShadowKV ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note

CentML

Meta Title Cover Publish Code Note
Seesaw Seesaw: High-throughput LLM Inference via Model Re-sharding cover Publish note

Center for Advanced AI

Meta Title Cover Publish Code Note
KVLink KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse cover Publish GitHub Repo stars note

Central South University

Meta Title Cover Publish Code Note
07NWF4VE Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching cover Publish note

Cerebras Systems

Meta Title Cover Publish Code Note
SPDF SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models Publish
Sparse-IFT Sparse Iso-FLOP Transformations for Maximizing Training Efficiency Publish GitHub Repo stars
m Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment Publish GitHub Repo stars note

Chinese Academy of Sciences

Meta Title Cover Publish Code Note
CoCoNet Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads cover Publish GitHub Repo stars note

Chinese University of Hong Kong

Meta Title Cover Publish Code Note
068ZPAME A Survey on Inference Optimization Techniques for Mixture of Experts Models cover Publish GitHub Repo stars note

Chongqing University

Meta Title Cover Publish Code Note
GBDT Pruning Large Language Models via Accuracy Predictor cover Publish

City University of Hong Kong

Meta Title Cover Publish Code Note
SMAT Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts cover Publish GitHub Repo stars note
CHESS Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification Publish Pytorch note

Cohere

Meta Title Cover Publish Code Note
SnapKV SnapKV: LLM Knows What You are Looking for Before Generation cover Publish GitHub Repo stars note
sparse-frontier The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs cover Publish GitHub Repo stars note

Comenius University

Meta Title Cover Publish Code Note
ADMM-pruning Fast and Effective Weight Update for Pruned Large Language Models Publish GitHub Repo stars note

Computer Network Information Center, Chinese Academy of Sciences

Meta Title Cover Publish Code Note
Acc-SpMM Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores cover Publish note

Cornell University

Meta Title Cover Publish Code Note
Movement Pruning Movement Pruning: Adaptive Sparsity by Fine-Tuning cover Publish GitHub Repo stars
QuIP QuIP: Quantization with Incoherence Processing Publish GitHub Repo stars
Recycled Attention Recycled Attention: Efficient inference for long-context language models cover Publish GitHub Repo stars note
ShadowLLM ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models cover Publish GitHub Repo stars note

DENSO IT Lab

Meta Title Cover Publish Code Note
SAS SAS: Structured Activation Sparsification cover Publish GitHub Repo stars note

DeepAuto.ai

Meta Title Cover Publish Code Note
DeltaAttention Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction cover Publish note

DeepMind

Meta Title Cover Publish Code Note
m Fast Sparse ConvNets Publish GitHub Repo stars

DeepSeek-AI

Meta Title Cover Publish Code Note
DeepSeek-V2 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cover Publish GitHub Repo stars note
DeepSeek-V3 DeepSeek-V3 Technical Report cover Publish GitHub Repo stars note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning cover Publish GitHub Repo stars note

DeepSpeed

Meta Title Cover Publish Code Note
Domino Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping cover Publish GitHub Repo stars note

Delft University of Technology

Meta Title Cover Publish Code Note
DeltaLLM DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference cover Publish note

Duke University

Meta Title Cover Publish Code Note
CoreInfer CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation cover Publish GitHub Repo stars note

ETH Zurich

Meta Title Cover Publish Code Note
VENOM VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores cover Publish GitHub Repo stars note
SliceGPT SliceGPT: Compress Large Language Models by Deleting Rows and Columns cover Publish GitHub Repo stars note

Eindhoven University of Technology

Meta Title Cover Publish Code Note
OWL Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity cover Publish GitHub Repo stars

Emory University

Meta Title Cover Publish Code Note
SparseLLM SparseLLM: Towards Global Pruning for Pre-trained Language Models Publish GitHub Repo stars note

FAIR

Meta Title Cover Publish Code Note
TOVA Transformers are Multi-State RNNs cover Publish GitHub Repo stars note

Fairleigh Dickinson University

Meta Title Cover Publish Code Note
MIRAGE MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving Publish note

Fudan University

Meta Title Cover Publish Code Note
MFA Multi-matrix Factorization Attention Publish note
ReAttention ReAttention: Training-Free Infinite Context with Finite Attention Scope cover Publish GitHub Repo stars note
CateKV CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration cover Publish note
PowerAttention PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention cover Publish note

Gaoling School of Artificial Intelligence, Renmin University of China

Meta Title Cover Publish Code Note
m A Survey on Model Compression for Large Language Models cover Publish

Georgia Institute of Technology

Meta Title Cover Publish Code Note
AdaLoRA AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning cover Publish GitHub Repo stars
LoftQ LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models cover Publish GitHub Repo stars note
1DZIJVBI Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications cover Publish note

Google

Meta Title Cover Publish Code Note
m Fast Sparse ConvNets Publish GitHub Repo stars
Sprint Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation Publish
XLA Overlap Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models cover Publish note
m The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers Publish
KCM Gradient-Free Structured Pruning with Unlabeled Data cover Publish
ShadowLLM ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models cover Publish GitHub Repo stars note

Google Cloud

Meta Title Cover Publish Code Note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note

Google DeepMind

Meta Title Cover Publish Code Note
MoD Mixture-of-Depths: Dynamically allocating compute in transformer-based language models cover Publish note
RecursiveTransformers Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA cover Publish note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note

Google Research

Meta Title Cover Publish Code Note
FrameQuant FrameQuant: Flexible Low-Bit Quantization for Transformers cover Publish GitHub Repo stars note
OSSCAR OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization Publish GitHub Repo stars note
OWL Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity cover Publish GitHub Repo stars
Bonsai Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes Publish GitHub Repo stars
RecursiveTransformers Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA cover Publish note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note

Graphcore Research

Meta Title Cover Publish Code Note
SparQ SparQ Attention: Bandwidth-Efficient LLM Inference Publish note

HKUST

Meta Title Cover Publish Code Note
Awesome-Efficient-Arch Speed Always Wins: A Survey on Efficient Architectures for Large Language Models cover Publish GitHub Repo stars note

Habana Labs

Meta Title Cover Publish Code Note
MVUE Minimum Variance Unbiased N:M Sparsity for the Neural Gradients Publish

Harbin Institute of Technology

Meta Title Cover Publish Code Note
TextPruner TextPruner: A Model Pruning Toolkit for Pre-Trained Language Models cover Publish GitHub Repo stars
GRAIN Gradient-based Intra-attention Pruning on Pre-trained Language Models cover Publish GitHub Repo stars note
ZigZagKV ZigZagKV: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty cover Publish note

Harvard University

Meta Title Cover Publish Code Note
SMAT Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts cover Publish GitHub Repo stars note

Heriot-Watt University

Meta Title Cover Publish Code Note
52A7RO95 Mixture of Experts in Large Language Models cover Publish note

Hong Kong University of Science and Technology

Meta Title Cover Publish Code Note
PWGG5HBE A Survey on Large Language Model Acceleration based on KV Cache Management cover Publish GitHub Repo stars note

Houmo AI

Meta Title Cover Publish Code Note
RPTQ RPTQ: Reorder-based Post-training Quantization for Large Language Models Publish GitHub Repo stars note

Huawei

Meta Title Cover Publish Code Note
CodeGeeX CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X Publish GitHub Repo stars note
QA-LoRA QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models cover Publish GitHub Repo stars note
AmberPruner Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models Publish note
PanguUltra Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs Publish note

Huawei Cloud

Meta Title Cover Publish Code Note
CachedAttention Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention cover Publish note
AdaSkip AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference cover Publish GitHub Repo stars note
RaaS Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity cover Publish note
Adrenaline Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation cover Publish GitHub Repo stars note
PSA Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving cover Publish GitHub Repo stars note

Huawei Noah's Ark Lab

Meta Title Cover Publish Code Note
SIMPLE Structured Pruning for Efficient Generative Pre-trained Language Models cover Publish note
RIA Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models cover Publish GitHub Repo stars
m Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs cover Publish note
ReAttention ReAttention: Training-Free Infinite Context with Finite Attention Scope cover Publish GitHub Repo stars note
SlimLLM SlimLLM: Accurate Structured Pruning for Large Language Models Publish note
LinearPatch A Simple Linear Patch Revives Layer-Pruned Large Language Models cover Publish note
FreqKV FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension cover Publish note

Huawei Technologies

Meta Title Cover Publish Code Note
DSnoT Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs cover Publish GitHub Repo stars note
HATA HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference cover Publish GitHub Repo stars note

Huazhong University of Science and Technology

Meta Title Cover Publish Code Note
PWGG5HBE A Survey on Large Language Model Acceleration based on KV Cache Management cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Hugging Face

Meta Title Cover Publish Code Note
Movement Pruning Movement Pruning: Adaptive Sparsity by Fine-Tuning cover Publish GitHub Repo stars

IST

Meta Title Cover Publish Code Note
m Efficient Methods for Natural Language Processing: A Survey cover Publish

IST Austria

Meta Title Cover Publish Code Note
m Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference Publish
m Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks Publish
SPDY SPDY: Accurate Pruning with Speedup Guarantees cover Publish GitHub Repo stars note
OBC Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning Publish GitHub Repo stars
oBERT The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models Publish GitHub Repo stars
GPTQ GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers Publish GitHub Repo stars
SparseGPT SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot Publish GitHub Repo stars
ZipLM ZipLM: Inference-Aware Structured Pruning of Language Models cover Publish GitHub Repo stars
SpQR SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression Publish GitHub Repo stars
SquareHead Sparse Fine-tuning for Inference Acceleration of Large Language Models cover Publish GitHub Repo stars
m Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment Publish GitHub Repo stars note

Imperial College London

Meta Title Cover Publish Code Note
52A7RO95 Mixture of Experts in Large Language Models cover Publish note

Indian Institute of Science

Meta Title Cover Publish Code Note
vAttention vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention cover Publish GitHub Repo stars note

Infinigence-AI

Meta Title Cover Publish Code Note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
SpecEE SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting cover Publish GitHub Repo stars note
FlashOverlap FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation cover Publish GitHub Repo stars note

Institute for Advanced Algorithms Research

Meta Title Cover Publish Code Note
SEAP SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models cover Publish GitHub Repo stars note

Institute of Automation, Chinese Academy of Sciences

Meta Title Cover Publish Code Note
FLAP Fluctuation-based Adaptive Structured Pruning for Large Language Models cover Publish GitHub Repo stars
RIA Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models cover Publish GitHub Repo stars
Awesome-Efficient-Arch Speed Always Wins: A Survey on Efficient Architectures for Large Language Models cover Publish GitHub Repo stars note

Institute of Computing Technology

Meta Title Cover Publish Code Note
COMET COMET: Towards Practical W4A4KV4 LLMs Serving cover Publish note
BaWA BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation cover Publish note

Institute of Computing Technology, Chinese Academy of Sciences

Meta Title Cover Publish Code Note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note

Institute of Information Engineering, Chinese Academy of Sciences

Meta Title Cover Publish Code Note
m A Survey on Model Compression for Large Language Models cover Publish

Intel

Meta Title Cover Publish Code Note
SCAP Post-Training Statistical Calibration for Higher Activation Sparsity cover Publish GitHub Repo stars note

Intel Corporation

Meta Title Cover Publish Code Note
OpenVINO Post-training deep neural network pruning via layer-wise calibration Publish

Intellifusion Inc.

Meta Title Cover Publish Code Note
HCAttention HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs cover Publish note

KAIST

Meta Title Cover Publish Code Note
UC0D8DJ6 Characterizing Communication Patterns in Distributed Large Language Model Inference Publish note
1DZIJVBI Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications cover Publish note
DeltaAttention Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction cover Publish note

KAIST AI

Meta Title Cover Publish Code Note
RecursiveTransformers Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA cover Publish note
MoR Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation cover Publish GitHub Repo stars note

KAUST

Meta Title Cover Publish Code Note
DistributedGEMM A novel CUTLASS-based implementation of Tensor Parallelism for NVLink-enabled systems cover Publish GitHub Repo stars note
MoSA Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing cover Publish GitHub Repo stars note

KTH Royal Institute of Technology

Meta Title Cover Publish Code Note
Awesome-Efficient-Arch Speed Always Wins: A Survey on Efficient Architectures for Large Language Models cover Publish GitHub Repo stars note

Key Laboratory of Multimedia Trusted Perception and Efficient Computing

Meta Title Cover Publish Code Note
DSnoT Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs cover Publish GitHub Repo stars note

Kyushu University

Meta Title Cover Publish Code Note
SharedAttention Beyond KV Caching: Shared Attention for Efficient LLMs cover Publish GitHub Repo stars note

Lanzhou University

Meta Title Cover Publish Code Note
AVSS AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis cover Publish note

Leiden University

Meta Title Cover Publish Code Note
DeltaLLM DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference cover Publish note

MBZUAI

Meta Title Cover Publish Code Note
CHESS Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification Publish Pytorch note

MIT

Meta Title Cover Publish Code Note
SparseViT SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer cover Publish GitHub Repo stars note
TorchSparse++ TorchSparse++: Efficient Point Cloud Engine Publish GitHub Repo stars
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
AWQ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration Publish GitHub Repo stars
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
QServe QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Publish Pytorch note
CLA Reducing Transformer Key-Value Cache Size with Cross-Layer Attention cover Publish GitHub Repo stars note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note

MIT-IBM Watson AI Lab

Meta Title Cover Publish Code Note
CLA Reducing Transformer Key-Value Cache Size with Cross-Layer Attention cover Publish GitHub Repo stars note

MakerMaker AI

Meta Title Cover Publish Code Note
FoX Forgetting Transformer: Softmax Attention with a Forget Gate Publish GitHub Repo stars note
ACP Adaptive Computation Pruning for the Forgetting Transformer Publish GitHub Repo stars note

Massachusetts Institute of Technology

Meta Title Cover Publish Code Note
streaming-llm Efficient Streaming Language Models with Attention Sinks cover Publish GitHub Repo stars note
OSSCAR OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization Publish GitHub Repo stars note
TEAL Training-Free Activation Sparsity in Large Language Models cover Publish GitHub Repo stars note
XAttention XAttention: Block Sparse Attention with Antidiagonal Scoring cover Publish GitHub Repo stars note

Megvii Technology

Meta Title Cover Publish Code Note
MFA Multi-matrix Factorization Attention Publish note

Meituan

Meta Title Cover Publish Code Note
Super-Experts-Profilling Unveiling Super Experts in Mixture-of-Experts Large Language Models cover Publish GitHub Repo stars note

Meta

Meta Title Cover Publish Code Note
Wanda A Simple and Effective Pruning Approach for Large Language Models cover Publish GitHub Repo stars note
massive-activations Massive Activations in Large Language Models cover Publish GitHub Repo stars note
2ZU1IWL6 Fast and Simplex: 2-Simplicial Attention in Triton Publish note
sparse-frontier The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs cover Publish GitHub Repo stars note

Meta AI

Meta Title Cover Publish Code Note
streaming-llm Efficient Streaming Language Models with Attention Sinks cover Publish GitHub Repo stars note
R-Sparse R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference cover Publish GitHub Repo stars note

Meta AI (FAIR)

Meta Title Cover Publish Code Note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note

Meta Platforms Inc

Meta Title Cover Publish Code Note
TorchAO TorchAO: PyTorch-Native Training-to-Serving Model Optimization cover Publish GitHub Repo stars note

Michigan State University

Meta Title Cover Publish Code Note
m Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark Publish GitHub Repo stars note

Microsoft

Meta Title Cover Publish Code Note
LoRA LoRA: Low-rank adaptation of large language models cover Publish GitHub Repo stars
ZeroQuant ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers Publish GitHub Repo stars
LoSparse Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation cover Publish GitHub Repo stars
Compresso Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models cover Publish GitHub Repo stars note
LoRAShear LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery Publish
ZeroQuant-V2 ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation Publish GitHub Repo stars
ChunkAttention ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition cover Publish GitHub Repo stars note
SliceGPT SliceGPT: Compress Large Language Models by Deleting Rows and Columns cover Publish GitHub Repo stars note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
APEX APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving Publish GitHub Repo stars note
Domino Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note

Microsoft Azure

Meta Title Cover Publish Code Note
LoftQ LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models cover Publish GitHub Repo stars note

Microsoft Azure AI

Meta Title Cover Publish Code Note
AdaLoRA AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning cover Publish GitHub Repo stars

Microsoft Research

Meta Title Cover Publish Code Note
CoCoNet Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads cover Publish GitHub Repo stars note
nmSPARSE Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning Publish GitHub Repo stars
m A Survey on Evaluation of Large Language Models cover Publish
EAGLE EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty cover Publish GitHub Repo stars note
Q-Sparse Q-Sparse: All Large Language Models can be Fully Sparsely-Activated cover Publish note
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
POD-Attention POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference cover Publish GitHub Repo stars note
vAttention vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention cover Publish GitHub Repo stars note
LeanK LeanK: Learnable K Cache Channel Pruning for Efficient Decoding cover Publish GitHub Repo stars note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

Microsoft Research India

Meta Title Cover Publish Code Note
Vidur Vidur: A Large-Scale Simulation Framework For LLM Inference cover Publish GitHub Repo stars note
TokenWeave TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference cover Publish GitHub Repo stars note

Mila & Université de Montréal

Meta Title Cover Publish Code Note
FoX Forgetting Transformer: Softmax Attention with a Forget Gate Publish GitHub Repo stars note
ACP Adaptive Computation Pruning for the Forgetting Transformer Publish GitHub Repo stars note

MiniCPM

Meta Title Cover Publish Code Note
MiniCPM4 MiniCPM4: Ultra-Efficient LLMs on End Devices cover Publish GitHub Repo stars note

Ministry of Education of China, Xiamen University

Meta Title Cover Publish Code Note
DSnoT Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs cover Publish GitHub Repo stars note

Mohamed bin Zayed University of AI

Meta Title Cover Publish Code Note
GBLM-Pruner Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models cover Publish GitHub Repo stars note

Moonshot AI

Meta Title Cover Publish Code Note
MoBA MoBA: Mixture of Block Attention for Long-Context LLMs cover Publish GitHub Repo stars note

Multimedia Laboratory (MMLab)

Meta Title Cover Publish Code Note
SPP SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models cover Publish GitHub Repo stars note

NVIDIA

Meta Title Cover Publish Code Note
m Channel Permutations for N:M Sparsity Publish GitHub Repo stars
NMSparse Accelerating Sparse Deep Neural Networks Publish
FT FasterTransformer Publish GitHub Repo stars
DistributedGEMM A novel CUTLASS-based implementation of Tensor Parallelism for NVLink-enabled systems cover Publish GitHub Repo stars note
streaming-llm Efficient Streaming Language Models with Attention Sinks cover Publish GitHub Repo stars note
MaskLLM MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models cover Publish GitHub Repo stars note
Minitron Compact Language Models via Pruning and Knowledge Distillation cover Publish GitHub Repo stars note
DuoAttention DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cover Publish GitHub Repo stars note
QServe QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Publish Pytorch note
XGrammar XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models cover Publish GitHub Repo stars note
StarAttention Star Attention: Efficient LLM Inference over Long Sequences cover Publish GitHub Repo stars note
XAttention XAttention: Block Sparse Attention with Antidiagonal Scoring cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note
HelixParallelism Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding Publish note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note
kvpress kvpress: LLM KV cache compression made easy Publish GitHub Repo stars note

NVIDIA Research

Meta Title Cover Publish Code Note
LIMINAL Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need Publish note

Nankai University

Meta Title Cover Publish Code Note
Task-KV Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads cover Publish note

Nanjing University

Meta Title Cover Publish Code Note
STA An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers Publish
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
RaaS Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity cover Publish note

Nanyang Technological University

Meta Title Cover Publish Code Note
L-OBS Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon Publish GitHub Repo stars
SMAT Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts cover Publish GitHub Repo stars note
Cus-Prun Pruning General Large Language Models into Customized Expert Models cover Publish GitHub Repo stars note
0VRXJQ3F Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving cover Publish GitHub Repo stars note

National University of Singapore

Meta Title Cover Publish Code Note
LLM-Pruner LLM-Pruner: On the Structural Pruning of Large Language Models cover Publish GitHub Repo stars note
CachedAttention Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention cover Publish note
MaskLLM MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models cover Publish GitHub Repo stars note
AdaSkip AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference cover Publish GitHub Repo stars note
Cus-Prun Pruning General Large Language Models into Customized Expert Models cover Publish GitHub Repo stars note

Neural Magic

Meta Title Cover Publish Code Note
m Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference Publish
SPDY SPDY: Accurate Pruning with Speedup Guarantees cover Publish GitHub Repo stars note
OBC Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning Publish GitHub Repo stars
oBERT The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models Publish GitHub Repo stars
GPTQ GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers Publish GitHub Repo stars
SparseGPT SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot Publish GitHub Repo stars
ZipLM ZipLM: Inference-Aware Structured Pruning of Language Models cover Publish GitHub Repo stars
SpQR SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression Publish GitHub Repo stars
SquareHead Sparse Fine-tuning for Inference Acceleration of Large Language Models cover Publish GitHub Repo stars
m Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment Publish GitHub Repo stars note

New York University

Meta Title Cover Publish Code Note
Recycled Attention Recycled Attention: Efficient inference for long-context language models cover Publish GitHub Repo stars note

Noah’s Ark Lab, Huawei Technologies

Meta Title Cover Publish Code Note
AttentionPredictor AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference cover Publish note

Normal Computing

Meta Title Cover Publish Code Note
m Efficient Guided Generation for Large Language Models cover Publish note

North China Electric Power University

Meta Title Cover Publish Code Note
COMET COMET: Towards Practical W4A4KV4 LLMs Serving cover Publish note

Northeastern University

Meta Title Cover Publish Code Note
ADMM-pruning A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers Publish GitHub Repo stars

Northwestern University

Meta Title Cover Publish Code Note
SR-STE Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch cover Publish GitHub Repo stars

Numenta

Meta Title Cover Publish Code Note
Complementary Sparsity Two Sparsities Are Better Than One: Unlocking the Performance Benefits of Sparse-Sparse Networks cover Publish note

OPPO Research Institute

Meta Title Cover Publish Code Note
SharePrefill Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing cover Publish note

Ohio State University

Meta Title Cover Publish Code Note
CoCoNet Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads cover Publish GitHub Repo stars note
CoreInfer CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation cover Publish GitHub Repo stars note

OpenAI

Meta Title Cover Publish Code Note
blocksparse GPU Kernels for Block-Sparse Weights Publish GitHub Repo stars

OpenGVLab

Meta Title Cover Publish Code Note
OmniQuant OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models cover Publish GitHub Repo stars

OpenTeams Inc

Meta Title Cover Publish Code Note
TorchAO TorchAO: PyTorch-Native Training-to-Serving Model Optimization cover Publish GitHub Repo stars note

Oxford University

Meta Title Cover Publish Code Note
CATS CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models cover Publish GitHub Repo stars note

Peking University

Meta Title Cover Publish Code Note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note
EAGLE EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty cover Publish GitHub Repo stars note
m A Survey on Efficient Inference for Large Language Models cover Publish note
FLUX FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion cover Publish note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
FlexPrefill FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference cover Publish GitHub Repo stars note
RaaS Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity cover Publish note
HATA HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference cover Publish GitHub Repo stars note
KeepKV KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference cover Publish note
MegaScale-MoE MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production cover Publish note
SALE SALE : Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note
Awesome-Efficient-Arch Speed Always Wins: A Survey on Efficient Architectures for Large Language Models cover Publish GitHub Repo stars note

Perplexity AI

Meta Title Cover Publish Code Note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note

Princeton University

Meta Title Cover Publish Code Note
AdaLoRA AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning cover Publish GitHub Repo stars
MeZO Fine-Tuning Language Models with Just Forward Passes Publish GitHub Repo stars note
LLM-shearing Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning cover Publish GitHub Repo stars note
SnapKV SnapKV: LLM Knows What You are Looking for Before Generation cover Publish GitHub Repo stars note
TEAL Training-Free Activation Sparsity in Large Language Models cover Publish GitHub Repo stars note
GLA Hardware-Efficient Attention for Fast Decoding cover Publish GitHub Repo stars note

Purdue University

Meta Title Cover Publish Code Note
52A7RO95 Mixture of Experts in Large Language Models cover Publish note

PyTorch

Meta Title Cover Publish Code Note
Async-TP [Distributed w/ TorchTitan] Introducing Async Tensor Parallelism in PyTorch cover Publish GitHub Repo stars note

Qwen Team

Meta Title Cover Publish Code Note
Qwen3 Qwen3 Technical Report cover Publish GitHub Repo stars note

Renmin University of China

Meta Title Cover Publish Code Note
Acc-SpMM Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores cover Publish note
SEAP SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models cover Publish GitHub Repo stars note

Rice University

Meta Title Cover Publish Code Note
Deja Vu Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time cover Publish GitHub Repo stars

RiseAI-Sys

Meta Title Cover Publish Code Note
attention-gym Attention-Gym: Triton-Based Sparse and Quantization Attention Publish GitHub Repo stars note

SJTU

Meta Title Cover Publish Code Note
XAttention XAttention: Block Sparse Attention with Antidiagonal Scoring cover Publish GitHub Repo stars note
LServer LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention cover Publish GitHub Repo stars note

Salesforce AI Research

Meta Title Cover Publish Code Note
SPP SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models cover Publish GitHub Repo stars note

Salesforce Research

Meta Title Cover Publish Code Note
topk-decoding Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs cover Publish GitHub Repo stars note

Samsung

Meta Title Cover Publish Code Note
FisherPruning A Fast Post-Training Pruning Framework for Transformers cover Publish GitHub Repo stars note

Samsung AI Center

Meta Title Cover Publish Code Note
TinyTrain TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge Publish GitHub Repo stars note

Santa Clara University

Meta Title Cover Publish Code Note
SparseInfer SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference Publish note

School of Cyber Security, University of Chinese Academy of Sciences

Meta Title Cover Publish Code Note
m A Survey on Model Compression for Large Language Models cover Publish

SenseTime

Meta Title Cover Publish Code Note
0VRXJQ3F Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving cover Publish GitHub Repo stars note

SenseTime Research

Meta Title Cover Publish Code Note
BRECQ BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction Publish GitHub Repo stars
SR-STE Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch cover Publish GitHub Repo stars

Seoul National University

Meta Title Cover Publish Code Note
K-pruning Knowledge-preserving Pruning for Pre-trained Language Models without Retraining cover Publish note
L4Q L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ cover Publish note
FastKV FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation cover Publish GitHub Repo stars note

Shanghai AI Lab

Meta Title Cover Publish Code Note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note

Shanghai AI Laboratory

Meta Title Cover Publish Code Note
0VRXJQ3F Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving cover Publish GitHub Repo stars note
Awesome-Efficient-Arch Speed Always Wins: A Survey on Efficient Architectures for Large Language Models cover Publish GitHub Repo stars note

Shanghai Artificial Intelligence Laboratory

Meta Title Cover Publish Code Note
Turbo Sparse Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters Publish Pytorch note
SPP SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models cover Publish GitHub Repo stars note

Shanghai Jiao Tong University

Meta Title Cover Publish Code Note
PINS Pruning Pre-trained Language Models with Principled Importance and Self-regularization Publish GitHub Repo stars
PowerInfer PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU Publish GitHub Repo stars note
m Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption cover Publish GitHub Repo stars note
Quest Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference cover Publish GitHub Repo stars note
SIFT Sparse is Enough in Fine-tuning Pre-trained Large Language Models Publish GitHub Repo stars note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note
068ZPAME A Survey on Inference Optimization Techniques for Mixture of Experts Models cover Publish GitHub Repo stars note
DistAttention Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache cover Publish note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note
PowerInfer-2 PowerInfer-2: Fast Large Language Model Inference on a Smartphone Publish Website note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
Turbo Sparse Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters Publish Pytorch note
XGrammar XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models cover Publish GitHub Repo stars note
AdaSkip AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference cover Publish GitHub Repo stars note
CateKV CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration cover Publish note
PoD Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity cover Publish note
SpecEE SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting cover Publish GitHub Repo stars note
CometSeed Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts cover Publish GitHub Repo stars note
FlashOverlap FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation cover Publish GitHub Repo stars note
FreqKV FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension cover Publish note

Shanghai Jiaotong University

Meta Title Cover Publish Code Note
CachedAttention Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention cover Publish note
m A Survey on Efficient Inference for Large Language Models cover Publish note
SharePrefill Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing cover Publish note

ShanghaiTech University

Meta Title Cover Publish Code Note
COMET COMET: Towards Practical W4A4KV4 LLMs Serving cover Publish note

Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences (CAS)

Meta Title Cover Publish Code Note
AMALI AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs cover Publish note

Singapore University of Technology and Design

Meta Title Cover Publish Code Note
Cus-Prun Pruning General Large Language Models into Customized Expert Models cover Publish GitHub Repo stars note

Sogang University

Meta Title Cover Publish Code Note
SparseInfer SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference Publish note

Soochow University

Meta Title Cover Publish Code Note
Awesome-Efficient-Arch Speed Always Wins: A Survey on Efficient Architectures for Large Language Models cover Publish GitHub Repo stars note

Stanford

Meta Title Cover Publish Code Note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note

Stanford University

Meta Title Cover Publish Code Note
Deep Compression Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding Publish
DSD DSD: Dense-Sparse-Dense Training for Deep Neural Networks Publish
FlashAttention FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness cover Publish GitHub Repo stars
Deja Vu Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time cover Publish GitHub Repo stars
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
Flash-Decoding Flash-Decoding for long-context inference Publish note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
CATS CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models cover Publish GitHub Repo stars note
FlashAttention-2 FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning Publish GitHub Repo stars
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
MoSA Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing cover Publish GitHub Repo stars note
Seesaw Seesaw: High-throughput LLM Inference via Model Re-sharding cover Publish note

StepFun

Meta Title Cover Publish Code Note
MFA Multi-matrix Factorization Attention Publish note

StepFun Inc.

Meta Title Cover Publish Code Note
Step-3 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding Publish note

Sun Yat-sen University

Meta Title Cover Publish Code Note
Adrenaline Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation cover Publish GitHub Repo stars note

Sungkyunkwan University

Meta Title Cover Publish Code Note
L4Q L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ cover Publish note

Synthesia

Meta Title Cover Publish Code Note
SparQ SparQ Attention: Bandwidth-Efficient LLM Inference Publish note

Tencent AI Lab

Meta Title Cover Publish Code Note
RPTQ RPTQ: Reorder-based Post-training Quantization for Large Language Models Publish GitHub Repo stars note

Tencent Machine Learning Platform

Meta Title Cover Publish Code Note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note

Tencent Youtu Lab

Meta Title Cover Publish Code Note
DSnoT Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs cover Publish GitHub Repo stars note

The Chinese University of Hong Kong

Meta Title Cover Publish Code Note
Centauri Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning Publish note
SPP SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models cover Publish GitHub Repo stars note
PWGG5HBE A Survey on Large Language Model Acceleration based on KV Cache Management cover Publish GitHub Repo stars note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note
PSA Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving cover Publish GitHub Repo stars note
Awesome-Efficient-Arch Speed Always Wins: A Survey on Efficient Architectures for Large Language Models cover Publish GitHub Repo stars note

The Hebrew University of Jerusalem

Meta Title Cover Publish Code Note
TOVA Transformers are Multi-State RNNs cover Publish GitHub Repo stars note

The Hebrew University of Jerusalem, Israel

Meta Title Cover Publish Code Note
m Efficient Methods for Natural Language Processing: A Survey cover Publish

The Hong Kong Polytechnic University

Meta Title Cover Publish Code Note
PWGG5HBE A Survey on Large Language Model Acceleration based on KV Cache Management cover Publish GitHub Repo stars note

The Hong Kong University of Science and Technology

Meta Title Cover Publish Code Note
LISA LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning Publish note
ChunkKV ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference cover Publish note

The Ohio State University

Meta Title Cover Publish Code Note
UC0D8DJ6 Characterizing Communication Patterns in Distributed Large Language Model Inference Publish note

The University of Hong Kong

Meta Title Cover Publish Code Note
SIMPLE Structured Pruning for Efficient Generative Pre-trained Language Models cover Publish note
OmniQuant OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models cover Publish GitHub Repo stars
FlexPrefill FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference cover Publish GitHub Repo stars note
PowerAttention PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention cover Publish note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note

The University of North Carolina

Meta Title Cover Publish Code Note
m Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark Publish GitHub Repo stars note

The University of Texas at Austin

Meta Title Cover Publish Code Note
MIRAGE MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving Publish note
m Ten Lessons We Have Learned in the New Sparseland: A Short Handbook for Sparse Neural Network Researchers Publish
Recycled Attention Recycled Attention: Efficient inference for long-context language models cover Publish GitHub Repo stars note
R-Sparse R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference cover Publish GitHub Repo stars note

Together AI

Meta Title Cover Publish Code Note
TEAL Training-Free Activation Sparsity in Large Language Models cover Publish GitHub Repo stars note

Tongji University

Meta Title Cover Publish Code Note
GBDT Pruning Large Language Models via Accuracy Predictor cover Publish

Tsinghua University

Meta Title Cover Publish Code Note
Deep Compression Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding Publish
nmSPARSE Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning Publish GitHub Repo stars
CodeGeeX CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X Publish GitHub Repo stars note
m Training Transformers with 4-bit Integers Publish GitHub Repo stars
RIA Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models cover Publish GitHub Repo stars
m Accelerating Transformer Pre-training with 2:4 Sparsity Publish GitHub Repo stars note
AWQ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration Publish GitHub Repo stars
m A Survey on Efficient Inference for Large Language Models cover Publish note
m Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs cover Publish note
DeepSeekMoE DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cover Publish GitHub Repo stars note
MiniKV MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache cover Publish note
MoA MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression cover Publish GitHub Repo stars note
MFA Multi-matrix Factorization Attention Publish note
ProSparse ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models cover Publish GitHub Repo stars note
ReLU2 ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs cover Publish note
ReMoE ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing cover Publish GitHub Repo stars note
SageAttention2 SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization Publish GitHub Repo stars note
SageAttention SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration Publish GitHub Repo stars note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note
Turbo Sparse Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters Publish PyTorch note
AdaptiveSparseTrainer Pruning Large Language Models with Semi-Structural Adaptive Sparse Training cover Publish GitHub Repo stars note
BlockFFN BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity cover Publish GitHub Repo stars note
KVSink KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs cover Publish note
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note
SparsingLaw Sparsing Law: Towards Large Language Models with Greater Activation Sparsity cover Publish GitHub Repo stars note
XAttention XAttention: Block Sparse Attention with Antidiagonal Scoring cover Publish GitHub Repo stars note
NanoFlow NanoFlow: Towards Optimal Large Language Model Serving Throughput cover Publish GitHub Repo stars note
LinearPatch A Simple Linear Patch Revives Layer-Pruned Large Language Models cover Publish note
FlashOverlap FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation cover Publish GitHub Repo stars note
KeepKV KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference cover Publish note
LeanK LeanK: Learnable K Cache Channel Pruning for Efficient Decoding cover Publish GitHub Repo stars note
MoBA MoBA: Mixture of Block Attention for Long-Context LLMs cover Publish GitHub Repo stars note
ReSA Rectified Sparse Attention cover Publish GitHub Repo stars note
SageAttention3 SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Publish GitHub Repo stars note
SeerAttention-R SeerAttention-R: Sparse Attention Adaptation for Long Reasoning cover Publish GitHub Repo stars note
Super-Experts-Profilling Unveiling Super Experts in Mixture-of-Experts Large Language Models cover Publish GitHub Repo stars note

UC Berkeley

Meta Title Cover Publish Code Note
ActNN ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training Publish GitHub Repo stars
FisherPruning A Fast Post-Training Pruning Framework for Transformers cover Publish GitHub Repo stars note
PagedAttention Efficient Memory Management for Large Language Model Serving with PagedAttention cover Publish GitHub Repo stars note
LoRA+ LoRA+: Efficient Low Rank Adaptation of Large Models Publish GitHub Repo stars note
SqueezeLLM SqueezeLLM: Dense-and-Sparse Quantization cover Publish GitHub Repo stars note
SGLang SGLang: Efficient Execution of Structured Language Model Programs cover Publish GitHub Repo stars note
KVQuant KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization Publish GitHub Repo stars note
DoubleSparsity Post-Training Sparse Attention with Double Sparsity Publish GitHub Repo stars note
HashAttention HashAttention: Semantic Sparsity for Faster Inference cover Publish GitHub Repo stars note

UC Santa Barbara

Meta Title Cover Publish Code Note
FlexiDepth Adaptive Layer-skipping in Pre-trained LLMs cover Publish note
IFPruning Instruction-Following Pruning for Large Language Models cover Publish note
KVLink KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse cover Publish GitHub Repo stars note

UCSD

Meta Title Cover Publish Code Note
GPFQ A Greedy Algorithm for Quantizing Neural Networks Publish GitHub Repo stars
GPFQv2 Post-training Quantization for Neural Networks with Provable Guarantees Publish GitHub Repo stars

University of Sydney

Meta Title Cover Publish Code Note
Flash-LLM Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity cover Publish GitHub Repo stars note

Universidade da Coruña

Meta Title Cover Publish Code Note
VENOM VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores cover Publish GitHub Repo stars note

Universidade de Lisboa

Meta Title Cover Publish Code Note
AdaSplash AdaSplash: Adaptive Sparse Flash Attention Publish GitHub Repo stars note

University College London

Meta Title Cover Publish Code Note
CATS CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models cover Publish GitHub Repo stars note

University of Basel

Meta Title Cover Publish Code Note
Adaptively Sparse Attention Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers cover Publish

University of California

Meta Title Cover Publish Code Note
DSA Transformer Acceleration with Dynamic Sparse Attention cover Publish note

University of California, Berkeley

Meta Title Cover Publish Code Note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
APEX APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving Publish GitHub Repo stars note
XGrammar XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models cover Publish GitHub Repo stars note
SpargeAttn SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference cover Publish GitHub Repo stars note

University of California, Riverside

Meta Title Cover Publish Code Note
CoCoNet Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads cover Publish GitHub Repo stars note

University of California, San Diego

Meta Title Cover Publish Code Note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note

University of Cambridge, United Kingdom

Meta Title Cover Publish Code Note
TinyTrain TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge Publish GitHub Repo stars note

University of Chinese Academy of Sciences

Meta Title Cover Publish Code Note
Q-Sparse Q-Sparse: All Large Language Models can be Fully Sparsely-Activated cover Publish note
COMET COMET: Towards Practical W4A4KV4 LLMs Serving cover Publish note

University of Connecticut

Meta Title Cover Publish Code Note
m Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm Publish GitHub Repo stars

University of Edinburgh

Meta Title Cover Publish Code Note
sparse-frontier The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs cover Publish GitHub Repo stars note

University of Electronic Science and Technology of China

Meta Title Cover Publish Code Note
BRECQ BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction Publish GitHub Repo stars

University of Hong Kong

Meta Title Cover Publish Code Note
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note

University of Illinois Urbana-Champaign

Meta Title Cover Publish Code Note
LISA LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning Publish note
SnapKV SnapKV: LLM Knows What You are Looking for Before Generation cover Publish GitHub Repo stars note

University of Macau

Meta Title Cover Publish Code Note
Awesome-Efficient-Arch Speed Always Wins: A Survey on Efficient Architectures for Large Language Models cover Publish GitHub Repo stars note

University of Maryland

Meta Title Cover Publish Code Note
topk-decoding Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs cover Publish GitHub Repo stars note

University of Massachusetts Amherst

Meta Title Cover Publish Code Note
CoCoNet Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads cover Publish GitHub Repo stars note

University of Oxford

Meta Title Cover Publish Code Note
SMAT Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts cover Publish GitHub Repo stars note

University of Science and Technology

Meta Title Cover Publish Code Note
AMALI AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs cover Publish note

University of Science and Technology of China

Meta Title Cover Publish Code Note
AdaKV Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference cover Publish GitHub Repo stars note
AttentionPredictor AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference cover Publish note
HATA HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference cover Publish GitHub Repo stars note

University of Seoul

Meta Title Cover Publish Code Note
SparseInfer SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference Publish note

University of Southern California

Meta Title Cover Publish Code Note
APEX APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving Publish GitHub Repo stars note

University of St Andrews

Meta Title Cover Publish Code Note
Mosaic Mosaic: Composite Projection Pruning for Resource-efficient LLMs cover Publish note

University of Surrey

Meta Title Cover Publish Code Note
MInference MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention cover Publish GitHub Repo stars note
SCBench SCBench: A KV Cache-Centric Analysis of Long-Context Methods Publish GitHub Repo stars note
MMInference MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Publish note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note
Selective Context Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering cover Publish GitHub Repo stars

University of Texas at Austin

Meta Title Cover Publish Code Note
H2O H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cover Publish GitHub Repo stars note
Essential Sparsity The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter Publish GitHub Repo stars
LLM-KICK Compressing LLMs: The Truth is Rarely Pure and Never Simple cover Publish GitHub Repo stars note
OWL Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity cover Publish GitHub Repo stars

University of Toronto

Meta Title Cover Publish Code Note
Seesaw Seesaw: High-throughput LLM Inference via Model Re-sharding cover Publish note

University of Washington

Meta Title Cover Publish Code Note
QLoRA QLoRA: Efficient Finetuning of Quantized LLMs cover Publish GitHub Repo stars
OWL Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity cover Publish GitHub Repo stars
SeerAttention SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs cover Publish GitHub Repo stars note
NSA Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cover Publish note
POD-Attention POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference cover Publish GitHub Repo stars note
NanoFlow NanoFlow: Towards Optimal Large Language Model Serving Throughput cover Publish GitHub Repo stars note
FlashInfer FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Publish GitHub Repo stars note

University of Waterloo

Meta Title Cover Publish Code Note
EAGLE EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty cover Publish GitHub Repo stars note

University of Wisconsin

Meta Title Cover Publish Code Note
R-KV R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration cover Publish GitHub Repo stars note

University of Wisconsin-Madison

Meta Title Cover Publish Code Note
T3 T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives cover Publish note
FrameQuant FrameQuant: Flexible Low-Bit Quantization for Transformers cover Publish GitHub Repo stars note

VITA Group

Meta Title Cover Publish Code Note
m Ten Lessons We Have Learned in the New Sparseland: A Short Handbook for Sparse Neural Network Researchers Publish
Essential Sparsity The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter Publish GitHub Repo stars

Vector Institute

Meta Title Cover Publish Code Note
EAGLE EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty cover Publish GitHub Repo stars note
Seesaw Seesaw: High-throughput LLM Inference via Model Re-sharding cover Publish note

Vizuara AI Labs

Meta Title Cover Publish Code Note
MoE-MLA-RoPE Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models Publish note

Vokram Group

Meta Title Cover Publish Code Note
52A7RO95 Mixture of Experts in Large Language Models cover Publish note

WeChat AI

Meta Title Cover Publish Code Note
DBudgetKV DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance cover Publish note

Wuhan University

Meta Title Cover Publish Code Note
m Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption cover Publish GitHub Repo stars note
SIFT Sparse is Enough in Fine-tuning Pre-trained Large Language Models Publish GitHub Repo stars note
CHESS CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification Publish PyTorch note
SpindleKV SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers cover Publish GitHub Repo stars note

Xiamen University

Meta Title Cover Publish Code Note
Compresso Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models cover Publish GitHub Repo stars note

Xiaohongshu

Meta Title Cover Publish Code Note
ZigZagKV ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty cover Publish note

Xiaomi

Meta Title Cover Publish Code Note
SpindleKV SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers cover Publish GitHub Repo stars note

Yale University

Meta Title Cover Publish Code Note
Diffuser Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences cover Publish GitHub Repo stars

Zhejiang University

Meta Title Cover Publish Code Note
Deja Vu Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time cover Publish GitHub Repo stars
SharePrefill Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing cover Publish note
MoBA MoBA: Mixture of Block Attention for Long-Context LLMs cover Publish GitHub Repo stars note

Zhipu.AI

Meta Title Cover Publish Code Note
CodeGeeX CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X Publish GitHub Repo stars note
SampleAttention SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention cover Publish note

Zhongguancun Laboratory

Meta Title Cover Publish Code Note
SMP Pruning Pre-trained Language Models Without Fine-Tuning cover Publish GitHub Repo stars

Baidu

Meta Title Cover Publish Code Note
FlashMask FlashMask: Efficient and Rich Mask Extension of FlashAttention cover Publish GitHub Repo stars note

iFLYTEK Research

Meta Title Cover Publish Code Note
GRAIN Gradient-based Intra-attention Pruning on Pre-trained Language Models cover Publish GitHub Repo stars note

inst1

Meta Title Cover Publish Code Note
Sparse-IFT Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency cover Publish GitHub Repo stars note
YS9YTT55 LLM Inference Serving: Survey of Recent Advances and Opportunities Publish note
CacheBlend CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion cover Publish GitHub Repo stars note
AhaKV AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models Publish note
CCQ CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs Publish note
209M5GA7 KV Cache Compression for Inference Efficiency in LLMs: A Review Publish note
QuickSilver QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization Publish note
RadialAttention Radial Attention: Sparse Attention with Energy Decay for Long Video Generation Publish note

inst2

Meta Title Cover Publish Code Note
Flash-Decoding Flash-Decoding for long-context inference Publish note
KVQuant KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization Publish GitHub Repo stars note