Institution
AWS AI Labs
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
m | Structural Pruning of Large Language Models via Neural Architecture Search | ![]() |
Advanced Micro Devices
Alibaba Cloud
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
07NWF4VE | Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching | ![]() | | | note |
Alibaba Group
Apple
Beihang University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SMP | Pruning Pre-trained Language Models Without Fine-Tuning | ![]() |
ByteDance
ByteDance Seed
CMU
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
massive-activations | Massive Activations in Large Language Models | ![]() | | | note |
CPII under InnoHK
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SPP | SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models | ![]() | | | note |
Carnegie Mellon University
CentML
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Seesaw | Seesaw: High-throughput LLM Inference via Model Re-sharding | ![]() | | | note |
Center for Advanced AI
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
KVLink | KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse | ![]() | | | note |
Central South University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
07NWF4VE | Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching | ![]() | | | note |
Cerebras Systems
Chinese Academy of Sciences
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
CoCoNet | Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads | ![]() | | | note |
Chinese University of Hong Kong
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
068ZPAME | A Survey on Inference Optimization Techniques for Mixture of Experts Models | ![]() | | | note |
Chongqing University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
GBDT | Pruning Large Language Models via Accuracy Predictor | ![]() |
City University of Hong Kong
Cohere
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SnapKV | SnapKV: LLM Knows What You are Looking for Before Generation | ![]() | | | note |
sparse-frontier | The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs | ![]() | | | note |
Comenius University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
ADMM-pruning | Fast and Effective Weight Update for Pruned Large Language Models | | | | note |
Computer Network Information Center, Chinese Academy of Sciences
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Acc-SpMM | Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores | ![]() | | | note |
Cornell University
DENSO IT Lab
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SAS | SAS: Structured Activation Sparsification | ![]() | | | note |
DeepAuto.ai
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
DeltaAttention | Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction | ![]() | | | note |
DeepMind
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
m | Fast Sparse ConvNets |
DeepSeek-AI
DeepSpeed
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Domino | Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping | ![]() | | | note |
Delft University of Technology
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
DeltaLLM | DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference | ![]() | | | note |
Duke University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
CoreInfer | CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation | ![]() | | | note |
ETH Zurich
Eindhoven University of Technology
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
OWL | Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity | ![]() |
Emory University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SparseLLM | SparseLLM: Towards Global Pruning for Pre-trained Language Models | | | | note |
FAIR
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
TOVA | Transformers are Multi-State RNNs | ![]() | | | note |
Fairleigh Dickinson University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
MIRAGE | MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving | | | | note |
Fudan University
Gaoling School of Artificial Intelligence, Renmin University of China
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
m | A Survey on Model Compression for Large Language Models | ![]() |
Georgia Institute of Technology
Google Cloud
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
MoR | Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation | ![]() | | | note |
Google DeepMind
Google Research
Graphcore Research
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SparQ | SparQ Attention: Bandwidth-Efficient LLM Inference | | | | note |
HKUST
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Awesome-Efficient-Arch | Speed Always Wins: A Survey on Efficient Architectures for Large Language Models | ![]() | | | note |
Habana Labs
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
MVUE | Minimum Variance Unbiased N:M Sparsity for the Neural Gradients |
Harbin Institute of Technology
Harvard University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SMAT | Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts | ![]() | | | note |
Heriot-Watt University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
52A7RO95 | Mixture of Experts in Large Language Models | ![]() | | | note |
Hong Kong University of Science and Technology
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
PWGG5HBE | A Survey on Large Language Model Acceleration based on KV Cache Management | ![]() | | | note |
Houmo AI
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
RPTQ | RPTQ: Reorder-based Post-training Quantization for Large Language Models | | | | note |
Huawei
Huawei Cloud
Huawei Noah's Ark Lab
Huawei Technologies
Huazhong University of Science and Technology
Hugging Face
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Movement Pruning | Movement Pruning: Adaptive Sparsity by Fine-Tuning | ![]() |
IST
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
m | Efficient Methods for Natural Language Processing: A Survey | ![]() |
IST Austria
Imperial College London
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
52A7RO95 | Mixture of Experts in Large Language Models | ![]() | | | note |
Indian Institute of Science
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
vAttention | vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention | ![]() | | | note |
Infinigence-AI
Institute for Advanced Algorithms Research
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SEAP | SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models | ![]() | | | note |
Institute of Automation, Chinese Academy of Sciences
Institute of Computing Technology
Institute of Computing Technology, Chinese Academy of Sciences
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
ProSparse | ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models | ![]() | | | note |
Institute of Information Engineering, Chinese Academy of Sciences
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
m | A Survey on Model Compression for Large Language Models | ![]() |
Intel
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SCAP | Post-Training Statistical Calibration for Higher Activation Sparsity | ![]() | | | note |
Intel Corporation
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
OpenVINO | Post-training deep neural network pruning via layer-wise calibration |
Intellifusion Inc.
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
HCAttention | HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs | ![]() | | | note |
KAIST
KAIST AI
KAUST
KTH Royal Institute of Technology
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Awesome-Efficient-Arch | Speed Always Wins: A Survey on Efficient Architectures for Large Language Models | ![]() | | | note |
Key Laboratory of Multimedia Trusted Perception and Efficient Computing
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
DSnoT | Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs | ![]() | | | note |
Kyushu University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SharedAttention | Beyond KV Caching: Shared Attention for Efficient LLMs | ![]() | | | note |
Lanzhou University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
AVSS | AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis | ![]() | | | note |
Leiden University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
DeltaLLM | DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference | ![]() | | | note |
MBZUAI
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
CHESS | CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification | | | Pytorch | note |
MIT
MIT-IBM Watson AI Lab
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
CLA | Reducing Transformer Key-Value Cache Size with Cross-Layer Attention | ![]() | | | note |
MakerMaker AI
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
FoX | Forgetting Transformer: Softmax Attention with a Forget Gate | | | | note |
ACP | Adaptive Computation Pruning for the Forgetting Transformer | | | | note |
Massachusetts Institute of Technology
Megvii Technology
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
MFA | Multi-matrix Factorization Attention | | | | note |
Meituan
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Super-Experts-Profilling | Unveiling Super Experts in Mixture-of-Experts Large Language Models | ![]() | | | note |
Meta
Meta AI
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
streaming-llm | Efficient Streaming Language Models with Attention Sinks | ![]() | | | note |
R-Sparse | R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference | ![]() | | | note |
Meta AI (FAIR)
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
H2O | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | ![]() | | | note |
Meta Platforms Inc
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
TorchAO | TorchAO: PyTorch-Native Training-to-Serving Model Optimization | ![]() | | | note |
Michigan State University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
m | Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark | | | | note |
Microsoft
Microsoft Azure
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
LoftQ | LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models | ![]() | | | note |
Microsoft Azure AI
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
AdaLoRA | AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning | ![]() |
Microsoft Research
Microsoft Research India
Mila & Universite de Montreal
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
FoX | Forgetting Transformer: Softmax Attention with a Forget Gate | | | | note |
ACP | Adaptive Computation Pruning for the Forgetting Transformer | | | | note |
MiniCPM
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
MiniCPM4 | MiniCPM4: Ultra-Efficient LLMs on End Devices | ![]() | | | note |
Ministry of Education of China, Xiamen University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
DSnoT | Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs | ![]() | | | note |
Mohamed bin Zayed University of AI
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
GBLM-Pruner | Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models | ![]() | | | note |
Moonshot AI
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
MoBA | MoBA: Mixture of Block Attention for Long-Context LLMs | ![]() | | | note |
Multimedia Laboratory (MMLab)
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SPP | SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models | ![]() | | | note |
NVIDIA
NVIDIA Research
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
LIMINAL | Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need | | | | note |
NanKai University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Task-KV | Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads | ![]() | | | note |
Nanjing University
Nanyang Technological University
National University of Singapore
Neural Magic
New York University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Recycled Attention | Recycled Attention: Efficient inference for long-context language models | ![]() | | | note |
Noah’s Ark Lab, Huawei Technologies
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
AttentionPredictor | AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference | ![]() | | | note |
Normal Computing
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
m | Efficient Guided Generation for Large Language Models | ![]() | | | note |
North China Electric Power University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
COMET | COMET: Towards Practical W4A4KV4 LLMs Serving | ![]() | | | note |
Northeastern University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
ADMM-pruning | A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers |
Northwestern University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SR-STE | Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch | ![]() |
Numenta
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Complementary Sparsity | Two Sparsities Are Better Than One: Unlocking the Performance Benefits of Sparse-Sparse Networks | ![]() | | | note |
OPPO Research Institute
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SharePrefill | Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing | ![]() | | | note |
Ohio State University
OpenAI
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
blocksparse | GPU Kernels for Block-Sparse Weights |
OpenGVLab
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
OmniQuant | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | ![]() |
OpenTeams Inc
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
TorchAO | TorchAO: PyTorch-Native Training-to-Serving Model Optimization | ![]() | | | note |
Oxford University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
CATS | CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models | ![]() | | | note |
Peking University
Perplexity AI
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
FlashInfer | FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving | | | | note |
Princeton University
Purdue University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
52A7RO95 | Mixture of Experts in Large Language Models | ![]() | | | note |
PyTorch
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Async-TP | [Distributed w/ TorchTitan] Introducing Async Tensor Parallelism in PyTorch | ![]() | | | note |
Qwen Team
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Qwen3 | Qwen3 Technical Report | ![]() | | | note |
Renmin University of China
Rice University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Deja Vu | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ![]() |
RiseAI-Sys
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
attention-gym | Attention-Gym: Triton-Based Sparse and Quantization Attention | | | | note |
SJTU
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
XAttention | XAttention: Block Sparse Attention with Antidiagonal Scoring | ![]() | | | note |
LServer | LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention | ![]() | | | note |
Salesforce AI Research
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SPP | SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models | ![]() | | | note |
Salesforce Research
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
topk-decoding | Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs | ![]() | | | note |
Samsung
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
FisherPruning | A Fast Post-Training Pruning Framework for Transformers | ![]() | | | note |
Samsung AI Center
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
TinyTrain | TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge | | | | note |
Santa Clara University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SparseInfer | SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference | | | | note |
School of Cyber Security, University of Chinese Academy of Sciences
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
m | A Survey on Model Compression for Large Language Models | ![]() |
SenseTime
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
0VRXJQ3F | Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving | ![]() | | | note |
SenseTime Research
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
BRECQ | BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction | | | | |
SR-STE | Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch | ![]() |
Seoul National University
Shanghai AI Lab
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Centauri | Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning | | | | note |
Shanghai AI Laboratory
Shanghai Artificial Intelligence Laboratory
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Turbo Sparse | Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters | | | Pytorch | note |
SPP | SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models | ![]() | | | note |
Shanghai Jiao Tong University
Shanghai Jiaotong University
ShanghaiTech University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
COMET | COMET: Towards Practical W4A4KV4 LLMs Serving | ![]() | | | note |
Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences (CAS)
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
AMALI | AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs | ![]() | | | note |
Singapore University of Technology and Design
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Cus-Prun | Pruning General Large Language Models into Customized Expert Models | ![]() | | | note |
Sogang University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SparseInfer | SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference | | | | note |
Soochow University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Awesome-Efficient-Arch | Speed Always Wins: A Survey on Efficient Architectures for Large Language Models | ![]() | | | note |
Stanford
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
DoubleSparsity | Post-Training Sparse Attention with Double Sparsity | | | | note |
Stanford University
StepFun
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
MFA | Multi-matrix Factorization Attention | | | | note |
StepFun Inc.
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Step-3 | Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding | | | | note |
Sun Yat-sen University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Adrenaline | Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation | ![]() | | | note |
Sungkyunkwan University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
L4Q | L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ | ![]() | | | note |
Synthesia
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SparQ | SparQ Attention: Bandwidth-Efficient LLM Inference | | | | note |
Tencent AI Lab
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
RPTQ | RPTQ: Reorder-based Post-training Quantization for Large Language Models | | | | note |
Tencent Machine Learning Platform
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
ProSparse | ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models | ![]() | | | note |
Tencent Youtu Lab
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
DSnoT | Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs | ![]() | | | note |
The Chinese University of Hong Kong
The Hebrew University of Jerusalem
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
TOVA | Transformers are Multi-State RNNs | ![]() | | | note |
The Hebrew University of Jerusalem, Israel
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
m | Efficient Methods for Natural Language Processing: A Survey | ![]() |
The Hong Kong Polytechnic University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
PWGG5HBE | A Survey on Large Language Model Acceleration based on KV Cache Management | ![]() | | | note |
The Hong Kong University of Science and Technology
The Ohio State University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
UC0D8DJ6 | Characterizing Communication Patterns in Distributed Large Language Model Inference | | | | note |
The University of Hong Kong
The University of North Carolina
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
m | Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark | | | | note |
The University of Texas at Austin
Together AI
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
TEAL | Training-Free Activation Sparsity in Large Language Models | ![]() | | | note |
Tongji University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
GBDT | Pruning Large Language Models via Accuracy Predictor | ![]() |
Tsinghua University
UC Berkeley
UC Santa Barbara
UCSD
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
GPFQ | A Greedy Algorithm for Quantizing Neural Networks | | | | |
GPFQv2 | Post-training Quantization for Neural Networks with Provable Guarantees | | | | |
University of Sydney
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Flash-LLM | Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | ![]() | | | note |
Universidade da Coruña
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
VENOM | VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores | ![]() | | | note |
Universidade de Lisboa
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
AdaSplash | AdaSplash: Adaptive Sparse Flash Attention | | | | note |
University College London
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
CATS | CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models | ![]() | | | note |
University of Basel
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Adaptively Sparse Attention | Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers | ![]() |
University of California
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
DSA | Transformer Acceleration with Dynamic Sparse Attention | ![]() | | | note |
University of California, Berkeley
University of California, Riverside
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
CoCoNet | Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads | ![]() | | | note |
University of California, San Diego
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
H2O | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | ![]() | | | note |
University of Cambridge, United Kingdom
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
TinyTrain | TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge | | | | note |
University of Chinese Academy of Sciences
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Q-Sparse | Q-Sparse: All Large Language Models can be Fully Sparsely-Activated | ![]() | | | note |
COMET | COMET: Towards Practical W4A4KV4 LLMs Serving | ![]() | | | note |
University of Connecticut
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
m | Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm |
University of Edinburgh
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
sparse-frontier | The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs | ![]() | | | note |
University of Electronic Science and Technology of China
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
BRECQ | BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction |
University of Hong Kong
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SeerAttention | SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs | ![]() | | | note |
University of Illinois Urbana-Champaign
University of Macau
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Awesome-Efficient-Arch | Speed Always Wins: A Survey on Efficient Architectures for Large Language Models | ![]() | | | note |
University of Maryland
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
topk-decoding | Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs | ![]() | | | note |
University of Massachusetts Amherst
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
CoCoNet | Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads | ![]() | | | note |
University of Oxford
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SMAT | Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts | ![]() | | | note |
University of Science and Technology
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
AMALI | AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs | ![]() | | | note |
University of Science and Technology of China
University of Seoul
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SparseInfer | SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference | | | | note |
University of Southern California
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
APEX | APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving | | | | note |
University of St Andrews
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Mosaic | Mosaic: Composite Projection Pruning for Resource-efficient LLMs | ![]() | | | note |
University of Surrey
University of Surrey, UK
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Selective Context | Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering | ![]() |
University of Texas at Austin
University of Toronto
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Seesaw | Seesaw: High-throughput LLM Inference via Model Re-sharding | ![]() | | | note |
University of Washington
University of Waterloo
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
EAGLE | EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty | ![]() | | | note |
University of Wisconsin
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
R-KV | R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration | ![]() | | | note |
University of Wisconsin-Madison
VITA Group
Vector Institute
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
EAGLE | EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty | ![]() | | | note |
Seesaw | Seesaw: High-throughput LLM Inference via Model Re-sharding | ![]() | | | note |
Vizuara AI Labs
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
MoE-MLA-RoPE | Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models | | | | note |
Vokram Group
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
52A7RO95 | Mixture of Experts in Large Language Models | ![]() | | | note |
WeChat AI
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
DBudgetKV | DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance | ![]() | | | note |
Wuhan University
Xiamen University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Compresso | Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models | ![]() | | | note |
Xiaohongshu
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
ZigZagKV | ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty | ![]() | | | note |
Xiaomi
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SpindleKV | SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers | ![]() | | | note |
Yale University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Diffuser | Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences | ![]() |
Zhe Jiang University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
Deja Vu | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ![]() |
Zhejiang University
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SharePrefill | Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing | ![]() | | | note |
MoBA | MoBA: Mixture of Block Attention for Long-Context LLMs | ![]() | | | note |
Zhipu.AI
Zhongguancun Laboratory
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
SMP | Pruning Pre-trained Language Models Without Fine-Tuning | ![]() |
Baidu
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
FlashMask | FlashMask: Efficient and Rich Mask Extension of FlashAttention | ![]() | | | note |
iFLYTEK Research
Meta | Title | Cover | Publish | Code | Note |
---|---|---|---|---|---|
GRAIN | Gradient-based Intra-attention Pruning on Pre-trained Language Models | ![]() | | | note |