EfficientPaper

A paper list on pruning, quantization, and efficient inference/training.

Getting Started

git clone https://github.com/hustzxd/EfficientPaper
pip install protobuf==5.27.2 pandas arxiv

  1. Add paper information by running ./add_paper_info.sh
  2. Run ./refresh_readme.sh to regenerate this README
Example meta file (efficient_paper.prototxt):

paper {
  title: "EfficientPaper: manage your research papers in an efficient way."
  abbr: "EfficientPaper"
  url: "https://github.com/hustzxd/EfficientPaper"
  authors: "hustzxd"
}
pub {
  where: "GitHub"
  year: 2023
}
code {
  type: "Pytorch"
  url: "https://github.com/hustzxd/EfficientPaper"
}
note {
  url: "EfficientPaper.md"
}
keyword {
  words: efficient_paper
}
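
Each entry in the list is driven by a meta file like the one above. As a minimal sketch of how such a file can be read back, the snippet below parses it with protobuf's text_format; the module name efficient_paper_pb2, the message name PaperInfo, and the meta/ path are assumptions for illustration, not necessarily what the repository's scripts actually use.

from google.protobuf import text_format

import efficient_paper_pb2  # hypothetical protoc output for the repo's schema


def load_paper(path):
    # Parse one .prototxt meta file into a PaperInfo message (message name assumed).
    msg = efficient_paper_pb2.PaperInfo()
    with open(path, encoding="utf-8") as f:
        text_format.Parse(f.read(), msg)
    return msg


# Assumes paper/pub are singular fields, as in the example above; path is illustrative.
info = load_paper("meta/EfficientPaper.prototxt")
print(info.paper.title, info.pub.year)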

Recent Changes

Commit   Date        Author   Message        Path
8adb12c  2025-09-28  hustzxd  weekly_update  docs/weekly_paper
8adb12c  2025-09-28  hustzxd  weekly_update  meta
a20aef7  2025-09-19  hustzxd  weekly-update  docs/weekly_paper
a20aef7  2025-09-19  hustzxd  weekly-update  meta
6535071  2025-09-16  hustzxd  5papers        docs/weekly_paper
6535071  2025-09-16  hustzxd  5papers        meta
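
This table is presumably regenerated by ./refresh_readme.sh. Purely as a rough sketch (not the repository's actual script), something like the following could pull a similar table out of git history with subprocess and pandas, which is already a listed dependency; the paths and column names simply mirror the table above.

import subprocess

import pandas as pd


def recent_changes(paths=("docs/weekly_paper", "meta"), n=3):
    # Last n commits touching each path: short hash, author date, author, subject.
    rows = []
    for path in paths:
        log = subprocess.run(
            ["git", "log", f"-{n}", "--pretty=%h|%as|%an|%s", "--", path],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in log.strip().splitlines():
            commit, date, author, message = line.split("|", 3)
            rows.append((commit, date, author, message, path))
    df = pd.DataFrame(rows, columns=["Commit", "Date", "Author", "Message", "Path"])
    return df.sort_values("Date", ascending=False, ignore_index=True)


print(recent_changes().to_string(index=False))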

Paper List

year

2026

  1. FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation [Publish] GitHub Repo stars

2025

  1. AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference [Publish] GitHub Repo stars
  2. Pruning Large Language Models with Semi-Structural Adaptive Sparse Training [Publish] GitHub Repo stars
  3. QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead [Publish] GitHub Repo stars
  4. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention [Publish]
  5. Training LLMs with MXFP4 [Publish] GitHub Repo stars
  6. COMET: Towards Practical W4A4KV4 LLMs Serving [Publish]
  7. POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference [Publish] GitHub Repo stars
  8. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention [Publish] GitHub Repo stars
  9. BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity [Publish] GitHub Repo stars
  10. KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs [Publish]
  11. Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism [Publish]
  12. Tree of Agents: Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning [Publish] GitHub Repo stars
  13. CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion [Publish] GitHub Repo stars
  14. FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference [Publish] GitHub Repo stars
  15. Forgetting Transformer: Softmax Attention with a Forget Gate [Publish] GitHub Repo stars
  16. R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [Publish] GitHub Repo stars
  17. ReAttention: Training-Free Infinite Context with Finite Attention Scope [Publish] GitHub Repo stars
  18. Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA [Publish]
  19. TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention [Publish] GitHub Repo stars
  20. Training-Free Activation Sparsity in Large Language Models [Publish] GitHub Repo stars
  21. AdaSplash: Adaptive Sparse Flash Attention [Publish] GitHub Repo stars
  22. BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation [Publish]
  23. CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration [Publish]
  24. Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity [Publish]
  25. HashAttention: Semantic Sparsity for Faster Inference [Publish] GitHub Repo stars
  26. La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation [Publish]
  27. MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention [Publish]
  28. ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference [Publish] GitHub Repo stars
  29. SlimLLM: Accurate Structured Pruning for Large Language Models [Publish]
  30. SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference [Publish] GitHub Repo stars
  31. Sparsing Law: Towards Large Language Models with Greater Activation Sparsity [Publish] GitHub Repo stars
  32. Star Attention: Efficient LLM Inference over Long Sequences [Publish] GitHub Repo stars
  33. XAttention: Block Sparse Attention with Antidiagonal Scoring [Publish] GitHub Repo stars
  34. TorchAO: PyTorch-Native Training-to-Serving Model Optimization [Publish] GitHub Repo stars
  35. AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs [Publish]
  36. SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting [Publish] GitHub Repo stars
  37. Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models [Publish]
  38. Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving [Publish] GitHub Repo stars
  39. Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction [Publish]
  40. MoBA: Mixture of Block Attention for Long-Context LLMs [Publish] GitHub Repo stars
  41. SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training [Publish] GitHub Repo stars
  42. Týr-the-Pruner: Unlocking Accurate 50% Structural Pruning for LLMs via Global Sparsity Distribution Optimization [Publish]
  43. NanoFlow: Towards Optimal Large Language Model Serving Throughput [Publish] GitHub Repo stars
  44. Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores [Publish]
  45. PQCache: Product Quantization-based KVCache for Long Context LLM Inference [Publish]
  46. A Simple Linear Patch Revives Layer-Pruned Large Language Models [Publish]
  47. Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching [Publish]
  48. Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing [Publish]
  49. Adaptive Computation Pruning for the Forgetting Transformer [Publish] GitHub Repo stars
  50. Adaptive Layer-skipping in Pre-trained LLMs [Publish]
  51. AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models [Publish]
  52. Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models [Publish]
  53. AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference [Publish]
  54. Binary Quantization For LLMs Through Dynamic Grouping [Publish] GitHub Repo stars
  55. CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs [Publish]
  56. Characterizing Communication Patterns in Distributed Large Language Model Inference [Publish]
  57. Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications [Publish]
  58. ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference [Publish]
  59. Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts [Publish] GitHub Repo stars
  60. DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [Publish]
  61. DReSS: Data-driven Regularized Structured Streamlining for Large Language Models [Publish]
  62. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Publish] GitHub Repo stars
  63. DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference [Publish]
  64. Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need [Publish]
  65. Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity [Publish]
  66. EvolKV: Evolutionary KV Cache Compression for LLM Inference [Publish]
  67. Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs [Publish] GitHub Repo stars
  68. Fast and Simplex: 2-Simplicial Attention in Triton [Publish]
  69. FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation [Publish] GitHub Repo stars
  70. Faster VGGT with Block-Sparse Global Attention [Publish]
  71. Flash Sparse Attention: An Alternative Efficient Implementation of Native Sparse Attention Kernel [Publish] GitHub Repo stars
  72. FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving [Publish] GitHub Repo stars
  73. FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension [Publish]
  74. GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models [Publish] GitHub Repo stars
  75. HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference [Publish] GitHub Repo stars
  76. HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs [Publish]
  77. Hardware-Efficient Attention for Fast Decoding [Publish] GitHub Repo stars
  78. Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding [Publish]
  79. Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation [Publish] GitHub Repo stars
  80. Instruction-Following Pruning for Large Language Models [Publish]
  81. KV Cache Compression for Inference Efficiency in LLMs: A Review [Publish]
  82. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse [Publish] GitHub Repo stars
  83. KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache [Publish]
  84. KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference [Publish]
  85. LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation [Publish] GitHub Repo stars
  86. LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention [Publish] GitHub Repo stars
  87. LeanK: Learnable K Cache Channel Pruning for Efficient Decoding [Publish] GitHub Repo stars
  88. MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving [Publish]
  89. MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production [Publish]
  90. MiniCPM4: Ultra-Efficient LLMs on End Devices [Publish] GitHub Repo stars
  91. MiniMax-01: Scaling Foundation Models with Lightning Attention [Publish] GitHub Repo stars
  92. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention [Publish] GitHub Repo stars
  93. Mixture of Experts in Large Language Models [Publish]
  94. Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing [Publish] GitHub Repo stars
  95. Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation [Publish] GitHub Repo stars
  96. MoPEQ: Mixture of Mixed Precision Quantized Experts [Publish] GitHub Repo stars
  97. Mosaic: Composite Projection Pruning for Resource-efficient LLMs [Publish]
  98. PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference [Publish]
  99. Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs [Publish]
  100. PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention [Publish]
  101. Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving [Publish] GitHub Repo stars
  102. Pruning General Large Language Models into Customized Expert Models [Publish] GitHub Repo stars
  103. QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization [Publish]
  104. Qwen3 Technical Report [Publish] GitHub Repo stars
  105. R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration [Publish] GitHub Repo stars
  106. Radial Attention: Sparse Attention with Energy Decay for Long Video Generation [Publish]
  107. Rectified Sparse Attention [Publish] GitHub Repo stars
  108. Retrospective Sparse Attention for Efficient Long-Context Generation [Publish]
  109. RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations [Publish]
  110. SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling [Publish] GitHub Repo stars
  111. SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models [Publish] GitHub Repo stars
  112. SeerAttention-R: Sparse Attention Adaptation for Long Reasoning [Publish] GitHub Repo stars
  113. Seesaw: High-throughput LLM Inference via Model Re-sharding [Publish]
  114. SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning [Publish]
  115. Speed Always Wins: A Survey on Efficient Architectures for Large Language Models [Publish] GitHub Repo stars
  116. SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers [Publish] GitHub Repo stars
  117. Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding [Publish]
  118. Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads [Publish]
  119. The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs [Publish] GitHub Repo stars
  120. TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives [Publish] GitHub Repo stars
  121. TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference [Publish] GitHub Repo stars
  122. Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler [Publish] GitHub Repo stars
  123. Unveiling Super Experts in Mixture-of-Experts Large Language Models [Publish] GitHub Repo stars
  124. Attention-Gym: Triton-Based Sparse and Quantization Attention [Publish] GitHub Repo stars
  125. DeepEP: an efficient expert-parallel communication library [Publish] GitHub Repo stars
  126. Unified KV Cache Compression Methods for Auto-Regressive Models [Publish] GitHub Repo stars
  127. kvpress: LLM KV cache compression made easy [Publish] GitHub Repo stars

2024

  1. Fluctuation-based Adaptive Structured Pruning for Large Language Models [Publish] GitHub Repo stars
  2. ∞Bench: Extending Long Context Evaluation Beyond 100K Tokens [Publish] GitHub Repo stars
  3. ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition [Publish] GitHub Repo stars
  4. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding [Publish]
  5. Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning [Publish]
  6. T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives [Publish]
  7. Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention [Publish]
  8. A novel CUTLASS-based implementation of Tensor Parallelism for NVLink-enabled systems [Publish] GitHub Repo stars
  9. [Distributed w/ TorchTitan] Introducing Async Tensor Parallelism in PyTorch [Publish] GitHub Repo stars
  10. CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models [Publish] GitHub Repo stars
  11. Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption [Publish] GitHub Repo stars
  12. SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference [Publish]
  13. Post-Training Statistical Calibration for Higher Activation Sparsity [Publish] GitHub Repo stars
  14. A Simple and Effective Pruning Approach for Large Language Models [Publish] GitHub Repo stars
  15. Compressing LLMs: The Truth is Rarely Pure and Never Simple [Publish] GitHub Repo stars
  16. Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [Publish] GitHub Repo stars
  17. Efficient Streaming Language Models with Attention Sinks [Publish] GitHub Repo stars
  18. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning [Publish] GitHub Repo stars
  19. Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models [Publish] GitHub Repo stars
  20. QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models [Publish] GitHub Repo stars
  21. SAS: Structured Activation Sparsification [Publish] GitHub Repo stars
  22. SliceGPT: Compress Large Language Models by Deleting Rows and Columns [Publish] GitHub Repo stars
  23. ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models [Publish] GitHub Repo stars
  24. Accelerating Transformer Pre-training with 2:4 Sparsity [Publish] GitHub Repo stars
  25. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty [Publish] GitHub Repo stars
  26. FrameQuant: Flexible Low-Bit Quantization for Transformers [Publish] GitHub Repo stars
  27. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [Publish] GitHub Repo stars
  28. LoRA+: Efficient Low Rank Adaptation of Large Models [Publish] GitHub Repo stars
  29. OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization [Publish] GitHub Repo stars
  30. Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity [Publish] GitHub Repo stars
  31. Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference [Publish] GitHub Repo stars
  32. SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [Publish] GitHub Repo stars
  33. SparQ Attention: Bandwidth-Efficient LLM Inference [Publish]
  34. Sparse is Enough in Fine-tuning Pre-trained Large Language Models [Publish] GitHub Repo stars
  35. Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency [Publish] GitHub Repo stars
  36. SqueezeLLM: Dense-and-Sparse Quantization [Publish] GitHub Repo stars
  37. TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge [Publish] GitHub Repo stars
  38. Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts [Publish] GitHub Repo stars
  39. Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention [Publish] GitHub Repo stars
  40. Splitwise: Efficient generative LLM inference using phase splitting [Publish] GitHub Repo stars
  41. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [Publish] GitHub Repo stars
  42. Vidur: A Large-Scale Simulation Framework For LLM Inference [Publish] GitHub Repo stars
  43. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [Publish] GitHub Repo stars
  44. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention [Publish] GitHub Repo stars
  45. MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models [Publish] GitHub Repo stars
  46. SGLang: Efficient Execution of Structured Language Model Programs [Publish] GitHub Repo stars
  47. SlimGPT: Layer-wise Structured Pruning for Large Language Models [Publish]
  48. SparseLLM: Towards Global Pruning for Pre-trained Language Models [Publish] GitHub Repo stars
  49. ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification [Publish]
  50. Fast and Effective Weight Update for Pruned Large Language Models [Publish] GitHub Repo stars
  51. Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity [Publish] GitHub Repo stars
  52. A Survey on Efficient Inference for Large Language Models [Publish]
  53. A Survey on Inference Optimization Techniques for Mixture of Experts Models [Publish] GitHub Repo stars
  54. A Survey on Large Language Model Acceleration based on KV Cache Management [Publish] GitHub Repo stars
  55. APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving [Publish] GitHub Repo stars
  56. AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis [Publish]
  57. Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference [Publish] GitHub Repo stars
  58. Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs [Publish]
  59. Beyond KV Caching: Shared Attention for Efficient LLMs [Publish] GitHub Repo stars
  60. Compact Language Models via Pruning and Knowledge Distillation [Publish] GitHub Repo stars
  61. CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation [Publish] GitHub Repo stars
  62. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model [Publish] GitHub Repo stars
  63. DeepSeek-V3 Technical Report [Publish] GitHub Repo stars
  64. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models [Publish] GitHub Repo stars
  65. Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping [Publish] GitHub Repo stars
  66. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads [Publish] GitHub Repo stars
  67. Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [Publish] GitHub Repo stars
  68. Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes [Publish] GitHub Repo stars
  69. FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion [Publish] GitHub Repo stars
  70. FlashMask: Efficient and Rich Mask Extension of FlashAttention [Publish] GitHub Repo stars
  71. GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM [Publish] GitHub Repo stars
  72. Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache [Publish]
  73. L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ [Publish]
  74. LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning [Publish]
  75. LLM Inference Serving: Survey of Recent Advances and Opportunities [Publish]
  76. LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference [Publish]
  77. Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models [Publish] GitHub Repo stars
  78. Massive Activations in Large Language Models [Publish] GitHub Repo stars
  79. MiniCache: KV Cache Compression in Depth Dimension for Large Language Models [Publish] GitHub Repo stars
  80. MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache [Publish] GitHub Repo stars
  81. Mixture-of-Depths: Dynamically allocating compute in transformer-based language models [Publish]
  82. MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression [Publish] GitHub Repo stars
  83. Multi-matrix Factorization Attention [Publish]
  84. No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization [Publish]
  85. Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification [Publish] Pytorch
  86. Post-Training Sparse Attention with Double Sparsity [Publish] GitHub Repo stars
  87. PowerInfer-2: Fast Large Language Model Inference on a Smartphone [Publish] Website
  88. PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization [Publish] GitHub Repo stars
  89. ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models [Publish] GitHub Repo stars
  90. Q-Sparse: All Large Language Models can be Fully Sparsely-Activated [Publish]
  91. QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving [Publish] Pytorch
  92. ReLU^2 Wins: Discovering Efficient Activation Functions for Sparse LLMs [Publish]
  93. ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing [Publish] GitHub Repo stars
  94. Recycled Attention: Efficient inference for long-context language models [Publish] GitHub Repo stars
  95. Reducing Transformer Key-Value Cache Size with Cross-Layer Attention [Publish] GitHub Repo stars
  96. Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [Publish] GitHub Repo stars
  97. SCBench: A KV Cache-Centric Analysis of Long-Context Methods [Publish] GitHub Repo stars
  98. SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization [Publish] GitHub Repo stars
  99. SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration [Publish] GitHub Repo stars
  100. SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention [Publish]
  101. SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs [Publish] GitHub Repo stars
  102. ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models [Publish] GitHub Repo stars
  103. SnapKV: LLM Knows What You are Looking for Before Generation [Publish] GitHub Repo stars
  104. Transformers are Multi-State RNNs [Publish] GitHub Repo stars
  105. Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters [Publish] Pytorch
  106. XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models [Publish] GitHub Repo stars
  107. ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty [Publish]
  108. ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification [Publish]

2023

  1. Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences [Publish] GitHub Repo stars
  2. Gradient-based Intra-attention Pruning on Pre-trained Language Models [Publish] GitHub Repo stars
  3. Pruning Pre-trained Language Models Without Fine-Tuning [Publish] GitHub Repo stars
  4. Pruning Pre-trained Language Models with Principled Importance and Self-regularization [Publish] GitHub Repo stars
  5. Structured Pruning for Efficient Generative Pre-trained Language Models [Publish]
  6. Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models [Publish]
  7. Structural Pruning of Large Language Models via Neural Architecture Search [Publish] GitHub Repo stars
  8. SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer [Publish] GitHub Repo stars
  9. TorchSparse++: Efficient Point Cloud Engine [Publish] GitHub Repo stars
  10. AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning [Publish] GitHub Repo stars
  11. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers [Publish] GitHub Repo stars
  12. Minimum Variance Unbiased N:M Sparsity for the Neural Gradients [Publish]
  13. The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [Publish]
  14. Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time [Publish] GitHub Repo stars
  15. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot [Publish] GitHub Repo stars
  16. Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation [Publish] GitHub Repo stars
  17. Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning [Publish] GitHub Repo stars
  18. ZipLM: Inference-Aware Structured Pruning of Language Models [Publish] GitHub Repo stars
  19. VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores [Publish] GitHub Repo stars
  20. Efficient Memory Management for Large Language Model Serving with PagedAttention [Publish] GitHub Repo stars
  21. Efficient Methods for Natural Language Processing: A Survey [Publish]
  22. SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models [Publish]
  23. A Survey on Evaluation of Large Language Models [Publish]
  24. A Survey on Model Compression for Large Language Models [Publish]
  25. Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models [Publish] GitHub Repo stars
  26. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X [Publish] GitHub Repo stars
  27. Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models [Publish] GitHub Repo stars
  28. Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers [Publish] GitHub Repo stars
  29. Efficient Guided Generation for Large Language Models [Publish]
  30. Fine-Tuning Language Models with Just Forward Passes [Publish] GitHub Repo stars
  31. Flash-Decoding for long-context inference [Publish]
  32. Gradient-Free Structured Pruning with Unlabeled Data [Publish]
  33. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models [Publish] GitHub Repo stars
  34. Knowledge-preserving Pruning for Pre-trained Language Models without Retraining [Publish]
  35. LLM in a flash: Efficient Large Language Model Inference with Limited Memory [Publish]
  36. LLM-Pruner: On the Structural Pruning of Large Language Models [Publish] GitHub Repo stars
  37. LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery [Publish]
  38. LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models [Publish] GitHub Repo stars
  39. OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [Publish] GitHub Repo stars
  40. Post-training Quantization for Neural Networks with Provable Guarantees [Publish] GitHub Repo stars
  41. PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU [Publish] GitHub Repo stars
  42. Pruning Large Language Models via Accuracy Predictor [Publish]
  43. QLoRA: Efficient Finetuning of Quantized LLMs [Publish] GitHub Repo stars
  44. QuIP: Quantization with Incoherence Processing [Publish] GitHub Repo stars
  45. RPTQ: Reorder-based Post-training Quantization for Large Language Models [Publish] GitHub Repo stars
  46. Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [Publish] GitHub Repo stars
  47. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression [Publish] GitHub Repo stars
  48. Sparse Fine-tuning for Inference Acceleration of Large Language Models [Publish] GitHub Repo stars
  49. Sparse Iso-FLOP Transformations for Maximizing Training Efficiency [Publish] GitHub Repo stars
  50. Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging [Publish] GitHub Repo stars
  51. Ten Lessons We Have Learned in the New Sparseland: A Short Handbook for Sparse Neural Network Researchers [Publish]
  52. The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter [Publish] GitHub Repo stars
  53. Training Transformers with 4-bit Integers [Publish] GitHub Repo stars
  54. Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering [Publish] GitHub Repo stars
  55. ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation [Publish] GitHub Repo stars
  56. FasterTransformer [Publish] GitHub Repo stars

2022

  1. Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm [Publish] GitHub Repo stars
  2. TextPruner: A Model Pruning Toolkit for Pre-Trained Language Models [Publish] GitHub Repo stars
  3. Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads [Publish] GitHub Repo stars
  4. Creating Sparse GPT-3 Models with Iterative Pruning [Publish]
  5. LoRA: Low-rank adaptation of large language models [Publish] GitHub Repo stars
  6. SPDY: Accurate Pruning with Speedup Guarantees [Publish] GitHub Repo stars
  7. Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation [Publish]
  8. A Fast Post-Training Pruning Framework for Transformers [Publish] GitHub Repo stars
  9. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [Publish] GitHub Repo stars
  10. Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning [Publish] GitHub Repo stars
  11. ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers [Publish] GitHub Repo stars
  12. Two Sparsities Are Better Than One: Unlocking the Performance Benefits of Sparse-Sparse Networks [Publish]
  13. Transformer Acceleration with Dynamic Sparse Attention [Publish]
  14. An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers [Publish]
  15. The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models [Publish] GitHub Repo stars

2021

  1. Post-training deep neural network pruning via layer-wise calibration [Publish]
  2. BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction [Publish] GitHub Repo stars
  3. Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [Publish] GitHub Repo stars
  4. A Greedy Algorithm for Quantizing Neural Networks [Publish] GitHub Repo stars
  5. Channel Permutations for N:M Sparsity [Publish] GitHub Repo stars
  6. Accelerating Sparse Deep Neural Networks [Publish]
  7. Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks [Publish]

2020

  1. Fast Sparse ConvNets [Publish] GitHub Repo stars
  2. Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference [Publish]
  3. Movement Pruning: Adaptive Sparsity by Fine-Tuning [Publish] GitHub Repo stars
  4. GPU Kernels for Block-Sparse Weights [Publish] GitHub Repo stars

2019

  1. ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training [Publish] GitHub Repo stars

2018

  1. A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers [Publish] GitHub Repo stars

2017

  1. DSD: Dense-Sparse-Dense Training for Deep Neural Networks [Publish]
  2. Attention Is All You Need [Publish]
  3. Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon [Publish] GitHub Repo stars

2016

  1. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding [Publish]

1993

  1. Optimal Brain Surgeon and general network pruning [Publish]

1989

  1. Optimal Brain Damage [Publish]

References

  1. https://github.com/Xnhyacinth/Awesome-LLM-Long-Context-Modeling [GitHub Repo stars]

  2. https://github.com/weigao266/Awesome-Efficient-Arch [GitHub Repo stars]

  3. https://github.com/horseee/Awesome-Efficient-LLM [GitHub Repo stars]

  4. https://github.com/DefTruth/Awesome-Diffusion-Inference [GitHub Repo stars]

  5. https://github.com/DefTruth/Awesome-LLM-Inference [GitHub Repo stars]

  6. https://github.com/AmberLJC/LLMSys-PaperList [GitHub Repo stars]

  7. https://github.com/Hannibal046/Awesome-LLM [GitHub Repo stars]

  8. https://github.com/AmadeusChan/Awesome-LLM-System-Papers [GitHub Repo stars]

  9. https://github.com/KnowingNothing/compiler-and-arch [GitHub Repo stars]

  10. https://papercopilot.com/paper-list

  11. https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management [GitHub Repo stars]

  12. https://github.com/October2001/Awesome-KV-Cache-Compression [GitHub Repo stars]

  13. https://github.com/he-y/Awesome-Pruning [GitHub Repo stars]

  14. https://github.com/htqin/awesome-model-quantization [GitHub Repo stars]

  15. https://github.com/csyhhu/Awesome-Deep-Neural-Network-Compression [GitHub Repo stars]

  16. https://github.com/AojunZhou/Efficient-Deep-Learning [GitHub Repo stars]

  17. https://github.com/chester256/Model-Compression-Papers [GitHub Repo stars]