EfficientPaper

A paper list covering pruning, quantization, and efficient inference/training.

Table of Contents

  1. Getting Started
  2. Paper List
  3. References

Getting Started

  1. Clone the repository: git clone https://github.com/hustzxd/EfficientPaper
  2. Install the dependencies: pip install protobuf==5.27.2 pandas arxiv
  3. Add paper information with ./add_paper_info.sh
  4. Regenerate the README with ./refresh_readme.sh

Each paper is described by a .prototxt entry (e.g., efficient_paper.prototxt):

paper {
  title: "EfficientPaper: manage your research papers in an efficient way."
  abbr: "EfficientPaper"
  url: "https://github.com/hustzxd/EfficientPaper"
  authors: "hustzxd"
}
pub {
  where: "GitHub"
  year: 2023
}
code {
  type: "Pytorch"
  url: "https://github.com/hustzxd/EfficientPaper"
}
note {
  url: "EfficientPaper.md"
}
keyword {
  words: efficient_paper
}
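
The helper scripts read these entries via protobuf's text format. As a rough illustration (not the repository's actual code), the sketch below loads one entry in Python; the generated module name efficient_paper_pb2 and the message name PaperInfo are assumptions for illustration, and the real schema lives in the repository's .proto file.

# Minimal sketch (assumed names, not the repository's actual code):
# load one paper entry from a .prototxt file with protobuf's text format.
# Assumption: a schema compiled by protoc into `efficient_paper_pb2`,
# exposing a `PaperInfo` message with paper/pub/code/note/keyword fields.
from google.protobuf import text_format

import efficient_paper_pb2  # hypothetical generated module


def load_paper(path: str):
    """Parse a single .prototxt entry into a PaperInfo message."""
    msg = efficient_paper_pb2.PaperInfo()
    with open(path, "r", encoding="utf-8") as f:
        text_format.Parse(f.read(), msg)
    return msg


if __name__ == "__main__":
    info = load_paper("efficient_paper.prototxt")
    print(info.paper.title, info.pub.year)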

Paper List

year

2025

  1. AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference [Publish] GitHub Repo stars
  2. Pruning Large Language Models with Semi-Structural Adaptive Sparse Training [Publish] GitHub Repo stars
  3. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention [Publish]
  4. COMET: Towards Practical W4A4KV4 LLMs Serving [Publish]
  5. POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference [Publish] GitHub Repo stars
  6. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention [Publish] GitHub Repo stars
  7. BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity [Publish] GitHub Repo stars
  8. KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs [Publish]
  9. Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism [Publish]
  10. CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion [Publish] GitHub Repo stars
  11. FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference [Publish] GitHub Repo stars
  12. Forgetting Transformer: Softmax Attention with a Forget Gate [Publish] GitHub Repo stars
  13. R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [Publish] GitHub Repo stars
  14. ReAttention: Training-Free Infinite Context with Finite Attention Scope [Publish] GitHub Repo stars
  15. Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA [Publish]
  16. Training-Free Activation Sparsity in Large Language Models [Publish] GitHub Repo stars
  17. AdaSplash: Adaptive Sparse Flash Attention [Publish] GitHub Repo stars
  18. BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation [Publish]
  19. CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration [Publish]
  20. Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity [Publish]
  21. HashAttention: Semantic Sparsity for Faster Inference [Publish] GitHub Repo stars
  22. La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation [Publish]
  23. MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention [Publish]
  24. ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference [Publish] GitHub Repo stars
  25. SlimLLM: Accurate Structured Pruning for Large Language Models [Publish]
  26. SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference [Publish] GitHub Repo stars
  27. Sparsing Law: Towards Large Language Models with Greater Activation Sparsity [Publish] GitHub Repo stars
  28. Star Attention: Efficient LLM Inference over Long Sequences [Publish] GitHub Repo stars
  29. XAttention: Block Sparse Attention with Antidiagonal Scoring [Publish] GitHub Repo stars
  30. TorchAO: PyTorch-Native Training-to-Serving Model Optimization [Publish] GitHub Repo stars
  31. AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs [Publish]
  32. SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting [Publish] GitHub Repo stars
  33. Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models [Publish]
  34. NanoFlow: Towards Optimal Large Language Model Serving Throughput [Publish] GitHub Repo stars
  35. A Simple Linear Patch Revives Layer-Pruned Large Language Models [Publish]
  36. Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores [Publish]
  37. Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching [Publish]
  38. Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing [Publish]
  39. Adaptive Computation Pruning for the Forgetting Transformer [Publish] GitHub Repo stars
  40. Adaptive Layer-skipping in Pre-trained LLMs [Publish]
  41. AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models [Publish]
  42. Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models [Publish]
  43. AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference [Publish]
  44. CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs [Publish]
  45. Characterizing Communication Patterns in Distributed Large Language Model Inference [Publish]
  46. Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications [Publish]
  47. ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference [Publish]
  48. Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts [Publish] GitHub Repo stars
  49. DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [Publish]
  50. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Publish] GitHub Repo stars
  51. Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction [Publish]
  52. DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference [Publish]
  53. Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need [Publish]
  54. Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity [Publish]
  55. Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs [Publish] GitHub Repo stars
  56. Fast and Simplex: 2-Simplicial Attention in Triton [Publish]
  57. FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation [Publish] GitHub Repo stars
  58. FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving [Publish] GitHub Repo stars
  59. FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation [Publish] GitHub Repo stars
  60. FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension [Publish]
  61. HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference [Publish] GitHub Repo stars
  62. HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs [Publish]
  63. Hardware-Efficient Attention for Fast Decoding [Publish] GitHub Repo stars
  64. Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding [Publish]
  65. Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation [Publish] GitHub Repo stars
  66. Instruction-Following Pruning for Large Language Models [Publish]
  67. KV Cache Compression for Inference Efficiency in LLMs: A Review [Publish]
  68. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse [Publish] GitHub Repo stars
  69. KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference [Publish]
  70. LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention [Publish] GitHub Repo stars
  71. LeanK: Learnable K Cache Channel Pruning for Efficient Decoding [Publish] GitHub Repo stars
  72. MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving [Publish]
  73. MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production [Publish]
  74. MiniCPM4: Ultra-Efficient LLMs on End Devices [Publish] GitHub Repo stars
  75. Mixture of Experts in Large Language Models [Publish]
  76. Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing [Publish] GitHub Repo stars
  77. Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation [Publish] GitHub Repo stars
  78. MoBA: Mixture of Block Attention for Long-Context LLMs [Publish] GitHub Repo stars
  79. Mosaic: Composite Projection Pruning for Resource-efficient LLMs [Publish]
  80. Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs [Publish]
  81. PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention [Publish]
  82. Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving [Publish] GitHub Repo stars
  83. Pruning General Large Language Models into Customized Expert Models [Publish] GitHub Repo stars
  84. QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization [Publish]
  85. Qwen3 Technical Report [Publish] GitHub Repo stars
  86. R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration [Publish] GitHub Repo stars
  87. Radial Attention: Sparse Attention with Energy Decay for Long Video Generation [Publish]
  88. Rectified Sparse Attention [Publish] GitHub Repo stars
  89. Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving [Publish] GitHub Repo stars
  90. SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling [Publish] GitHub Repo stars
  91. SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models [Publish] GitHub Repo stars
  92. SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training [Publish] GitHub Repo stars
  93. SeerAttention-R: Sparse Attention Adaptation for Long Reasoning [Publish] GitHub Repo stars
  94. Seesaw: High-throughput LLM Inference via Model Re-sharding [Publish]
  95. Speed Always Wins: A Survey on Efficient Architectures for Large Language Models [Publish] GitHub Repo stars
  96. SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers [Publish] GitHub Repo stars
  97. Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding [Publish]
  98. Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads [Publish]
  99. The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs [Publish] GitHub Repo stars
  100. TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives [Publish] GitHub Repo stars
  101. TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference [Publish] GitHub Repo stars
  102. Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler [Publish] GitHub Repo stars
  103. Unveiling Super Experts in Mixture-of-Experts Large Language Models [Publish] GitHub Repo stars
  104. Attention-Gym: Triton-Based Sparse and Quantization Attention [Publish] GitHub Repo stars
  105. Unified KV Cache Compression Methods for Auto-Regressive Models [Publish] GitHub Repo stars
  106. kvpress: LLM KV cache compression made easy [Publish] GitHub Repo stars

2024

  1. Fluctuation-based Adaptive Structured Pruning for Large Language Models [Publish] GitHub Repo stars
  2. ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition [Publish] GitHub Repo stars
  3. Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning [Publish]
  4. T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives [Publish]
  5. Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention [Publish]
  6. A novel CUTLASS-based implementation of Tensor Parallelism for NVLink-enabled systems [Publish] GitHub Repo stars
  7. CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models [Publish] GitHub Repo stars
  8. Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption [Publish] GitHub Repo stars
  9. SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference [Publish]
  10. Post-Training Statistical Calibration for Higher Activation Sparsity [Publish] GitHub Repo stars
  11. A Simple and Effective Pruning Approach for Large Language Models [Publish] GitHub Repo stars
  12. Compressing LLMs: The Truth is Rarely Pure and Never Simple [Publish] GitHub Repo stars
  13. Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [Publish] GitHub Repo stars
  14. Efficient Streaming Language Models with Attention Sinks [Publish] GitHub Repo stars
  15. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning [Publish] GitHub Repo stars
  16. Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models [Publish] GitHub Repo stars
  17. QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models [Publish] GitHub Repo stars
  18. SAS: Structured Activation Sparsification [Publish] GitHub Repo stars
  19. SliceGPT: Compress Large Language Models by Deleting Rows and Columns [Publish] GitHub Repo stars
  20. ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models [Publish] GitHub Repo stars
  21. Accelerating Transformer Pre-training with 2:4 Sparsity [Publish] GitHub Repo stars
  22. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty [Publish] GitHub Repo stars
  23. FrameQuant: Flexible Low-Bit Quantization for Transformers [Publish] GitHub Repo stars
  24. LoRA+: Efficient Low Rank Adaptation of Large Models [Publish] GitHub Repo stars
  25. OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization [Publish] GitHub Repo stars
  26. Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity [Publish] GitHub Repo stars
  27. Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference [Publish] GitHub Repo stars
  28. SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [Publish] GitHub Repo stars
  29. SparQ Attention: Bandwidth-Efficient LLM Inference [Publish]
  30. Sparse is Enough in Fine-tuning Pre-trained Large Language Models [Publish] GitHub Repo stars
  31. Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency [Publish] GitHub Repo stars
  32. SqueezeLLM: Dense-and-Sparse Quantization [Publish] GitHub Repo stars
  33. TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge [Publish] GitHub Repo stars
  34. Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts [Publish] GitHub Repo stars
  35. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [Publish] GitHub Repo stars
  36. Vidur: A Large-Scale Simulation Framework For LLM Inference [Publish] GitHub Repo stars
  37. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention [Publish] GitHub Repo stars
  38. MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models [Publish] GitHub Repo stars
  39. SGLang: Efficient Execution of Structured Language Model Programs [Publish] GitHub Repo stars
  40. SlimGPT: Layer-wise Structured Pruning for Large Language Models [Publish]
  41. SparseLLM: Towards Global Pruning for Pre-trained Language Models [Publish] GitHub Repo stars
  42. Fast and Effective Weight Update for Pruned Large Language Models [Publish] GitHub Repo stars
  43. Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity [Publish] GitHub Repo stars
  44. A Survey on Efficient Inference for Large Language Models [Publish]
  45. A Survey on Inference Optimization Techniques for Mixture of Experts Models [Publish] GitHub Repo stars
  46. A Survey on Large Language Model Acceleration based on KV Cache Management [Publish] GitHub Repo stars
  47. APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving [Publish] GitHub Repo stars
  48. AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis [Publish]
  49. Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference [Publish] GitHub Repo stars
  50. Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs [Publish]
  51. Beyond KV Caching: Shared Attention for Efficient LLMs [Publish] GitHub Repo stars
  52. Compact Language Models via Pruning and Knowledge Distillation [Publish] GitHub Repo stars
  53. CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation [Publish] GitHub Repo stars
  54. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model [Publish] GitHub Repo stars
  55. DeepSeek-V3 Technical Report [Publish] GitHub Repo stars
  56. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models [Publish] GitHub Repo stars
  57. Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping [Publish] GitHub Repo stars
  58. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads [Publish] GitHub Repo stars
  59. Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [Publish] GitHub Repo stars
  60. Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes [Publish] GitHub Repo stars
  61. FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion [Publish]
  62. FlashMask: Efficient and Rich Mask Extension of FlashAttention [Publish] GitHub Repo stars
  63. Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache [Publish]
  64. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [Publish] GitHub Repo stars
  65. L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ [Publish]
  66. LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning [Publish]
  67. LLM Inference Serving: Survey of Recent Advances and Opportunities [Publish]
  68. Massive Activations in Large Language Models [Publish] GitHub Repo stars
  69. MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache [Publish]
  70. Mixture-of-Depths: Dynamically allocating compute in transformer-based language models [Publish]
  71. MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression [Publish] GitHub Repo stars
  72. Multi-matrix Factorization Attention [Publish]
  73. Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification [Publish] Pytorch
  74. Post-Training Sparse Attention with Double Sparsity [Publish] GitHub Repo stars
  75. PowerInfer-2: Fast Large Language Model Inference on a Smartphone [Publish] Website
  76. ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models [Publish] GitHub Repo stars
  77. Q-Sparse: All Large Language Models can be Fully Sparsely-Activated [Publish]
  78. QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving [Publish] Pytorch
  79. ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs [Publish]
  80. ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing [Publish] GitHub Repo stars
  81. Recycled Attention: Efficient inference for long-context language models [Publish] GitHub Repo stars
  82. Reducing Transformer Key-Value Cache Size with Cross-Layer Attention [Publish] GitHub Repo stars
  83. Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [Publish] GitHub Repo stars
  84. SCBench: A KV Cache-Centric Analysis of Long-Context Methods [Publish] GitHub Repo stars
  85. SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization [Publish] GitHub Repo stars
  86. SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration [Publish] GitHub Repo stars
  87. SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention [Publish]
  88. SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs [Publish] GitHub Repo stars
  89. ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models [Publish] GitHub Repo stars
  90. SnapKV: LLM Knows What You are Looking for Before Generation [Publish] GitHub Repo stars
  91. Transformers are Multi-State RNNs [Publish] GitHub Repo stars
  92. Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters [Publish] Pytorch
  93. XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models [Publish] GitHub Repo stars
  94. ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty [Publish]
  95. [Distributed w/ TorchTitan] Introducing Async Tensor Parallelism in PyTorch [Publish] GitHub Repo stars

2023

  1. Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences [Publish] GitHub Repo stars
  2. Gradient-based Intra-attention Pruning on Pre-trained Language Models [Publish] GitHub Repo stars
  3. Pruning Pre-trained Language Models Without Fine-Tuning [Publish] GitHub Repo stars
  4. Pruning Pre-trained Language Models with Principled Importance and Self-regularization [Publish] GitHub Repo stars
  5. Structured Pruning for Efficient Generative Pre-trained Language Models [Publish]
  6. Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models [Publish]
  7. Structural Pruning of Large Language Models via Neural Architecture Search [Publish] GitHub Repo stars
  8. SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer [Publish] GitHub Repo stars
  9. TorchSparse++: Efficient Point Cloud Engine [Publish] GitHub Repo stars
  10. AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning [Publish] GitHub Repo stars
  11. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers [Publish] GitHub Repo stars
  12. Minimum Variance Unbiased N:M Sparsity for the Neural Gradients [Publish]
  13. The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [Publish]
  14. Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time [Publish] GitHub Repo stars
  15. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot [Publish] GitHub Repo stars
  16. Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation [Publish] GitHub Repo stars
  17. Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning [Publish] GitHub Repo stars
  18. ZipLM: Inference-Aware Structured Pruning of Language Models [Publish] GitHub Repo stars
  19. VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores [Publish] GitHub Repo stars
  20. Efficient Memory Management for Large Language Model Serving with PagedAttention [Publish] GitHub Repo stars
  21. Efficient Methods for Natural Language Processing: A Survey [Publish]
  22. SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models [Publish]
  23. A Survey on Evaluation of Large Language Models [Publish]
  24. A Survey on Model Compression for Large Language Models [Publish]
  25. Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models [Publish] GitHub Repo stars
  26. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X [Publish] GitHub Repo stars
  27. Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models [Publish] GitHub Repo stars
  28. Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers [Publish]
  29. Efficient Guided Generation for Large Language Models [Publish]
  30. Fine-Tuning Language Models with Just Forward Passes [Publish] GitHub Repo stars
  31. Flash-Decoding for long-context inference [Publish]
  32. Gradient-Free Structured Pruning with Unlabeled Data [Publish]
  33. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models [Publish] GitHub Repo stars
  34. Knowledge-preserving Pruning for Pre-trained Language Models without Retraining [Publish]
  35. LLM in a flash: Efficient Large Language Model Inference with Limited Memory [Publish]
  36. LLM-Pruner: On the Structural Pruning of Large Language Models [Publish] GitHub Repo stars
  37. LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery [Publish]
  38. LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models [Publish] GitHub Repo stars
  39. OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [Publish] GitHub Repo stars
  40. Post-training Quantization for Neural Networks with Provable Guarantees [Publish] GitHub Repo stars
  41. PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU [Publish] GitHub Repo stars
  42. Pruning Large Language Models via Accuracy Predictor [Publish]
  43. QLoRA: Efficient Finetuning of Quantized LLMs [Publish] GitHub Repo stars
  44. QuIP: Quantization with Incoherence Processing [Publish] GitHub Repo stars
  45. RPTQ: Reorder-based Post-training Quantization for Large Language Models [Publish] GitHub Repo stars
  46. Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [Publish] GitHub Repo stars
  47. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression [Publish] GitHub Repo stars
  48. Sparse Fine-tuning for Inference Acceleration of Large Language Models [Publish] GitHub Repo stars
  49. Sparse Iso-FLOP Transformations for Maximizing Training Efficiency [Publish] GitHub Repo stars
  50. Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging [Publish] GitHub Repo stars
  51. Ten Lessons We Have Learned in the New Sparseland: A Short Handbook for Sparse Neural Network Researchers [Publish]
  52. The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter [Publish] GitHub Repo stars
  53. Training Transformers with 4-bit Integers [Publish] GitHub Repo stars
  54. Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering [Publish] GitHub Repo stars
  55. ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation [Publish] GitHub Repo stars
  56. FasterTransformer [Publish] GitHub Repo stars

2022

  1. Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm [Publish] GitHub Repo stars
  2. TextPruner: A Model Pruning Toolkit for Pre-Trained Language Models [Publish] GitHub Repo stars
  3. Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads [Publish] GitHub Repo stars
  4. Creating Sparse GPT-3 Models with Iterative Pruning [Publish]
  5. LoRA: Low-rank adaptation of large language models [Publish] GitHub Repo stars
  6. SPDY: Accurate Pruning with Speedup Guarantees [Publish] GitHub Repo stars
  7. Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation [Publish]
  8. A Fast Post-Training Pruning Framework for Transformers [Publish] GitHub Repo stars
  9. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [Publish] GitHub Repo stars
  10. Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning [Publish] GitHub Repo stars
  11. ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers [Publish] GitHub Repo stars
  12. Two Sparsities Are Better Than One: Unlocking the Performance Benefits of Sparse-Sparse Networks [Publish]
  13. Transformer Acceleration with Dynamic Sparse Attention [Publish]
  14. An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers [Publish]
  15. The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models [Publish] GitHub Repo stars

2021

  1. Post-training deep neural network pruning via layer-wise calibration [Publish]
  2. BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction [Publish] GitHub Repo stars
  3. Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [Publish] GitHub Repo stars
  4. A Greedy Algorithm for Quantizing Neural Networks [Publish] GitHub Repo stars
  5. Channel Permutations for N:M Sparsity [Publish] GitHub Repo stars
  6. Accelerating Sparse Deep Neural Networks [Publish]
  7. Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks [Publish]

2020

  1. Fast Sparse ConvNets [Publish] GitHub Repo stars
  2. Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference [Publish]
  3. Movement Pruning: Adaptive Sparsity by Fine-Tuning [Publish] GitHub Repo stars
  4. GPU Kernels for Block-Sparse Weights [Publish] GitHub Repo stars

2019

  1. ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training [Publish] GitHub Repo stars

2018

  1. A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers [Publish] GitHub Repo stars

2017

  1. DSD: Dense-Sparse-Dense Training for Deep Neural Networks [Publish]
  2. Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon [Publish] GitHub Repo stars

2016

  1. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding [Publish]

1993

  1. Optimal Brain Surgeon and general network pruning [Publish]

1989

  1. Optimal Brain Damage [Publish]

References

Hot

  1. https://github.com/horseee/Awesome-Efficient-LLM
  2. https://github.com/DefTruth/Awesome-Diffusion-Inference
  3. https://github.com/DefTruth/Awesome-LLM-Inference
  4. https://github.com/AmberLJC/LLMSys-PaperList
  5. https://github.com/Hannibal046/Awesome-LLM
  6. https://github.com/AmadeusChan/Awesome-LLM-System-Papers
  7. https://github.com/KnowingNothing/compiler-and-arch
  8. https://papercopilot.com/paper-list
  9. https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management
  10. https://github.com/October2001/Awesome-KV-Cache-Compression

Cold

  1. https://github.com/he-y/Awesome-Pruning
  2. https://github.com/htqin/awesome-model-quantization
  3. https://github.com/csyhhu/Awesome-Deep-Neural-Network-Compression
  4. https://github.com/AojunZhou/Efficient-Deep-Learning
  5. https://github.com/chester256/Model-Compression-Papers