EfficientPaper
A curated paper list on pruning, quantization, and efficient inference/training.
Table of Contents
- Getting Started
- Paper List
- References
Getting Started
- Clone the repository and install the dependencies:
git clone https://github.com/hustzxd/EfficientPaper
pip install protobuf==5.27.2 pandas arxiv
- Add paper information by running:
./add_paper_info.sh
- Regenerate the README by running:
./refresh_readme.sh
efficient_paper.prototxt
paper {
  title: "EfficientPaper: manage your research papers in an efficient way."
  abbr: "EfficientPaper"
  url: "https://github.com/hustzxd/EfficientPaper"
  authors: "hustzxd"
}
pub {
  where: "GitHub"
  year: 2023
}
code {
  type: "Pytorch"
  url: "https://github.com/hustzxd/EfficientPaper"
}
note {
  url: "EfficientPaper.md"
}
keyword {
  words: efficient_paper
}
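
Because the arxiv package is among the dependencies, a small helper can pre-fill an entry like the one above from an arXiv ID. The sketch below is only illustrative: to_prototxt is a hypothetical helper, the field names simply mirror the example entry, and the repository's actual proto schema and add_paper_info.sh workflow may differ.

import arxiv


def to_prototxt(arxiv_id: str) -> str:
    """Fetch arXiv metadata and format it as a prototxt-style entry (sketch only)."""
    client = arxiv.Client()
    paper = next(client.results(arxiv.Search(id_list=[arxiv_id])))
    authors = "\n".join(f'  authors: "{a.name}"' for a in paper.authors)
    return (
        "paper {\n"
        f'  title: "{paper.title}"\n'
        f'  url: "{paper.entry_id}"\n'
        f"{authors}\n"
        "}\n"
        "pub {\n"
        '  where: "arXiv"\n'
        f"  year: {paper.published.year}\n"
        "}"
    )


if __name__ == "__main__":
    # Replace with the arXiv ID of the paper you want to add.
    print(to_prototxt("2306.11695"))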
Paper List
year
2025
- AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference
- Pruning Large Language Models with Semi-Structural Adaptive Sparse Training
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- COMET: Towards Practical W4A4KV4 LLMs Serving
- POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
- BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
- KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs
- Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
- FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
- Forgetting Transformer: Softmax Attention with a Forget Gate
- R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference
- ReAttention: Training-Free Infinite Context with Finite Attention Scope
- Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA
- Training-Free Activation Sparsity in Large Language Models
- AdaSplash: Adaptive Sparse Flash Attention
- BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation
- CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration
- Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity
- HashAttention: Semantic Sparsity for Faster Inference
- La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation
- MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
- SlimLLM: Accurate Structured Pruning for Large Language Models
- SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
- Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
- Star Attention: Efficient LLM Inference over Long Sequences
- XAttention: Block Sparse Attention with Antidiagonal Scoring
- TorchAO: PyTorch-Native Training-to-Serving Model Optimization
- AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs
- SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting
- Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models
- NanoFlow: Towards Optimal Large Language Model Serving Throughput
- A Simple Linear Patch Revives Layer-Pruned Large Language Models
- Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores
- Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching
- Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing
- Adaptive Computation Pruning for the Forgetting Transformer
- Adaptive Layer-skipping in Pre-trained LLMs
- AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models
- Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models
- AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference
- CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs
- Characterizing Communication Patterns in Distributed Large Language Model Inference
- Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications
- ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
- Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
- DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction
- DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference
- Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need
- Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity
- Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs
- Fast and Simplex: 2-Simplicial Attention in Triton
- FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
- FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
- FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation
- FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension
- HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference
- HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
- Hardware-Efficient Attention for Fast Decoding
- Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding
- Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation
- Instruction-Following Pruning for Large Language Models
- KV Cache Compression for Inference Efficiency in LLMs: A Review
- KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse
- KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference
- LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
- LeanK: Learnable K Cache Channel Pruning for Efficient Decoding
- MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving
- MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production
- MiniCPM4: Ultra-Efficient LLMs on End Devices
- Mixture of Experts in Large Language Models
- Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
- Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
- MoBA: Mixture of Block Attention for Long-Context LLMs
- Mosaic: Composite Projection Pruning for Resource-efficient LLMs
- Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs
- PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention
- Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving
- Pruning General Large Language Models into Customized Expert Models
- QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization
- Qwen3 Technical Report
- R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration
- Radial Attention: Sparse Attention with Energy Decay for Long Video Generation
- Rectified Sparse Attention
- Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving
- SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling
- SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models
- SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
- SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
- Seesaw: High-throughput LLM Inference via Model Re-sharding
- Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
- SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers
- Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding
- Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads
- The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
- TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
- TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
- Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler
- Unveiling Super Experts in Mixture-of-Experts Large Language Models
- Attention-Gym: Triton-Based Sparse and Quantization Attention
- Unified KV Cache Compression Methods for Auto-Regressive Models
- kvpress: LLM KV cache compression made easy
2024
- Fluctuation-based Adaptive Structured Pruning for Large Language Models
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
- Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning
- T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
- Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
- A novel CUTLASS-based implementation of Tensor Parallelism for NVLink-enabled systems
- CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models
- Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption
- SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference
- Post-Training Statistical Calibration for Higher Activation Sparsity
- A Simple and Effective Pruning Approach for Large Language Models
- Compressing LLMs: The Truth is Rarely Pure and Never Simple
- Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs
- Efficient Streaming Language Models with Attention Sinks
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models
- QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
- SAS: Structured Activation Sparsification
- SliceGPT: Compress Large Language Models by Deleting Rows and Columns
- ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
- Accelerating Transformer Pre-training with 2:4 Sparsity
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- FrameQuant: Flexible Low-Bit Quantization for Transformers
- LoRA+: Efficient Low Rank Adaptation of Large Models
- OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization
- Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
- Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
- SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models
- SparQ Attention: Bandwidth-Efficient LLM Inference
- Sparse is Enough in Fine-tuning Pre-trained Large Language Models
- Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency
- SqueezeLLM: Dense-and-Sparse Quantization
- TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge
- Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- Vidur: A Large-Scale Simulation Framework For LLM Inference
- MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
- MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models
- SGLang: Efficient Execution of Structured Language Model Programs
- SlimGPT: Layer-wise Structured Pruning for Large Language Models
- SparseLLM: Towards Global Pruning for Pre-trained Language Models
- Fast and Effective Weight Update for Pruned Large Language Models
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
- A Survey on Efficient Inference for Large Language Models
- A Survey on Inference Optimization Techniques for Mixture of Experts Models
- A Survey on Large Language Model Acceleration based on KV Cache Management
- APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving
- AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis
- Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
- Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs
- Beyond KV Caching: Shared Attention for Efficient LLMs
- Compact Language Models via Pruning and Knowledge Distillation
- CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
- DeepSeek-V3 Technical Report
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
- Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping
- DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment
- Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes
- FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion
- FlashMask: Efficient and Rich Mask Extension of FlashAttention
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
- L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ
- LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning
- LLM Inference Serving: Survey of Recent Advances and Opportunities
- Massive Activations in Large Language Models
- MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
- Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
- MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression
- Multi-matrix Factorization Attention
- Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification (Pytorch)
- Post-Training Sparse Attention with Double Sparsity
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone (Website)
- ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models
- Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving (Pytorch)
- ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs
- ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
- Recycled Attention: Efficient inference for long-context language models
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
- Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark
- SCBench: A KV Cache-Centric Analysis of Long-Context Methods
- SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization
- SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
- SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
- ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models
- SnapKV: LLM Knows What You are Looking for Before Generation
- Transformers are Multi-State RNNs
- Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters (Pytorch)
- XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models
- ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty
- [Distributed w/ TorchTitan] Introducing Async Tensor Parallelism in PyTorch
2023
- Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences
- Gradient-based Intra-attention Pruning on Pre-trained Language Models
- Pruning Pre-trained Language Models Without Fine-Tuning
- Pruning Pre-trained Language Models with Principled Importance and Self-regularization
- Structured Pruning for Efficient Generative Pre-trained Language Models
- Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models
- Structural Pruning of Large Language Models via Neural Architecture Search
- SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer
- TorchSparse++: Efficient Point Cloud Engine
- AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- Minimum Variance Unbiased N:M Sparsity for the Neural Gradients
- The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers
- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
- Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation
- Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning
- ZipLM: Inference-Aware Structured Pruning of Language Models
- VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- Efficient Methods for Natural Language Processing: A Survey
- SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models
- A Survey on Evaluation of Large Language Models
- A Survey on Model Compression for Large Language Models
- Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models
- CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X
- Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models
- Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
- Efficient Guided Generation for Large Language Models
- Fine-Tuning Language Models with Just Forward Passes
- Flash-Decoding for long-context inference
- Gradient-Free Structured Pruning with Unlabeled Data
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
- Knowledge-preserving Pruning for Pre-trained Language Models without Retraining
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory
- LLM-Pruner: On the Structural Pruning of Large Language Models
- LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery
- LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
- Post-training Quantization for Neural Networks with Provable Guarantees
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
- Pruning Large Language Models via Accuracy Predictor
- QLoRA: Efficient Finetuning of Quantized LLMs
- QuIP: Quantization with Incoherence Processing
- RPTQ: Reorder-based Post-training Quantization for Large Language Models
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
- Sparse Fine-tuning for Inference Acceleration of Large Language Models
- Sparse Iso-FLOP Transformations for Maximizing Training Efficiency
- Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging
- Ten Lessons We Have Learned in the New Sparseland: A Short Handbook for Sparse Neural Network Researchers
- The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter
- Training Transformers with 4-bit Integers
- Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering
- ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation
- FasterTransformer
2022
- Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm
- TextPruner: A Model Pruning Toolkit for Pre-Trained Language Models
- Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads
- Creating Sparse GPT-3 Models with Iterative Pruning
- LoRA: Low-rank adaptation of large language models
- SPDY: Accurate Pruning with Speedup Guarantees
- Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation
- A Fast Post-Training Pruning Framework for Transformers
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning
- ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers
- Two Sparsities Are Better Than One: Unlocking the Performance Benefits of Sparse-Sparse Networks
- Transformer Acceleration with Dynamic Sparse Attention
- An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers
- The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models
2021
- Post-training deep neural network pruning via layer-wise calibration
- BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction
- Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch
- A Greedy Algorithm for Quantizing Neural Networks
- Channel Permutations for N:M Sparsity
- Accelerating Sparse Deep Neural Networks
- Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks
2020
- Fast Sparse ConvNets
- Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference
- Movement Pruning: Adaptive Sparsity by Fine-Tuning
- GPU Kernels for Block-Sparse Weights
2019
2018
2017
- DSD: Dense-Sparse-Dense Training for Deep Neural Networks
- Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon
2016
- Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
1993
1989
References
Hot
- https://github.com/horseee/Awesome-Efficient-LLM
- https://github.com/DefTruth/Awesome-Diffusion-Inference
- https://github.com/DefTruth/Awesome-LLM-Inference
- https://github.com/AmberLJC/LLMSys-PaperList
- https://github.com/Hannibal046/Awesome-LLM
- https://github.com/AmadeusChan/Awesome-LLM-System-Papers
- https://github.com/KnowingNothing/compiler-and-arch
- https://papercopilot.com/paper-list
- https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management
- https://github.com/October2001/Awesome-KV-Cache-Compression
Cold
- https://github.com/he-y/Awesome-Pruning
- https://github.com/htqin/awesome-model-quantization
- https://github.com/csyhhu/Awesome-Deep-Neural-Network-Compression
- https://github.com/AojunZhou/Efficient-Deep-Learning
- https://github.com/chester256/Model-Compression-Papers