Knowledge-preserving Pruning for Pre-trained Language Models without Retraining
This is a retraining-free structured pruning approach.
Method
- Key idea
- Selecting pruning targets
- Neurons and attention heads that minimally reduce the PLM’s knowledge
- Iterative pruning
- Use knowledge reconstruction for each sub-layer to handle inputs distorted by pruning.
- K-pruning (Knowledge-preserving pruning)
- knowledge measurement
- knowledge-preserving mask search
- knowledge-preserving pruning
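The knowledge-reconstruction idea above (re-fitting each sub-layer after pruning so its output stays close to the dense model's) can be sketched with a toy least-squares example. The shapes, the `keep` set, and the `lstsq`-based re-fit are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sub-layer: output = X @ W, with W of shape (d_in, d_out).
# Pruning removes input neurons (rows of W); the surviving weights are
# then re-fit by least squares so the sub-layer output stays close to
# the dense output ("knowledge reconstruction").
d_in, d_out, n = 8, 4, 64
X = rng.normal(size=(n, d_in))
W = rng.normal(size=(d_in, d_out))
Y_dense = X @ W                      # target: dense sub-layer output

keep = [0, 2, 3, 5, 6]               # hypothetical surviving neurons
X_kept = X[:, keep]

# Naive pruning: just drop the pruned rows of W.
Y_naive = X_kept @ W[keep]

# Reconstruction: solve min_W' || X_kept @ W' - Y_dense ||_F.
W_fit, *_ = np.linalg.lstsq(X_kept, Y_dense, rcond=None)
Y_fit = X_kept @ W_fit

err_naive = np.linalg.norm(Y_naive - Y_dense)
err_fit = np.linalg.norm(Y_fit - Y_dense)
print(err_naive, err_fit)            # reconstruction error is smaller
```

Since `lstsq` minimizes the Frobenius error over all weight matrices (including the naively pruned one), the reconstructed error can never exceed the naive error.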
A Transformer block consists of an MHA (multi-head attention) sub-layer and an MLP sub-layer.
The model-wise predictive knowledge loss is defined as the KL divergence between the logits of the pruned model and the dense model.
The sub-layer-wise representational knowledge loss is defined as the squared Frobenius norm (an MSE-style loss) of the difference between the pruned and dense sub-layer outputs.
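A minimal numpy sketch of these two losses, assuming logits and sub-layer outputs are plain arrays; the function names are my own, not the paper's:

```python
import numpy as np

def kl_pred_loss(logits_dense, logits_pruned):
    """Predictive knowledge loss: KL(p_dense || p_pruned) over the output
    logits, averaged across samples (a sketch, not the paper's code)."""
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    lp_d, lp_p = log_softmax(logits_dense), log_softmax(logits_pruned)
    return (np.exp(lp_d) * (lp_d - lp_p)).sum(axis=-1).mean()

def rep_loss(out_dense, out_pruned):
    """Representational knowledge loss: squared Frobenius norm of the
    sub-layer output difference."""
    return np.linalg.norm(out_dense - out_pruned) ** 2

dense = np.array([[2.0, 0.5, -1.0]])
pruned = np.array([[1.8, 0.7, -0.9]])
print(kl_pred_loss(dense, pruned), rep_loss(dense, pruned))
```

Both losses are zero when the pruned model exactly matches the dense model, and grow as pruning distorts logits or hidden outputs.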
The importance scores are defined (roughly) as S = μ · (K_pred + λ · K_rep), where K_pred is the predictive knowledge loss and K_rep is the representational knowledge loss of pruning that unit, λ balances the two terms, and μ rescales MHA scores relative to MLP scores.
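A hedged sketch of combining the two knowledge losses into one importance score per prunable unit; the function name, `lam`/`mu` values, and all numbers here are illustrative assumptions reconstructed from these notes, not the paper's exact equations:

```python
import numpy as np

def importance(k_pred, k_rep, lam, mu=1.0):
    # lam weights the representational term; mu rescales one operator
    # type (e.g. MHA) so its scores are comparable to the other (MLP).
    return mu * (k_pred + lam * k_rep)

# Hypothetical per-head knowledge losses for one MHA sub-layer.
k_pred_heads = np.array([0.10, 0.02, 0.30])
k_rep_heads = np.array([1.5, 0.4, 2.0])
scores_mha = importance(k_pred_heads, k_rep_heads, lam=0.05, mu=64.0)

# Hypothetical per-neuron losses for one MLP sub-layer; with a tiny
# lam the predictive loss dominates, as the notes describe.
k_pred_neurons = np.array([0.01, 0.20])
k_rep_neurons = np.array([0.3, 0.1])
scores_mlp = importance(k_pred_neurons, k_rep_neurons, lam=1e-4)
```

With a shared scale, scores from both operator types can be ranked in a single list.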
- Are the scores comparable across different layers?
- For MLP, the weighting coefficient is very small, so effectively only the predictive loss is used, which makes the scores comparable across layers.
- For MHA, the coefficient is larger, so both the predictive and representational losses contribute.
- Can MHA and MLP scores also be compared with each other?
- Yes, via a scaling ratio between the two operator types.
In short, these hyperparameters are introduced to balance the comparison of scores across layers and across operator types.
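Under these assumptions, the knowledge-preserving mask search could reduce to a top-k selection over the unified scores; this is a sketch of that idea, not the paper's actual (iterative) search:

```python
import numpy as np

def search_mask(scores, n_keep):
    """Keep the n_keep highest-importance units; prune the rest."""
    keep = np.argsort(scores)[::-1][:n_keep]   # highest scores survive
    mask = np.zeros_like(scores, dtype=bool)
    mask[keep] = True
    return mask

# One global score list, e.g. MHA heads and MLP neurons concatenated
# after rescaling so they are mutually comparable.
scores = np.array([0.9, 0.1, 0.5, 0.05, 0.7])
mask = search_mask(scores, n_keep=3)
print(mask)  # units 0, 2, 4 kept
```

In the paper's iterative scheme, such a selection would be interleaved with the per-sub-layer reconstruction step, so that scores are recomputed on the already-pruned inputs.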