Knowledge-preserving Pruning for Pre-trained Language Models without Retraining

This is a retraning-free structured pruning approach.

Method

  • Key idea
  • Selecting pruning targets
    • Neurons and attention heads that minimally reduce the PLM’s knowledge
  • Iterative pruning
    • Use knowledge reconstruction for each sub-layer to handle the distorted inputs by pruning.
  • K-pruning (Knowledge-preserving pruning)
  • knowledge measurement
  • knowledge-preserving mask search
  • knowledge-preserving pruning

Transformer Block consists of MHA and MLP.

The model-wise predictive knowledge loss is defined as the KL-divergence of logits between the pruned model and the dense model.

The sub-layerwise representational knowledge loss is defined as the F-norm (MSE loss) of the outputs.

The improtance scores are defined as:

where and .

  • 不同层之间的score是可以互相比较的吗?
  • 对于MLP, 取值非常小,只看predictive loss,可以跨层比较
  • 但是对于MHA, 取值比较大,predictive/representational 都看,两者兼顾
  • MHA 与 MLP 也可以互相比较?
  • 通过配比 来实现

总之,这些超参数的引入用来均衡,跨层与跨算子的比较。

Results