Two Sparsities Are Better Than One: Unlocking the Performance Benefits of Sparse-Sparse Networks
Combine weight sparsity and activation sparsity
Method
Sparse weight and dense activation
- (a) Combine: multiple sparse weight structures are overlaid to form a single dense structure. This is done offline as a preprocessing step.
- (b) Multiply: each element of the activation is multiplied by the corresponding weight element in the dense structure (Hadamard product).
- (c) Route: the appropriate element-wise products are routed separately for each output.
- (d) Sum: routed products are aggregated and summed to form a separate result for each sparse structure.
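The four steps above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: it assumes the sparse weight structures have disjoint non-zero supports (so the overlay is collision-free), and all names (`weights`, `dense`, `owner`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 12, 3  # vector length, number of sparse weight structures

# Offline setup: K sparse weight vectors with disjoint non-zero supports,
# i.e. the positions are partitioned among the structures (an assumption
# of complementary sparsity).
positions = rng.permutation(N).reshape(K, N // K)
weights = np.zeros((K, N))
for k in range(K):
    weights[k, positions[k]] = rng.standard_normal(N // K)

# (a) Combine: overlay the sparse structures into one dense vector, and
# record which structure owns each position (the routing table).
dense = weights.sum(axis=0)
owner = np.full(N, -1)
for k in range(K):
    owner[positions[k]] = k

# Online steps:
x = rng.standard_normal(N)    # dense activation
prod = dense * x              # (b) Multiply: Hadamard product
out = np.zeros(K)
np.add.at(out, owner, prod)   # (c) Route + (d) Sum: each product is
                              # accumulated into its owning structure

# Check against computing each sparse dot product separately.
ref = weights @ x
assert np.allclose(out, ref)
```

The key point is that a single dense multiply plus an index-based scatter-add replaces K separate sparse dot products.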
Sparse weight and sparse activation
- (a) Combine: multiple sparse weight structures are overlaid to form a single dense structure. This is done offline as a preprocessing step.
- (b) Select: a k-winner-take-all (k-WTA) component determines the top-k activations and their indices.
- (c) Multiply: each non-zero activation is multiplied by the corresponding weight elements in the dense structure (Hadamard product).
- (d) Route: the appropriate element-wise products are routed separately for each output.
- (e) Sum: routed products are aggregated and summed to form a separate result for each sparse structure.
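The sparse-sparse variant can be sketched the same way, now with a k-WTA select step so that only the winning activations are multiplied and routed. As before, this is an illustrative sketch assuming disjoint weight supports; the names (`TOPK`, `owner`, `idx`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, TOPK = 12, 3, 4  # vector length, weight structures, k-WTA winners

# Offline: disjoint sparse weight structures overlaid into one dense
# vector, with a routing table mapping each position to its owner.
positions = rng.permutation(N).reshape(K, N // K)
weights = np.zeros((K, N))
for k in range(K):
    weights[k, positions[k]] = rng.standard_normal(N // K)
dense = weights.sum(axis=0)
owner = np.full(N, -1)
for k in range(K):
    owner[positions[k]] = k

# Online steps:
x_raw = rng.standard_normal(N)

# (b) Select: k-WTA keeps the TOPK largest activations, zeroing the rest.
idx = np.argpartition(x_raw, -TOPK)[-TOPK:]  # indices of the winners
x = np.zeros(N)
x[idx] = x_raw[idx]

# (c) Multiply: only the surviving activations are multiplied.
prod = dense[idx] * x[idx]

# (d) Route + (e) Sum: each product is accumulated into the structure
# that owns its index.
out = np.zeros(K)
np.add.at(out, owner[idx], prod)

# Check against the full sparse matrix-vector product.
ref = weights @ x
assert np.allclose(out, ref)
```

Compared with the dense-activation pipeline, the multiply and route steps now touch only TOPK elements instead of N, which is where the second sparsity pays off.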