Below is an overview of pruning techniques; a curated collection of resources on the topic can be found here.
A pruning algorithm can usually be divided into three main parts:
- The selection of the parameters to prune.
- The methodology to perform the pruning.
- The fine-tuning of the remaining parameters.
Given a network, the lottery ticket hypothesis suggests that it contains a subnetwork that is at least as accurate as the original network. Pruning, a technique for removing weights from a network without significantly affecting its accuracy, aims to find such a winning ticket. The main advantages of a pruned network are:
- The model is smaller: it has fewer weights and therefore has a smaller memory footprint.
- The model is faster: having fewer weights can reduce the number of FLOPs required by the model.
- The training of the model is faster.
These "winning tickets" or subnetworks are found to have special properties:
- They are independent of the optimizer.
- They are transferable between similar tasks.
- The winning tickets for large-scale tasks are more transferable.
Various pruning techniques have been proposed in the literature. Here is an attempt to categorize common traits, namely scale, data dependency, granularity, initialization, reversibility, element type, and timing.
```mermaid
flowchart LR
A["Pruning \n Types"] --> B["Scale"]
B --> BA["Structured"]
B --> BB["Unstructured"]
A --> C["Data \n Dependency"]
C --> CA["Data \n Dependent"]
C --> CB["Data \n Independent"]
A --> D["Granularity"]
D --> DA["One-Shot"]
D --> DB["Iterative"]
A --> E["Initialization"]
E --> EA["Random"]
E --> EB["Later Iteration"]
E --> EC["Fine-Tuning"]
A --> F["Reversibility"]
F --> FA["Masked"]
F --> FB["Unmasked"]
A --> G["Element Type"]
G --> GA["Neuron"]
G --> GB["Connection"]
A --> H["Timings"]
H --> HA["Static"]
H --> HB["Dynamic"]
```
Unstructured pruning targets individual weights without taking the layer structure into account. Once the pruning criterion is defined, selecting the weights to eliminate is therefore fairly simple. Because the layer structure is ignored, however, this type of pruning may not improve the performance of the model: as the number of pruned weights grows, the weight matrices become more and more sparse, and sparsity requires ad hoc computational techniques that can even degrade performance if the trade-off between representation overhead and the amount of computation saved is not balanced. For this reason, the speed-up from unstructured pruning is usually only observable at high pruning ratios.

Structured pruning, instead of focusing on individual weights, prunes entire structures (e.g. filters or channels), producing a more regular sparsity that is computationally easier to exploit, at the cost of a more complex pruning algorithm.
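As a rough illustration of the difference between the two granularities, the snippet below is a minimal sketch using PyTorch's built-in `torch.nn.utils.prune` utilities; it is not tied to any specific technique discussed later.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(16, 32, kernel_size=3)

# Unstructured: zero out the 30% of individual weights with the smallest |w|
prune.l1_unstructured(conv, name="weight", amount=0.3)

# Structured: remove 25% of entire output filters (dim=0), ranked by L2 norm
prune.ln_structured(conv, name="weight", amount=0.25, n=2, dim=0)
```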
A further distinction is between pruning techniques that use only the weights as information for the pruning algorithm (data independent) and techniques that perform additional analyses requiring some input data to be provided (data dependent). Data-dependent techniques, because they require more computation, are usually more costly in terms of time and resources.
One-shot techniques establish a criterion, such as the fraction of weights to prune or the target model compression, and perform pruning in a single pass. Iterative techniques, on the other hand, adapt their learning and pruning ratio over several training epochs, usually producing much better results in terms of both the compression achieved and the limited accuracy degradation.
Especially when performing iterative pruning with different pruning ratios, there are several ways to set the initial weights between pruning stages. They can be reinitialized randomly each time, or kept from the previous epoch, fine-tuning the remaining weights to compensate for those that have been pruned. Another technique is to rewind to a later training iteration as the starting point, a trade-off between random reinitialization and keeping the weights from the previous epoch; in this way, the model has more freedom to adapt to the pruned weights and generally adjusts better to the change.
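A minimal sketch of the rewinding idea, assuming a generic training loop (the training step itself is left as a placeholder) and expressing pruning as explicit 0/1 masks:

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
rewind_state = None

for epoch in range(20):
    # train_one_epoch(model)                     # assumed training step
    if epoch == 3:                               # "later iteration" used for rewinding
        rewind_state = copy.deepcopy(model.state_dict())

# prune: mask the 50% smallest-magnitude entries of each parameter
masks = {}
with torch.no_grad():
    for name, p in model.named_parameters():
        k = max(1, p.numel() // 2)
        thr = p.abs().flatten().kthvalue(k).values
        masks[name] = (p.abs() > thr).float()

# rewind the surviving weights to their earlier values instead of re-initializing
model.load_state_dict(rewind_state)
with torch.no_grad():
    for name, p in model.named_parameters():
        p.mul_(masks[name])
```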
One of the problems with pruning is that some of the weights removed in the first iterations may actually be critical, and their saliency may become evident only as pruning proceeds. For this reason, some techniques, instead of completely removing the weights, adopt a masking approach that preserves the value of a weight and can restore it if it becomes relevant again in a later iteration.
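A minimal sketch of such a reversible mask, where the underlying weight values are never erased so that pruning can be undone (all tensors here are stand-ins):

```python
import torch

weight = torch.randn(64, 128)

# prune: mask out the 20% of weights with the smallest magnitude
k = int(0.2 * weight.numel())
threshold = weight.abs().flatten().kthvalue(k).values
mask = weight.abs() > threshold

effective_weight = weight * mask   # used in place of `weight` in the forward pass

# restore: a weight that becomes relevant again is simply unmasked
mask[0, 0] = True
```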
This refers simply to the type of element being pruned, which can be either a single connection between two neurons or an entire neuron.
Dynamic pruning is performed at runtime: it introduces some overhead, but the pruning can adapt computation by computation. Static pruning is performed before deploying the model.
The techniques explored in the literature can be grouped into clusters:
- Magnitude-based
- Filter and feature map selection
- Optimization-based
- Sensitivity-based
```mermaid
flowchart LR
A["Pruning \n Techniques"] --> B["Magnitude \n Based"]
A --> C["Filter/ \n Feature Map \n Selection"]
C --> CA["Variance Selection"]
C --> CB["Entropy Based"]
C --> CC["Zeros Counting \n (APoZ)"]
C --> CD["Filter Weight \n Summing"]
C --> CE["Geometric Mean \n Distance"]
A --> D["Optimization \n Based"]
D --> DA["Greedy Search"]
D --> DB["LASSO"]
A --> E["Sensitivity \n Based"]
E --> EA["First Order \n Taylor Expansion"]
E --> EB["Hessian \n Approximation"]
click CB "https://arxiv.org/abs/1706.05791"
```
Under the hypothesis that smaller weights have a minor impact on model accuracy, the weights that are smaller than a given threshold are pruned. To push more weights below the threshold, some regularization can be applied during training, usually an L1 or L2 penalty.
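A minimal sketch of threshold-based magnitude pruning (the threshold value is purely illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
threshold = 1e-2

# zero every weight whose absolute value falls below the threshold
with torch.no_grad():
    for param in model.parameters():
        param[param.abs() < threshold] = 0.0
```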
Inbound Pruning [1]
The inbound pruning method targets the number of channels on which each filter operates. The amount of information each channel carries is measured by the variance of the activation output of that specific channel.
Pruning of the entire network is performed sequentially over its layers: from lower to higher layers, pruning is followed by fine-tuning, which is followed by pruning of the next layer. The speed-up of the inbound pruning scheme comes directly from reducing the amount of computation that each filter performs.
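A small sketch of the variance criterion described above, using a random tensor as a stand-in for activations collected with a forward hook:

```python
import torch

# stand-in for layer activations, shaped (num_samples, channels, height, width)
activations = torch.randn(256, 64, 14, 14)

per_channel = activations.transpose(0, 1).flatten(1)   # (channels, samples*H*W)
scores = per_channel.var(dim=1)                        # one variance per channel

# channels with the lowest variance are the first candidates for pruning
candidates = scores.argsort()[:16]
```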
Reduce and Reuse [1]
To find candidates for pruning, the variance of each filter's output over a sample of the training set is calculated. Then, all filters whose score falls below a given percentile threshold are pruned.
The pruning performed requires adaptation, since the next layer expects to receive inputs of the size prior to pruning and with similar activation patterns. Therefore, the remaining output channels are reused to reconstruct each pruned channel through linear combinations, whose coefficients are collected in a matrix A.
The operation of A is added to the network by introducing a convolution layer with filters of size 1 × 1, which contains the elements of A.
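A possible sketch of the reconstruction step, assuming the kept and pruned activations have already been collected (random stand-ins below) and solving for A with least squares:

```python
import torch
import torch.nn as nn

kept = torch.randn(64, 24, 14, 14)      # (samples, kept channels, H, W)
pruned = torch.randn(64, 8, 14, 14)     # (samples, pruned channels, H, W)

# flatten spatial/sample dimensions and fit pruned ≈ kept @ A
X = kept.permute(0, 2, 3, 1).reshape(-1, kept.shape[1])       # (N*H*W, C_kept)
Y = pruned.permute(0, 2, 3, 1).reshape(-1, pruned.shape[1])   # (N*H*W, C_pruned)
A = torch.linalg.lstsq(X, Y).solution                          # (C_kept, C_pruned)

# the reconstruction is applied as a 1x1 convolution containing the elements of A
reuse = nn.Conv2d(kept.shape[1], pruned.shape[1], kernel_size=1, bias=False)
with torch.no_grad():
    reuse.weight.copy_(A.T.reshape(pruned.shape[1], kept.shape[1], 1, 1))
```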
Entropy Based Pruning [2]
An entropy-based metric is used to evaluate the weakness of each channel: a larger entropy value means the channel carries more information. First, global average pooling is used to convert the output of the layer, a $c \times h \times w$ tensor, into a $1 \times c$ vector, so that each channel is summarized by a single value per image. Collecting these vectors over a set of input images, the distribution of each channel's values is divided into bins and its entropy is computed; channels with low entropy are considered weak and are pruned.
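A small sketch of the entropy score, with a random matrix standing in for the pooled activations and an assumed number of histogram bins:

```python
import torch

# stand-in for globally average-pooled outputs over a set of images: (num_images, channels)
pooled = torch.randn(1000, 64)

def channel_entropy(values: torch.Tensor, bins: int = 100) -> torch.Tensor:
    hist = torch.histc(values, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]                      # drop empty bins
    return -(p * p.log()).sum()

scores = torch.stack([channel_entropy(pooled[:, c]) for c in range(pooled.shape[1])])
# channels with the lowest entropy carry the least information and are pruned first
```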
APoZ [3]
The Average Percentage of Zeros (APoZ) is defined to measure the percentage of zero activations of a neuron after the ReLU mapping: the activations of a neuron are collected over a validation set, and APoZ is the fraction of them that are exactly zero.
A higher mean APoZ also indicates more redundancy in a layer. Since a neural network follows a multiplication-addition-activation computation process, a neuron whose outputs are mostly zeros contributes very little to the output of subsequent layers, as well as to the final results. Thus, those neurons can be removed without harming the overall accuracy of the network too much.
Empirically, it was found that by starting the trimming from a few layers with a high mean APoZ and then progressively trimming neighboring layers, it is possible to rapidly reduce the number of neurons while maintaining the performance of the original network. Pruning the neurons whose APoZ is larger than one standard deviation from the mean APoZ of the target trimming layer was found to produce good retraining results.
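A small sketch of the APoZ computation, with a random tensor standing in for the post-ReLU activations of one layer:

```python
import torch

# stand-in for post-ReLU outputs over a validation set: (num_samples, channels, H, W)
activations = torch.relu(torch.randn(512, 64, 14, 14))

apoz = (activations == 0).float().mean(dim=(0, 2, 3))   # zero-fraction per channel

# prune channels whose APoZ exceeds the layer mean by one standard deviation
prune_mask = apoz > apoz.mean() + apoz.std()
```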
Filter Weight Summing [4]
The relative importance of a filter in each layer is measured by calculating the sum of its absolute weights.
This value gives an expectation of the magnitude of the output feature map. The filters in each layer are then sorted by this value, and those with the smallest sums are pruned together with their corresponding feature maps.
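A minimal sketch of this ranking (the pruning fraction is illustrative):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 128, kernel_size=3)

# score each output filter by the sum of the absolute values of its kernel weights
scores = conv.weight.detach().abs().flatten(1).sum(dim=1)

# prune the 30% of filters with the smallest sums
prune_idx = scores.argsort()[: int(0.3 * conv.out_channels)]
```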
Geometric Median [5]
The geometric median is used to capture the common information shared by all the filters within the i-th layer:
$$ x^{GM} = \arg\min_{x} \sum_{j'} \| x - F_{i,j'} \|_2 $$ $$ x \in \mathbb{R}^{N_i \times K \times K} $$ $$ j' \in [1, N_{i+1}] $$
Where $F_{i,j'}$ is the $j'$-th filter of the $i$-th layer, $N_i$ and $N_{i+1}$ are the numbers of input and output channels of that layer, and $K$ is the kernel size.
The filters closest to the geometric median are then found and pruned, since they can be represented by the remaining filters.
Computing the geometric median exactly is quite expensive, but it is possible instead to select the filter that minimizes the sum of distances to the other filters:
$$ F_{i,x^*} = \arg\min_{x} \sum_{j'} \| x - F_{i,j'} \|_2 $$ $$ x \in \{ F_{i,1}, \dots, F_{i,N_{i+1}} \} $$ $$ j' \in [1, N_{i+1}] $$
Since the selected filter(s) lie close to the geometric median, their information is largely shared with the other filters in the layer, so removing them causes little loss of information.
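A minimal sketch of this simplified selection, computing pairwise filter distances directly (the number of pruned filters is illustrative):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 128, kernel_size=3)

filters = conv.weight.detach().flatten(1)   # (out_channels, in_channels*K*K)
dists = torch.cdist(filters, filters)       # pairwise L2 distances between filters
total = dists.sum(dim=1)                    # sum of distances per filter

# the filters with the smallest total distance are the most "replaceable" ones
prune_idx = total.argsort()[:13]
```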
OPTIMIZATION BASED [6]
This method “starts from an empty network and greedily adds important neurons from the large network” of a pre-trained model. With this technique:
- you should obtain better accuracy than the other way around (i.e., than starting from the full network and removing neurons);
- you can also prune randomly weighted networks.
Research on the topic suggests not re-training the new sub-network from scratch, but just fine-tuning it "to leverage the information from the large model". [7]
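A rough sketch of greedy forward selection; `evaluate_loss` is an assumed helper that scores a candidate sub-network of the pre-trained model, so this is only an outline of the idea rather than the authors' algorithm.

```python
import torch

def greedy_select(num_neurons: int, budget: int, evaluate_loss) -> torch.Tensor:
    # start from an empty sub-network and add one neuron at a time
    mask = torch.zeros(num_neurons, dtype=torch.bool)
    for _ in range(budget):
        best_i, best_loss = None, float("inf")
        for i in torch.nonzero(~mask).flatten().tolist():
            mask[i] = True
            loss = evaluate_loss(mask)       # loss of the large model restricted to `mask`
            if loss < best_loss:
                best_i, best_loss = i, loss
            mask[i] = False
        mask[best_i] = True                  # keep the neuron that lowers the loss most
    return mask
```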
OBD [8]
This technique, dating back to 1989, has since been developed into newer and better-performing methodologies, but we report it here because it was a pillar of the pruning literature. The Optimal Brain Damage technique uses "second-derivative information to make a tradeoff between network complexity and training set error", which in this case means "selectively deleting weights" based on "two terms: the ordinary training error" and "some measure of the network complexity".
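For reference, the saliency OBD assigns to a weight $w_k$ under its diagonal-Hessian approximation is:

$$ s_k = \frac{1}{2} h_{kk} w_k^2 $$

where $h_{kk}$ is the $k$-th diagonal element of the Hessian of the training error with respect to the weights; the weights with the smallest saliency are the ones deleted.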
LASSO [9]
The Least Absolute Shrinkage and Selection Operator (LASSO) adds an L1 penalty that drives parameters toward exactly zero. Compared to other techniques, LASSO and the extensions derived from it can act at the level of individual weights as well as of neurons and filters, producing a higher degree of sparsity at the weight level.
In this technique, the regularization term of the loss "is constantly increased until some degree of sparsity is observed"; "there is no exact method for obtaining a satisfactory value" for its strength, which also depends on the size of the dataset.
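A minimal sketch of adding an L1 (LASSO-style) penalty to the training loss; the penalty strength is illustrative and, as noted above, has to be tuned:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
criterion = nn.CrossEntropyLoss()
l1_strength = 1e-4                          # illustrative value, must be tuned

x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
# task loss plus an L1 penalty that drives weights toward zero during training
loss = criterion(model(x), y) + l1_strength * sum(p.abs().sum() for p in model.parameters())
loss.backward()
```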
SENSITIVITY BASED [10]
Large networks might contain many connections that each contribute only a small part to the overall accuracy. In addition, redundant connections can compromise the generalization capability that a deep network should have, or the efficiency with which it can be trained, especially on edge devices. By selectively disabling connections between neurons, we can reduce the number of network connections and the number of operations needed for training. Furthermore, the extracted connection mask can be reused to train the network from scratch with only a small accuracy loss. [11]
The sensitivity measure is the probability of output inversions caused by input variations, computed over the set of input patterns. This kind of pruning can be repeated iteratively until no more connections can be removed while meeting a given performance requirement.
Taylor expansion - first order (lighter than the second-order method below) [12]
The weights and their gradients are used to rank filters, which requires far less memory than storing feature maps. The element-wise product of weights and gradients gathers, through back-propagation, exactly the first-degree Taylor polynomial of the corresponding feature map with respect to the loss, i.e. it approximates the actual damage caused by removing the filter.
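A minimal sketch of the first-order ranking, using a dummy forward/backward pass to populate the gradients:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 32, kernel_size=3)
x = torch.randn(8, 16, 28, 28)
conv(x).sum().backward()                      # stand-in for a real loss

w, g = conv.weight, conv.weight.grad
scores = (w * g).flatten(1).sum(dim=1).abs()  # one importance score per filter
prune_idx = scores.argsort()[:8]              # least important filters first
```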
Hessian approximation (HAP) - second order [13]
The HAP method is a structured pruning method that prunes channels based on their second-order sensitivity, which measures the flatness/sharpness of the loss landscape. Channels are sorted by this metric, and only the insensitive ones are pruned.
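A minimal sketch of one common way to approximate such second-order sensitivity: Hutchinson's estimator of the Hessian trace for a group of parameters, built from Hessian-vector products. This is an illustration, not the authors' exact procedure.

```python
import torch
import torch.nn as nn

model = nn.Linear(32, 4)
x, y = torch.randn(16, 32), torch.randn(16, 4)
loss = nn.functional.mse_loss(model(x), y)

params = [model.weight]
grads = torch.autograd.grad(loss, params, create_graph=True)   # keep graph for double backward

trace, num_samples = 0.0, 10
for _ in range(num_samples):
    vecs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]   # Rademacher +-1
    hv = torch.autograd.grad(grads, params, grad_outputs=vecs, retain_graph=True)
    trace += sum((v * h).sum().item() for v, h in zip(vecs, hv))
trace /= num_samples    # larger trace -> sharper loss landscape -> more sensitive parameters
```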
[1] Polyak, A., & Wolf, L. "Channel-level acceleration of deep face representations". IEEE Access (2015).
[2] Luo, J.-H., & Wu, J. "An Entropy-based Pruning Method for CNN Compression". arXiv preprint arXiv:1706.05791 (2017).
[3] Hu, H., Peng, R., Tai, Y. W., & Tang, C. K. "Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures". arXiv preprint arXiv:1607.03250 (2016).
[4] Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. "Pruning Filters for Efficient ConvNets". arXiv preprint arXiv:1608.08710 (2016).
[5] He, Y., Liu, P., Wang, Z., Hu, Z., & Yang, Y. "Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration". Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019).
[6] Ekbom, L. "Exploring the LASSO as a Pruning Method" (2019).
[7] Ye, M., Wu, L., & Liu, Q. "Greedy Optimization Provably Wins the Lottery: Logarithmic Number of Winning Tickets is Enough". Advances in Neural Information Processing Systems 33 (2020).
[8] LeCun, Y., Denker, J. S., & Solla, S. A. "Optimal Brain Damage". Advances in Neural Information Processing Systems 2 (1989).
[9] Ekbom, L. "Exploring the LASSO as a Pruning Method" (2019).
[10] Dajie Cooper, Y. "Sensitivity Based Network Pruning: A Modern Perspective".
[11] Zeng, X., et al. "A Sensitivity-based Approach for Pruning Architecture of Madalines". Neural Computing and Applications 18.8 (2009).
[12] Mengqing, W. "Pruning Convolutional Filters with First Order Taylor Series Ranking".
[13] Yu, S., et al. "Hessian-Aware Pruning and Optimal Neural Implant". Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (2022).