Exploiting prunability for person re-identification

Recent years have witnessed a substantial increase in the number of deep learning (DL) architectures proposed for visual recognition tasks like person re-identification, where individuals must be recognized over multiple distributed cameras. Although these architectures have greatly improved the state-of-the-art accuracy, the computational complexity of the convolutional neural networks (CNNs) commonly used for feature extraction remains an issue, hindering their deployment on platforms with limited resources or in applications with real-time constraints. There is an obvious advantage to accelerating and compressing DL models without significantly decreasing their accuracy. However, the source (pruning) domain differs from the operational (target) domains, and the domain shift between image data captured with different non-overlapping camera viewpoints leads to lower recognition accuracy. In this paper, we investigate the prunability of these architectures under different design scenarios. We first revisit pruning techniques that are suitable for reducing the computational complexity of deep CNNs applied to person re-identification. These techniques are then analyzed according to their pruning criteria and strategy, and according to different scenarios for exploiting pruning methods when fine-tuning networks to target domains. Experimental results obtained using DL models with ResNet feature extractors, and multiple benchmark re-identification datasets, indicate that pruning can considerably reduce network complexity while maintaining a high level of accuracy. In scenarios where pruning is performed with large pretraining or fine-tuning datasets, the number of FLOPs required by ResNet architectures is reduced by half, while maintaining a comparable rank-1 accuracy (within 1% of the original model). Pruning while training a larger CNN can also provide significantly better performance than fine-tuning a smaller one.

(2021) 2021: 22 Page 2 of 31

To deploy DL models on compact platforms with limited resources (e.g., embedded systems, mobile phones, portable devices), and for real-time processing (e.g., video surveillance and monitoring, virtual reality), their time and memory complexity and energy consumption should be reduced [1]. Consequently, there is a growing interest in effective methods able to accelerate and compress deep networks. Providing a reasonable trade-off between accuracy and efficiency has become an important concern in person re-identification (ReID), a key function needed in a wide range of video analytics and surveillance applications. Systems for person ReID typically seek to recognize the same individuals that previously appeared over a non-overlapping network of video surveillance cameras (see illustration in Fig. 1). These systems face many challenges in real-world applications that are related to either the image data or the network architecture. Data-related challenges that affect the ReID accuracy include the limited availability of annotated training data, ambiguous annotations, domain shifts across camera viewpoints, limitations of person detection and tracking techniques, occlusions, variations in pose, scale and illumination, and low-resolution images.
To address these issues, state-of-the-art DL models (e.g., deep Siamese networks) for person ReID often rely on CNNs for feature extraction to learn an embedding in an end-to-end fashion, where similar image pairs (with the same identity) are close to each other, while dissimilar image pairs (with different identities) are distant from each other [2][3][4][5][6][7][8][9][10]. While state-of-the-art approaches can provide a high level of accuracy, this performance comes at the cost of millions or even billions of parameters, a challenging training procedure, and the requirement for GPU acceleration. For instance, the ResNet50 CNN [11], with its 50 convolutional layers, contains about 23.5M parameters (stored in 85.94MB of memory) and requires 6.3 billion floating point operations (FLOPs) to process a color image of size 256 × 128 × 3. The complexity of these networks limits their deployment in many real-time applications, or on resource-limited platforms. Consequently, there has been a great deal of interest from the computer vision and machine learning communities in developing methods able to accelerate and compress such networks, as well as other DL architectures, without compromising their predictive accuracy. The time complexity of a CNN generally depends more on the convolutional layers, while the fully connected layers contain most of the parameters (memory complexity). Therefore, CNN acceleration methods typically target the complexity of the convolutional layers, while compression methods usually target the complexity of the fully connected layers [12,13]. State-of-the-art approaches for acceleration and compression of deep neural networks can be divided into five categories: low-rank factorization, transferred convolutional channels, knowledge distillation, quantization, and pruning.
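The split between time and memory complexity can be illustrated with a rough count of parameters and multiply-accumulate operations (MACs) for a single layer. This is only a sketch: the layer sizes are hypothetical, biases are ignored, and reported FLOP figures in the literature may count either MACs or twice that number.

```python
def conv2d_cost(n_in, n_out, k, h_out, w_out):
    """Return (parameters, MACs) of a k x k convolution without bias."""
    params = n_out * n_in * k * k
    macs = params * h_out * w_out  # each weight is reused at every output position
    return params, macs


def linear_cost(n_in, n_out):
    """Return (parameters, MACs) of a fully connected layer without bias."""
    params = n_in * n_out
    return params, params  # each weight is used exactly once


# Hypothetical layer sizes, chosen only for illustration.
conv_params, conv_macs = conv2d_cost(n_in=64, n_out=64, k=3, h_out=64, w_out=32)
fc_params, fc_macs = linear_cost(n_in=2048, n_out=751)

# Removing half of the output channels halves both quantities for this layer.
pruned_params, pruned_macs = conv2d_cost(n_in=64, n_out=32, k=3, h_out=64, w_out=32)
```

In this toy setting the convolution has far fewer parameters than the fully connected layer but requires many more operations, which is why channel pruning of convolutional layers mainly targets acceleration.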
Low-rank factorization approaches [14][15][16][17][18][19] accelerate CNNs by performing matrix decomposition to estimate the informative parameters of a network. However, low-rank approaches suffer from a number of issues: computationally expensive matrix decomposition, layer-by-layer low-rank approximation that diminishes the possibility of global compression, and extensive model retraining to achieve convergence. Some network acceleration and compression approaches [20][21][22][23], categorized as transferred convolutional channels, design special structural convolutional channels to reduce the parameter space, which eventually improves computational efficiency. However, the transfer assumptions are sometimes so strong that they make the learning process unstable.
Knowledge distillation approaches [24][25][26][27] train a smaller or shallower deep network (the student) using the distilled knowledge of a larger deep network (the teacher). These approaches can yield improvements in terms of sparsity and generalization of the student networks, but they can only be applied to classification tasks, and the restrictive assumptions of this approach lead to inferior performance compared to other types of approaches. A deep neural network can also be accelerated by reducing the precision of its parameters. Using quantization approaches, each parameter of a network is represented with a reduced bit rate, either by reducing the precision, employing a lookup table, or combining similar values. Most quantization approaches [12,28,29] require extra computational time to access a lookup table, or for decoding so that the original value is restored. In contrast, pruning seeks to reduce the number of connections, retraining either the whole network or part of it afterwards. Pruning methods typically focus on selecting and removing the weights or channels with the least impact on performance. Thus, in addition to accelerating and compressing the network, pruning methods can provide additional benefits such as reducing overfitting and thereby improving generalization. Pruning approaches have therefore drawn a great deal of attention from the network compression community. Challenges of pruning include the lack of data for pruning during the fine-tuning phase, the computational complexity associated with retraining after a pruning phase, and the reduced learning capacity of the model, which can impact accuracy when the learning step is done on the pruned model.
This paper focuses on pruning techniques [13][30][31][32][33] since they are among the most widely used for acceleration and compression of deep neural networks, and have shown their effectiveness on well-known CNNs and for several general image classification problems like CIFAR10, MNIST, and ImageNet. State-of-the-art pruning techniques can be categorized according to the pruning criteria they use to select channels and to their strategy to reduce channels, and are suitable for compressing DL models like Siamese networks for applications in person ReID. In particular, state-of-the-art techniques can be categorized using criteria based on weights and on feature maps. We also distinguish techniques according to their pruning strategy, among those that (1) prune once and then fine-tune, (2) prune iteratively on a trained model, (3) prune using regularization, (4) prune by minimizing the reconstruction error, and (5) prune progressively. This paper revisits the pruning techniques that are suitable for reducing the computational complexity of CNNs applied to person ReID. These techniques are then analyzed according to their pruning criteria and strategy and according to different design scenarios. Different design pipelines or scenarios are proposed to leverage these state-of-the-art pruning methods during pretraining and/or fine-tuning. A typical design scenario consists of four stages: (1) training a CNN with a large-scale dataset from the source domain (e.g., ImageNet), (2) pruning the trained large model based on some criterion to select channels to be eliminated, (3) retraining the pruned network to regain the accuracy, and finally, (4) fine-tuning the retrained network using a limited dataset from the target application.
A common assumption with this design scenario is that training a large and over-parameterized CNN, using a large-scale dataset, is necessary to provide a discriminant feature representation. The pruning process then identifies a set of redundant channels whose removal does not significantly reduce accuracy. Under this assumption, a pruned CNN for ReID is expected to outperform a smaller network trained from scratch [33][34][35][36]. Thus, most of the approaches in the literature tend to prune channels of a fine-tuned network, rather than a pretrained network. This paper presents other design scenarios that apply when pruning networks that have been pretrained on a large dataset and that require fine-tuning to a given target domain.
Finally, this paper presents an extensive experimental comparison of different pruning techniques and relevant design scenarios on three benchmark person ReID datasets: Market-1501 [37], CUHK03-NP [38], and DukeMTMC-reID [39]. Pruning techniques are compared in terms of accuracy and complexity on different DL architectures with ResNet feature extractors, with different ReID applications in mind.
The rest of the paper is organized as follows. Section 2 provides some background on DL models for person ReID. Section 3 provides a survey of the state-of-the-art techniques for pruning CNNs. Finally, Sections 5 and 6 describe the experimental methodology (benchmark datasets, protocol, and performance measures) and comparative results, respectively.

Deep neural networks for person re-identification
State-of-the-art techniques for person ReID mostly rely on two types of losses: metric learning loss and multi-class classification loss. With the first type, a dataset with images from different individuals is learned using a Siamese network that optimizes a metric loss function (such as contrastive loss, triplet loss, quadruplet loss, or hard-aware point-to-set (HAP2S) loss) [2,4,7,40,41] to provide a feature embedding for pairwise similarity matching. With the second type, ReID approaches rely on a multi-class classification loss (such as softmax or cross-entropy loss), also known as ID loss in the ReID community. Several of these approaches [42][43][44][45][46] learn part-based local features to form a more informative feature descriptor.

Table 1 Summary of common loss functions applied in person ReID

| Category | Loss function | References |
|---|---|---|
| Metric learning | Contrastive loss | [4] |
| Metric learning | Triplet loss | [47] |
| Metric learning | Triplet loss with margin | [7] |
| Classification | Cross-entropy loss | [2,38,50] |
| Classification | Part-based cross-entropy loss | [43][44][45][46] |

Table 1 provides a summary of common loss functions from both categories applied in person ReID. For all the losses, $d$ represents the Euclidean distance, $m$, $m_1$, and $m_2$ denote the margin parameters, and $[\cdot]_+ = \max(\cdot, 0)$. In this table, $X = \{x_i\}_{i=1}^{N}$ is a training mini-batch with labels $\{y_i\}_{i=1}^{N}$. For contrastive loss, the network transforms the pair of input images $(x_{1,i}, x_{2,i})$ into feature embeddings $(f_{1,i}, f_{2,i})$. The labels are either $y_i = 0$ for positive pairs or $y_i = 1$ for negative pairs. For triplet loss, we sample $(x_a, x_p, x_n)$, where the anchor $x_a$ and the positive $x_p$ are two images from the same person, while the negative $x_n$ is an image from another person. The corresponding feature embeddings are $(f_a, f_p, f_n)$. For quadruplet loss, we sample $(x_a, x_p, x_n, x_k)$, where the anchor $x_a$ and the positive $x_p$ are two images from the same person, while the negatives $x_n$ and $x_k$ are images from different persons. The corresponding feature embeddings are $(f_a, f_p, f_n, f_k)$. For HAP2S loss, $S_p$ and $S_n$ denote the sets of positive and negative samples, respectively, with respect to the anchor $f_a$. For magnet loss, $C$ is the number of classes (individuals), $\mu$ is the sample mean of class $y$, and $\sigma^2$ is the variance of all samples away from their class mean. In the classification loss category, for cross-entropy loss, $W_{y_i}$ is the weight vector of the fully connected layer applied to the feature embedding $f_i$. For cosine softmax loss, the weight vector $W_{y_i}$ and the feature vector $f_i$ are normalized, and for part-based cross-entropy loss, $L^{p}_{CE}$ represents the cross-entropy loss of an individual part $p \in \{1, \dots, P\}$.
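As a concrete reading of the notation above, the following minimal sketch implements commonly used forms of the contrastive and triplet losses, with $y_i = 0$ for positive pairs and $y_i = 1$ for negative pairs as stated in the text. The exact expressions in Table 1 may differ (for example in the use of squared distances or scaling factors), and the default margins are assumptions chosen for illustration.

```python
import numpy as np


def euclidean(a, b):
    """Row-wise Euclidean distance d between two batches of embeddings."""
    return np.linalg.norm(a - b, axis=1)


def contrastive_loss(f1, f2, y, m=1.0):
    """y = 0 for positive pairs, y = 1 for negative pairs (convention of the text)."""
    d = euclidean(f1, f2)
    pos = (1 - y) * d ** 2                   # pull positive pairs together
    neg = y * np.maximum(m - d, 0.0) ** 2    # push negative pairs beyond margin m
    return float(np.mean(pos + neg))


def triplet_loss(fa, fp, fn, m=0.3):
    """[d(fa, fp) - d(fa, fn) + m]_+ averaged over the mini-batch."""
    return float(np.mean(np.maximum(euclidean(fa, fp) - euclidean(fa, fn) + m, 0.0)))


fa = np.array([[0.0, 0.0]])
fp = np.array([[0.0, 1.0]])   # distance 1 from the anchor
fn = np.array([[3.0, 4.0]])   # distance 5 from the anchor
```

With these toy embeddings the triplet constraint is already satisfied, so `triplet_loss(fa, fp, fn)` is zero, whereas swapping the positive and negative samples produces a positive loss.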

Metric loss
The idea of using deep Siamese networks for biometric authentication and verification originates from Bromley et al. [51], where two sub-networks with shared weights encode feature embeddings for pairwise matching between a query and reference (gallery) images. These networks were first used for ReID in [40], which employs three feature extraction sub-networks for deep feature learning. Then, various deep learning architectures were proposed to learn discriminative feature embeddings. Most of these architectures [4,5,7,8,41,47][48][49] employ end-to-end training, where both the feature embedding and the metric are learned as a joint optimization problem.
There are a number of metric learning losses that are widely used for optimizing deep ReID architectures. Contrastive loss is used in [4] to optimize a Siamese network that minimizes the distance between samples of the same class and forces a margin between samples of different classes. Triplet loss was first used for ReID in [47], which directly optimizes an embedding layer in Euclidean space by comparing the relative distances of three training samples, namely an anchor image, a positive image sample from the same individual, and a negative sample from a different individual. In [7], the original triplet loss is modified by adding an additional positive-pair constraint. A different version of triplet loss, named quadruplet loss, is proposed in [5]; it enlarges inter-class variations and reduces intra-class variations. Hermans et al. [41] extend the triplet loss by designing a simple batch-hard mining scheme that selects the hardest positive and hardest negative of each anchor in a mini-batch. In [48], a soft hard-sample mining scheme is proposed by adaptively assigning weights to hard samples. On the other hand, magnet loss [49] is formulated as a negative log-likelihood ratio between the correct class and all other classes, but also forces a margin between samples of different classes. A thorough study of the state-of-the-art metric learning losses for ReID suggests that triplet loss is the most widely used loss to optimize deep ReID architectures. Among the different versions of the triplet loss, the hard mining-based triplet loss proposed by Hermans et al. [41] is the simplest and most efficient, since it does not require changing the backbone architecture to obtain the final feature embedding.
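The mining scheme of [41], which selects the hardest positive and the hardest negative for each anchor in a mini-batch, can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation; the margin value and the toy one-dimensional embeddings are assumptions, and every identity is assumed to have at least one negative in the batch.

```python
import numpy as np


def batch_hard_triplet_loss(emb, labels, m=0.3):
    """For each anchor, use its farthest positive and closest negative in the batch."""
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))            # pairwise Euclidean distances
    same = labels[:, None] == labels[None, :]      # same-identity mask
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(~same, dist, np.inf).min(axis=1)
    return float(np.mean(np.maximum(hardest_pos - hardest_neg + m, 0.0)))


emb = np.array([[0.0], [1.0], [4.0], [6.0]])   # toy embeddings
labels = np.array([0, 0, 1, 1])                # two identities, two images each
loss = batch_hard_triplet_loss(emb, labels)
```

Because only one triplet per anchor contributes to the loss, no change to the backbone is needed: the function operates directly on the embeddings produced for a mini-batch.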

Multi-class classification loss
There has been an alternative trend that addresses ReID as a multi-class classification problem, where each ID is considered as a class. The objective of the classification loss is to determine whether the images of an input pair belong to the same identity, which makes full use of the ReID labels by comparing them with the labels predicted by the classification network. Some of the state-of-the-art classification-based ReID approaches [2,38,50] employ a cross-entropy loss for image pairs, with a network that takes pairwise images as inputs and outputs the verification probability. Some other state-of-the-art ReID approaches [52,53] use a margin-based loss to keep the largest possible separation between the positive and negative pairs.
Recently, many works have focused on learning local part-based features, using a simple classification loss-based network applied to multiple local parts of a single image. Most of these approaches obtain discriminative part-based feature representations by taking the local features either from human body parts or by dividing the global features. State-of-the-art approaches [44][45][46] rely on dividing the human body into parts based on either external pose estimation or external semantic segmentation, leveraging the semantic partitions for deeply learned part-based features. However, they depend heavily on the efficiency of the external pose estimation or semantic segmentation techniques. In addition, they suffer from misalignment issues. To address these issues, state-of-the-art approaches [43] take the global features and then divide them into parts or stripes. The advantage of using global features for local feature representations is twofold: (i) they do not suffer from misalignment caused by inaccurate bounding box detection, human pose changes, and various human spatial distributions, and (ii) different channels of the global feature have different recognition patterns, which increases the discriminative ability of the extracted feature by paying weighted attention to different parts of the human body. Thus, we focus on state-of-the-art methods that partition global features to form local part-based feature representations.
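Partitioning a global feature map into stripes can be sketched as below. The choice of horizontal stripes and average pooling is an assumption made for illustration, as are the tensor sizes; in a part-based model, each resulting descriptor would feed its own classifier and cross-entropy loss.

```python
import numpy as np


def part_features(feature_map, num_parts):
    """Split a (C, H, W) global feature map into horizontal stripes and
    average-pool each stripe into one C-dimensional part descriptor."""
    stripes = np.array_split(feature_map, num_parts, axis=1)  # split along height
    return np.stack([s.mean(axis=(1, 2)) for s in stripes])   # (num_parts, C)


# Hypothetical feature map with C=2 channels, H=6 rows, W=1 column.
fmap = np.arange(2 * 6 * 1, dtype=float).reshape(2, 6, 1)
parts = part_features(fmap, num_parts=3)
```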

Multiple losses
Recently, there have been efforts [54][55][56][57] to adopt multi-loss training strategies. More specifically, a combination of cross-entropy and triplet losses has proven to be effective to optimize ReID networks. The aim of this combination is to increase the discriminant power of the feature embedding by optimizing different objective functions. In [54], an omni-scale feature learning scheme is designed to capture the salient features at different scales; the authors suggest optimizing the ReID network with a combination of cross-entropy and triplet losses for better performance. To achieve a similar objective of multiscale feature learning, Niki et al. [58] proposed a pyramid-inspired deep ReID architecture where multiple loss functions are combined with a curriculum learning strategy to optimize the network. In [59], an attention-driven Siamese learning architecture is designed to integrate attention and attention consistency by jointly optimizing the cross-entropy loss, the identification attention loss, and the Siamese attention loss. Chen et al. [60] use a reinforcement learning technique to quantify the attention quality and provide a powerful supervisory signal to guide the learning process. Following the same trend, they use the combination of cross-entropy and triplet losses to optimize their proposed architecture. All of the ReID approaches above focus on improving the recognition accuracy without addressing scalability or reducing computational costs. A few ReID approaches [61][62][63][64][65][66] seek to address the issue of computational complexity. Some of these approaches [61,62] rely on distillation, where knowledge is distilled from a deeper CNN (teacher model) to a lighter CNN (student model). Other approaches [63,65,66] rely on hashing to learn binary representations instead of real-valued features for faster computation.
In contrast to these approaches, our proposed ReID approach relies on pruning techniques that compress the deep CNN with a marginal reduction of recognition accuracy. Additionally, we propose different design scenarios or pipelines for leveraging a pruning method during the deployment of a CNN for a target ReID application domain.

Techniques for pruning CNNs
The objective of pruning is to remove unnecessary parameters from a neural network, while trying to maintain a comparable accuracy. Currently, pruning techniques operate on two different levels. First, techniques for weight-level pruning focus on pruning individual weights of a network. In contrast, techniques for channel-level pruning focus on removing all the parameters of the output and input channels of convolution layers. While weight pruning techniques can achieve a high compression rate and good acceleration, their performance depends on an efficient sparse convolution algorithm, which is not available, or does not perform well, on all platforms. In this paper, we focus on channel pruning techniques, which do not rely on other algorithms and have been extensively studied in the literature. This section presents a survey of channel pruning techniques and a summary of experimental results reported in the literature. Table 2 presents the main properties of different pruning techniques according to the strategy used to reduce channels. In order to facilitate the analysis of different pruning methods, we also categorize techniques according to the type of pruning criterion. In this table, "prune in one step" refers to techniques that prune the network one time and then fine-tune the network [33,34,67]. "Prune iteratively" refers to pruning that is done iteratively on a trained model, alternating between pruning and fine-tuning [32]. Pruning by regularization is usually done by adding a regularization term to the original loss function in order to leave the pruning process to the optimization [69,70]. Pruning by minimizing the reconstruction error is a family of algorithms that tries to minimize the difference of outputs between the pruned and the original model. "Progressive pruning," while very similar to iterative pruning, differs in that it can start directly from a model that was not trained and progressively prune it during training.
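The effect of channel-level pruning on two consecutive convolution layers can be sketched as follows: removing an output channel of one layer also removes the matching input channel of the next layer. The layer shapes and the indices of the dropped channels are hypothetical, and details such as batch normalization parameters are ignored.

```python
import numpy as np


def prune_output_channels(w_cur, w_next, drop):
    """Remove output channels `drop` from w_cur (n_out, n_in, k, k) and the
    matching input channels from the next layer w_next (n_out2, n_out, k, k)."""
    keep = np.setdiff1d(np.arange(w_cur.shape[0]), drop)
    return w_cur[keep], w_next[:, keep]


rng = np.random.default_rng(0)
w1 = rng.normal(size=(8, 3, 3, 3))    # hypothetical layer: 3 -> 8 channels
w2 = rng.normal(size=(16, 8, 3, 3))   # hypothetical layer: 8 -> 16 channels
w1_p, w2_p = prune_output_channels(w1, w2, drop=[1, 5])
```

The pruned tensors remain dense, so no sparse convolution routine is required to benefit from the reduction.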

Channel pruning taxonomy
One key challenge of pruning neural networks is selecting the pruning criterion. It should make it possible to distinguish the parameters that contribute to accuracy from the ones that do not. Another challenge is finding an optimal compression ratio, which determines the compromise between the reduction of model complexity and the loss of accuracy. A final challenge is the retraining and pruning schedule of the model. Pruning can be performed in one iteration, but the damage caused to the network may be considerable. Conversely, we could prune and retrain iteratively to reduce this damage at each iteration, but this takes longer to apply. The retraining of the pruned network may also cause the model to overfit or get caught in local minima.

Description of methods
This subsection presents different pruning algorithms for each pruning family in the taxonomy. To ease our notation, we refer to a convolution tensor as $W$ with $W \in \mathbb{R}^{n_{out} \times n_{in} \times k \times k}$, where $n_{in}$ is the number of input channels, $n_{out}$ the number of output channels, and $k$ the kernel size. An output channel tensor $i$ is then defined as $W_i$, and an individual weight is defined as $w$. For feature maps, $H$ represents the output of a convolution layer, and $H_i$ represents output channel $i$ of that feature map. For ease of notation, we do not mention the layer index unless necessary; therefore, $W$ or $H$ can be any convolution layer or feature map at any index.

Table 2 Main properties of different channel pruning techniques

| Strategy | Method | Criteria |
|---|---|---|
| Prune in one step | L1 [33] | Weights |
| Prune in one step | Entropy [34] | Feature maps |
| Prune iteratively | Taylor [32] | Feature maps |
| Prune by minimizing the reconstruction error | Channel pruning [35] | Feature maps (arg min of the reconstruction error) |
| Prune progressively | PSFP [36] | Weights ($\ell_2$ norm) |

Criteria based on weights
The L1 [33] pruning algorithm is a layer-by-layer method, meaning that it prunes the network one layer at a time. Its pruning criterion is simple and can be implemented using Algorithm 1.

Algorithm 1
  Initialize the model parameters M
  for each convolution layer do
    Calculate the l1-norm of each channel
    Select the N channels with the lowest l1-norm depending on the pruning rate
    Remove the N selected channels
  end for
  Retrain the pruned network
  Output: compact model with parameters M

The retraining can be done in two different ways:
1. Prune once over multiple layers and retrain (better suited to resilient layers)
2. Prune channels one by one and retrain each time (better suited to layers that are less resilient)

For this algorithm, we chose to mix the two retraining methods by pruning N channels before each retraining.

The second weight-based method presented is redundant channel pruning [67]. The idea of this method is to prune channels that are similar to the ones that are kept. To do so, the authors proposed to group the channels of a layer into $n_f$ clusters, depending on a similarity score being higher than a preassigned threshold $\tau$. To determine the similarity between these channels, the authors proposed to use the cosine similarity between the weights of the channels.
The similarity $SIM_C$ of two output channels $W_i$ and $W_j$ is calculated as:

$$SIM_C(W_i, W_j) = \cos(\theta) = \frac{\langle W_i, W_j \rangle}{\|W_i\|_2 \, \|W_j\|_2} \tag{1}$$

This gives us the ability to determine the similarity between two channels by calculating the cosine of the angle between two vectors of dimension $n$ (the flattened channel weights). The pruning of one specific layer can be done in two steps:
1. Group the channels in the same cluster if $\cos(\theta)$ from Eq. 1 is above the threshold $\tau$
2. Randomly sample one channel in each cluster and prune the remaining ones of each cluster
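The two steps above can be sketched with a simple greedy clustering. This is an assumption-laden illustration: the way clusters are formed (comparison against one representative per cluster) is a simplification, the first channel of each cluster is kept instead of a random one so that the example is deterministic, and all channels are assumed to have non-zero norm.

```python
import numpy as np


def redundant_channel_prune(w, tau):
    """Greedy clustering sketch: a channel joins the first cluster whose
    representative has cosine similarity above tau, else it starts a new one.
    Returns the indices of the representatives (one kept channel per cluster)."""
    flat = w.reshape(w.shape[0], -1)
    unit = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    reps = []
    for j in range(unit.shape[0]):
        if not any(unit[j] @ unit[r] > tau for r in reps):
            reps.append(j)
    return reps


# Four toy channels: 0 and 1 are nearly parallel, as are 2 and 3.
w = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0]]).reshape(4, 2, 1, 1)
kept = redundant_channel_prune(w, tau=0.9)
```

Raising `tau` towards 1 makes fewer channels count as redundant, so more channels are kept, which matches the role of the threshold as a compression ratio.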
The threshold $\tau$ acts as the compression ratio in this pruning algorithm, where a low threshold means a high compression rate and vice versa (see Algorithm 2).

Algorithm 2
  Initialize the model parameters M
  for each convolution layer do
    Calculate the similarity score for each pair of channels
    Separate the channels into two clusters based on threshold τ
    Select the N lowest l1-norm depending on the pruning rate
    Randomly sample one channel in each cluster and prune the remaining ones of each cluster
  end for
  Retrain the pruned network
  Output: compact model with parameters M

The third weight-based method presented is Auto-Balanced pruning [70], which uses the same pruning criterion as the L1 algorithm, namely the L1 norm of the weight kernels, to determine the ranking of the channels. However, this method adds a regularization term during training to transfer the representational capacity of the channels we want to prune to the remaining ones. In order to calculate this transfer of representational capacity, the authors proposed to separate the channels into two subsets at the beginning of each pruning iteration. In order to assign the channels to their subset, the authors used the L1 norm of the weights of the channels:

$$M_{i,j} = \|\mathrm{vec}(W_{i,j})\|_1$$

where the vec function flattens the weight tensor into a vector and $M_{i,j}$ is the metric measuring the importance. Here, we use the notation $W_{i,j}$ with $i$ representing the layer index and $j$ the output channel index.
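The L1-norm criterion shared by the L1 algorithm and Auto-Balanced pruning can be sketched for a single layer as follows. The function name, the pruning rate, and the toy weights are hypothetical; retraining and the adjustment of the following layer are left out.

```python
import numpy as np


def l1_prune_layer(w, prune_rate):
    """Drop the fraction `prune_rate` of output channels with the lowest l1-norm.
    w has shape (n_out, n_in, k, k); returns the pruned tensor and kept indices."""
    scores = np.abs(w).reshape(w.shape[0], -1).sum(axis=1)   # l1-norm per channel
    n_prune = int(prune_rate * w.shape[0])
    keep = np.sort(np.argsort(scores)[n_prune:])             # keep the largest norms
    return w[keep], keep


# Four toy channels filled with constant values, so their norms are easy to rank.
w = np.stack([np.full((2, 3, 3), v) for v in (0.5, -0.1, 2.0, 0.05)])
w_pruned, kept = l1_prune_layer(w, prune_rate=0.5)
```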
Once the L1 score has been calculated for each channel, the channels are assigned to one of the subsets depending on the threshold $\theta$, which is fixed depending on the desired number of remaining channels per layer. The channels in subset R (remaining) and subset P (to be pruned) are then adjusted with an L2 regularization term, weighted by an adjustment factor computed for each subset. The cost function for training is then changed according to Eq. 9, where $L_0$ represents the original cost function.
This enables the model to penalize the weak channels and stimulate the strong ones. This method adds two hyper-parameters to the training, $\alpha$ and $r$: $\alpha$ is the regularization factor, and the vector $r$ is the target number of remaining channels in each layer (see Algorithm 3).

Algorithm 3
  Initialize the model parameters M
  for each convolution layer do
    Divide the channels into the R (remain) subset and the P (prune) subset using the l1-norm
    Optimize the regularized cost function

The last weight-based method is progressive soft pruning [36], whose pruning criterion is similar to that of the L1 method but is based on the L2 norm of the weights. The main difference is that the authors proposed a pruning scheme that allows pruning during the fine-tuning step. They proposed to use soft pruning, which means that instead of removing the channels during pruning, their weights are set to 0 and these channels are allowed to be updated during the retraining phase. With this scheme, the model keeps its original dimensions during the retraining phase. The authors also proposed a progressive pruning scheme where, at each pruning iteration, the compression ratio is increased in order to obtain a more compact network. Once these iterations of pruning and retraining are completed, a last channel ranking is performed using the pruning criterion, and the lowest-ranked channels are discarded depending on the compression ratio. The pseudo-code for the progressive soft pruning scheme can be viewed in Algorithm 4. In the article, the L1 or L2 norm of the weights is used as the pruning criterion, which means that this method can be categorized as a weight-based method.

Algorithm 4
  for each epoch do
    for each layer i = 1, ..., L do
      Calculate the l2-norm of each channel
      Calculate the pruning rate P at this epoch using P_i and D
      Select the N channels with the lowest l2-norm depending on the pruning rate
      Zeroize the weights W of the selected channels
    end for
  end for
  Obtain the compact model with parameters M' from M
  Output: compact model with parameters M'
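The soft pruning step can be sketched as follows: the selected channels are zeroed but the tensor keeps its shape, so a later update can change them again. The constant added at the end is only a stand-in for a gradient update, and the pruning rate and toy weights are hypothetical.

```python
import numpy as np


def soft_prune(w, prune_rate):
    """Zero the output channels with the lowest l2-norm but keep the tensor shape,
    so that the zeroed channels can still be updated by later training steps."""
    norms = np.sqrt((w.reshape(w.shape[0], -1) ** 2).sum(axis=1))
    n_prune = int(prune_rate * w.shape[0])
    zeroed = np.argsort(norms)[:n_prune]
    w = w.copy()
    w[zeroed] = 0.0
    return w, np.sort(zeroed)


w = np.stack([np.full((2, 3, 3), v) for v in (0.5, -0.1, 2.0, 0.05)])
w_soft, zeroed = soft_prune(w, prune_rate=0.5)
w_soft += 0.01   # stand-in for a training update: zeroed channels are not frozen
```

In a progressive scheme, `prune_rate` would be increased from one pruning iteration to the next, and the channels would only be physically removed after the last iteration.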
In Algorithm 4, $L$ represents the number of layers in the model, $i$ the layer number, $W$ the weights of a channel, and $N$ the number of channels to prune. The pruning rate $P$ is calculated at each epoch using the pruning rate goal $P_i$ for the corresponding layer $i$ and the pruning rate decay $D$. The pruning rate is calculated using Eq. 11, whose $a$, $b$, and $k$ values can be obtained by solving Eq. 12.

FPGM [68] is a technique that uses a geometric median to prune away output channels. A geometric median is defined as follows: given a set of $n$ points $A = [a_1, a_2, \dots, a_n]$ with $a_i \in \mathbb{R}^d$, find a point $x^* \in \mathbb{R}^d$ that minimizes the sum of the Euclidean distances to them:

$$x^* = \arg\min_{x \in \mathbb{R}^d} \sum_{i=1}^{n} \|x - a_i\|_2 \tag{13}$$

Using Eq. 13, a geometric median $W^{GM}_i$ for all the filters of a layer $i$ can be found:

$$W^{GM}_i = \arg\min_{x \in \mathbb{R}^{n_{in} \times k \times k}} \sum_{j=1}^{n_{out}} \|x - W_{i,j}\|_2 \tag{14}$$

In order to select non-important output channels, the authors proposed to find the channels whose values are the same as, or similar to, $W^{GM}_i$. Since computing the geometric median is a non-trivial and computationally intensive problem, the authors propose to relax the problem, which yields a modified form of Eq. 14. The algorithm of FPGM is summarized in Algorithm 5.

Play and Prune [69] is an adaptive output channel pruning technique that, instead of focusing on a criterion, tries to find an optimal number of output channels that can be pruned away given an error tolerance rate. This technique is a min-max game between two modules, the Adaptive Filter Pruning (AFP) module and the Pruning Rate Controller (PRC). The goal of the AFP is to minimize the number of output channels in the model, while the PRC tries to maximize the accuracy of the remaining set of output channels. This technique considers that a model M can be partitioned into a set of important channels I and a set of unimportant channels U.

Algorithm 5 Algorithm Description of FPGM
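A minimal sketch of a geometric-median-based selection is given below. It assumes one common relaxation, in which the candidate points are restricted to the channels of the layer themselves, so that the channels with the smallest summed distance to all other channels are treated as closest to the geometric median and pruned. This is an illustration under that assumption, not a transcription of Algorithm 5.

```python
import numpy as np


def fpgm_select(w, n_prune):
    """Return indices of the n_prune output channels whose summed Euclidean
    distance to all other channels of the layer is smallest, i.e. the channels
    closest to the (approximate) geometric median."""
    flat = w.reshape(w.shape[0], -1)
    dist = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    return np.sort(np.argsort(dist.sum(axis=1))[:n_prune])


# Toy layer with four scalar channels; channels 1 and 2 sit in the middle.
w = np.array([[0.0], [1.0], [1.1], [5.0]]).reshape(4, 1, 1, 1)
pruned = fpgm_select(w, n_prune=2)
```

Unlike a norm-based criterion, this selection can remove channels with large norms when they are well represented by the other channels of the layer.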
$U_i$ represents all the unimportant channels of a layer $i$. It is obtained by sorting the output channels by their L1 norm and selecting the α% of channels with the lowest norm. Once $U_i$ is selected, the authors propose to add a penalty to the original loss function in order to prune without loss of accuracy while helping the pruning process. In the resulting loss function, $C(\cdot)$ is the original cost function used to optimize the original model parameters, and $\lambda_A$ is the $L_1$ regularization constant. While this optimization helps push the channels towards a zero sum of absolute weights, it can take several epochs; therefore, the authors propose an adaptive weight threshold ($W_i$) for each layer $i$. Any channel with an L1 norm below this threshold is removed. While this value is given by the PRC, for the first epoch it is found by using a binary search on the histogram of the sum of absolute weights. The AFP minimizes the number of output channels in the model using Eq. 19 and is summarized in Algorithm 6. For the PRC, the adaptive threshold $W_A$ is updated using $\delta_w$, a constant used to increase or decrease the pruning rate. The update depends on a quantity $T_r$ calculated from $\xi$, the accuracy of the unpruned network, $\epsilon$, the tolerance error, and $C(\#w)$, the accuracy of the model with the remaining filters $\#w$. The regularization constant is also recomputed, starting from $\lambda$, the initial regularization constant. By alternating between the AFP and the PRC, the authors propose a system that prunes at each epoch in an adaptive and iterative way (see Algorithm 6).

Algorithm 6
  for each convolution layer do
    Select α% of the output channels with the lowest l1-norm

Criteria based on feature maps
For the channel-based approach, we present three algorithms: ThiNet [71], Channel Pruning [35], and Entropy Pruning [34]. The first two share the same idea, minimizing the difference in the activation maps, but differ in their minimization technique. ThiNet's goal is to find a subset of channels that minimizes the difference in the output at layer i+1 (feature map). ThiNet uses a greedy algorithm to find which subset of channels to eliminate while keeping the input of layer i+2 almost intact. At each step, the algorithm computes a value for each channel in the layer and adds the channel with the lowest value to the pruning subset. This is repeated until the pruning subset satisfies the defined compression ratio. The input of the feature map in layer i+2 is calculated with Eq. 24.
where i represents the layer, j the channel index, and k the kernel size of the channel. To compute the value of a channel, the authors propose to use Eq. 25, where x is equal to $W_{i+1,j}$ in Eq. 24. This greedy method is repeated for each layer to be pruned in the model. The Channel Pruning method also aims to minimize the difference in the output (feature map), but it finds the subset of channels with a LASSO regression.
β is the channel mask that decides whether a channel is pruned or not: if β is zero, the channel is no longer used. The compression ratio is defined with λ. n represents the number of channels and n′ the number of remaining channels. During the pruning iterations, W in Eq. 26 is fixed, which leaves β as the only variable to minimize. The LASSO regression is used to find the mask β that minimizes the difference in the output. As with ThiNet, these steps must be repeated for every layer to be pruned.
The entropy pruning [34] method is also a layer-by-layer algorithm, but instead of minimizing the difference in the output like the two channel methods above, it uses a different criterion based on the entropy of the feature maps produced by the channels. The idea is that a channel whose feature maps have a low entropy is likely to be less important for the decision of the network. With the entropy criterion defined in Eq. 27, pruning for layer i is done according to Algorithm 7.

Taylor's [32] pruning algorithm seeks to minimize the change in the cost function caused by pruning. It approximates the change in this function when a channel is pruned, where C represents the cost function and D the dataset; $C(D, H_{i,j} = 0)$ is the cost value if channel $H_{i,j}$ is pruned. The idea is to find a subset of channels $H_{i,j}$ to prune while minimizing the difference with the original cost function, where these channels were used. This is expressed as the difference between the cost function with the channels excluded and the cost function with the channels included. Using a Taylor expansion to solve this minimization, the authors found that the difference in the cost function with the channels pruned can be approximated with the activation (feature map) and the gradient of the channel, which can be calculated during back-propagation.

Algorithm 7 (entropy pruning, surviving steps):
5: Run all the data through the network and collect features for each convolution layer
6: for each convolution layer do
7: Convert the activation maps into a vector of dimension $n_{out}$ (number of channels) using a global average pooling
8: For each channel j, divide the distribution into m bins and calculate the entropy using Eq. 27
9: Remove the channels with the lowest entropy value according to the pruning rate
10: end for
11: Output: Compact model with parameters M
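Steps 7 to 9 of Algorithm 7 can be illustrated with the short sketch below. Since Eq. 27 is not reproduced here, the sketch assumes the usual Shannon entropy of the binned, globally average-pooled activations; the function names and the toy data are hypothetical.

```python
import math


def channel_entropy(values, num_bins):
    """Shannon entropy of one channel's pooled activations over a dataset."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant response carries no information
    counts = [0] * num_bins
    for v in values:
        # Map the value to a bin; the maximum falls into the last bin.
        b = min(int((v - lo) / (hi - lo) * num_bins), num_bins - 1)
        counts[b] += 1
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts if c)


def lowest_entropy_channels(pooled, num_bins, prune_rate):
    """pooled[s][j]: average-pooled activation of channel j for sample s."""
    n_channels = len(pooled[0])
    entropies = [
        channel_entropy([row[j] for row in pooled], num_bins)
        for j in range(n_channels)
    ]
    count = int(n_channels * prune_rate)
    order = sorted(range(n_channels), key=lambda j: entropies[j])
    return sorted(order[:count])


# Channel 0 varies across the four samples, channel 1 is constant.
pooled = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
print(lowest_entropy_channels(pooled, num_bins=4, prune_rate=0.5))  # [1]
```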
Each channel ranking value is normalized using an l2-normalization. This normalization is applied to each layer individually in order to facilitate the comparison between layers, since this method ranks channels across all layers (see Algorithm 8).

Algorithm 8 (surviving steps):
6: Evaluate the importance of output channels using X and Eq. 29
7: for each convolution layer do
8: Remove the channel with the least importance
9: end for
10: Fine-tune the pruned network
11: if Stopping condition is True
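The ranking step of the Taylor method can be sketched as follows. Eq. 29 is not reproduced in the text, so the sketch assumes the first-order criterion commonly associated with [32], namely the absolute value of the mean product of activation and gradient over a feature map, followed by the per-layer l2-normalization mentioned above. The layer names, function names, and numerical values are illustrative.

```python
import math


def taylor_scores(activations, gradients):
    """|mean(activation * gradient)| for every channel of one layer."""
    scores = []
    for act, grad in zip(activations, gradients):
        mean_product = sum(a * g for a, g in zip(act, grad)) / len(act)
        scores.append(abs(mean_product))
    return scores


def l2_normalize(scores):
    """Rescale the scores of one layer so that they are comparable across layers."""
    norm = math.sqrt(sum(s * s for s in scores))
    return [s / norm for s in scores] if norm > 0 else list(scores)


def least_important(layers):
    """layers: {name: (activations, gradients)} -> (layer name, channel index)."""
    best = None
    for name, (acts, grads) in layers.items():
        for j, score in enumerate(l2_normalize(taylor_scores(acts, grads))):
            if best is None or score < best[0]:
                best = (score, name, j)
    return best[1], best[2]


layers = {
    "conv1": ([[1.0, 1.0], [2.0, 2.0]], [[0.1, 0.1], [0.3, 0.3]]),
    "conv2": ([[1.0, 1.0], [1.0, 1.0]], [[0.5, 0.5], [0.01, 0.01]]),
}
print(least_important(layers))  # ('conv2', 1)
```

Because the ranking is global, a single iteration may remove several channels from one layer and none from another, which is the imbalance discussed in the critical analysis.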

Output-based
The second output-based method is the Neuron Importance Score Propagation (NISP) [72], where pruning is done by back-propagating channel scores across the model to determine which channels to prune. The intuition is to apply a feature ranking method to the last layer before the classification, since this layer plays the most significant role in the application. Once every feature has an associated score, the authors propose to back-propagate these scores through the network to obtain an importance score for each channel. The importance score is then used to determine which channels are pruned and which are retained, using a predefined compression ratio for each layer in the model. The score is back-propagated using Eq. 30.
where W denotes the weights, j the neuron or channel, i the layer, and k the number of connections from that neuron to the next layer. This equation represents a weighted sum of the scores in the subsequent layers.
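The weighted sum described above can be illustrated with a small sketch. It assumes that the weights enter the sum through their absolute values, which is an assumption of this example since Eq. 30 is not reproduced in the text; the function names and toy values are hypothetical.

```python
def propagate_scores(weights, next_scores):
    """Back-propagate importance scores by one layer.

    weights[k][j]: weight connecting neuron j of the current layer to
    neuron k of the next layer; next_scores[k]: score of neuron k.
    """
    n_current = len(weights[0])
    return [
        sum(abs(weights[k][j]) * next_scores[k] for k in range(len(weights)))
        for j in range(n_current)
    ]


def channels_to_keep(scores, num_keep):
    """Indices of the `num_keep` channels with the highest scores."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:num_keep])


weights = [[1.0, -2.0, 0.0],   # connections into next-layer neuron 0
           [0.5, 0.0, 0.0]]    # connections into next-layer neuron 1
scores = propagate_scores(weights, [0.2, 0.8])
print(channels_to_keep(scores, 2))  # [0, 1]
```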

Critical analysis of pruning methods
The main difference between methods using weight-based criteria and methods using feature map criteria is that the former do not depend on a dataset, since weight statistics do not depend on the output of a CNN. Methods based on feature maps need a dataset in order to compute the output of the convolution layers or its gradients.
The choice of criterion usually reflects a trade-off: simple criteria simplify the pruning steps at the expense of a lower accuracy, whereas more complex criteria that require more computations preserve a higher level of accuracy. If training and pruning time is an issue, e.g., in applications with design constraints that require fast deployment, simple criteria like the L1 and L2 norms are more suitable. However, if there are no complexity constraints, more complex pruning criteria, like minimizing the difference in activations or in the cost function, can outperform the simpler criteria at the expense of more computations and time.
The techniques also differ in how channels are pruned: some prune layer-by-layer [33,36], while others prune across layers [32]. One difference between the two is the balance of the pruning. Across-layer pruning does not prune each layer evenly, and may prune lower-level layers more than higher-level layers, or vice versa. Depending on the CNN architecture and pruning algorithm, pruning across layers may not yield the desired reduction. Layer-by-layer pruning guarantees that all the layers are pruned and therefore undergo a more even reduction.
Recently, some techniques [36,68] have adopted a soft pruning approach. Soft pruning differs from hard pruning in that it only resets pruned channels to zero instead of completely removing them. Therefore, soft-pruned channels have a chance to recover. These techniques have been shown to achieve state-of-the-art performance. Table 3 summarizes a comparative experimental analysis of different pruning techniques. All the results reported in this table are taken from the corresponding papers. They indicate that VGG16 processing can be accelerated by up to 2.5 times at the expense of a 1% increase in error on the ImageNet dataset. Comparing the L1 [33] and Auto-Balanced [70] techniques, both based on weight-based criteria, we observe that Auto-Balanced can obtain higher compression ratios because of its regularization term. Comparing weight-based and channel-based approaches, we observe that channel-based approaches provide a higher compression while maintaining similar accuracy. As for the comparison between output-based and channel-based approaches, ThiNet [71] outperforms Taylor [32] in accuracy and complexity.
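The distinction between hard and soft pruning made earlier in this section can be illustrated with the toy sketch below, in which a layer is a list of channels and a plain gradient step stands in for retraining. The function names, the gradient values, and the learning rate are assumptions of the example.

```python
def hard_prune(layer, pruned):
    """Remove the pruned channels: the layer becomes narrower for good."""
    return [list(ch) for j, ch in enumerate(layer) if j not in pruned]


def soft_prune(layer, pruned):
    """Zero the pruned channels but keep their slots in the layer."""
    return [[0.0] * len(ch) if j in pruned else list(ch)
            for j, ch in enumerate(layer)]


def sgd_step(layer, grads, lr):
    """One plain gradient step, standing in for a retraining update."""
    return [[w - lr * g for w, g in zip(ch, gch)]
            for ch, gch in zip(layer, grads)]


layer = [[0.5, -0.2], [0.01, 0.02], [1.0, 0.3]]
hard = hard_prune(layer, {1})
soft = soft_prune(layer, {1})
print(len(hard), len(soft))  # 2 3

# A later update can move a soft-pruned channel away from zero again.
grads = [[0.0, 0.0], [-1.0, 2.0], [0.0, 0.0]]
recovered = sgd_step(soft, grads, lr=0.1)
print(recovered[1])  # [0.1, -0.2]
```

With hard pruning, channel 1 no longer exists and cannot receive updates, whereas the soft-pruned channel is simply re-ranked at the next pruning step.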

Design scenarios with pruning
Most DL models for person ReID use pretraining, and then fine-tune the model to the task or target application domain. Pretraining is typically performed using a large-scale dataset in order to prime CNN parameters towards relevant optimization solutions. In many cases, CNNs are pretrained on ImageNet since this public dataset has a large amount of diverse training samples from different classes which improves the CNN capacity to generalize. In person ReID, pretrained models have proven to be more successful than models that were trained from scratch directly on the task dataset.
Once the model has been pretrained, the next step is fine-tuning, which maps the model's parameters from the pretraining source domain to the target application domain. It is crucial that the task dataset be similar to the pretraining data. As described in [73], the best fine-tuning practices depend on the size of the task training dataset, and on the difference in data distribution between the pretraining and task domain data. The authors propose to compute a similarity score between the pretraining and task datasets in order to guide the fine-tuning from the source domain to the target domain. They measure this similarity with the cosine distance and the maximum mean discrepancy (MMD). In particular, they average the feature embeddings of each dataset and calculate the metrics between the two resulting vectors. Given these metrics and the number of samples per class in the target domain dataset, the authors propose to either train the whole network or freeze the feature extractor and fine-tune the classifier. We follow their guidelines to determine the layers to freeze and to fine-tune.

Pruning neural networks can be done in both main training phases, pretraining and fine-tuning. We identified four possible scenarios for pruning (as shown in Fig. 2). The first scenario consists in pruning a CNN on the source pretraining dataset. The idea is to leverage a large-scale dataset to guide the selection of the more relevant and discriminant source domain channels. The second scenario consists in pruning on the source pretraining dataset, fine-tuning until the model provides a suitable performance, and then pruning again on the target application dataset. This strategy allows removing additional channels that do not contribute to the task. The third scenario consists in pruning only on the task dataset, after fine-tuning on the target application dataset. The objective of this scenario is to reduce training time, since pruning and retraining on a large-scale source domain dataset can be time consuming. Finally, the last scenario consists in pruning on the task dataset before fine-tuning. This scenario also aims to reduce the design time of the model. In Section 5, we seek to determine the best scenario to reduce the computational complexity of CNNs, while maintaining a comparable level of accuracy on our task.
The progressive soft pruning (PSFP) method is an interesting alternative since the model is pruned during the fine-tuning steps. This pruning scheme can reduce training effort since it combines pruning, retraining, and fine-tuning into a single step. In Fig. 2, progressive soft pruning would be represented by combining the prune and retrain processes into one box for Scenario 1. For Scenarios 2 and 3, fine-tuning, pruning, and retraining would be combined into one box. As for Scenario 4, PSFP is not applicable: since pruning, retraining, and fine-tuning form a single step, it is impossible to first prune the network by ranking the channels with the target data and then fine-tune it.
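For reference, the four scenarios and the applicability of PSFP can be written down as a small lookup structure. This is only a schematic reading of the description above: the stage labels are ours, every scenario is assumed to start from a model pretrained on the source dataset, and the retraining stage of Scenario 4 is left out because the text does not detail it.

```python
# Ordered stages of each design scenario (labels are illustrative).
SCENARIOS = {
    1: ["prune+retrain on source", "fine-tune on target"],
    2: ["prune+retrain on source", "fine-tune on target",
        "prune+retrain on target"],
    3: ["fine-tune on target", "prune+retrain on target"],
    4: ["prune on target", "fine-tune on target"],
}

# PSFP merges pruning, retraining, and fine-tuning, so it cannot
# reproduce Scenario 4, where pruning precedes fine-tuning.
PSFP_APPLICABLE = {1: True, 2: True, 3: True, 4: False}


def needs_source_pruning(scenario):
    """True when the scenario prunes on the large-scale source dataset."""
    return any("prune" in s and "source" in s for s in SCENARIOS[scenario])


for number in sorted(SCENARIOS):
    print(number, needs_source_pruning(number), PSFP_APPLICABLE[number])
```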

Experimental methodology
In this section, we present the experimental methodology used to validate the pruning methods. Our experiments are divided into two main parts. First, we experiment on a large-scale dataset, i.e., ImageNet, in order to find the best pruning methods under the same experimental protocol. Second, we test the pruning algorithms on a person ReID problem to assess the advantage of using a pruned model over a smaller model. The following subsections present the experimental methodology, namely the datasets, the evaluation metrics, and the experimental protocol. The results for pruning on the ImageNet and ReID datasets are also presented.

Datasets
Four publicly available datasets are considered for the experiments, namely ImageNet [74], Market-1501 [37], DukeMTMC-reID [39], and CUHK03-NP [38]. ImageNet, a large-scale dataset, is used as the pretraining dataset, and the other (small-scale) datasets are used for the person ReID experiments.
• ImageNet (ILSVRC2012) [74] is composed of two parts. The first part is used for training the model and the second part is used for validation/testing. There are 1.2M images for training and 50k for validation. The ILSVRC2012 dataset contains 1000 classes of natural images.
• Market-1501 [37] is one of the largest public benchmark datasets for person ReID. It contains 1501 identities captured by six different cameras, and 32,668 pedestrian image bounding boxes obtained using the Deformable Part Models (DPM) pedestrian detector. Each person has 3.6 images on average at each viewpoint. The dataset is split into two parts: 751 identities are used for training and the remaining 750 identities are used for testing. We follow the official testing protocol where 3368 query images are selected as a probe set to find the correct match across 19,732 reference gallery images.
• CUHK03-NP [38] consists of 14,096 images of 1467 identities. Each person is captured using two cameras on the CUHK campus and has an average of 4.8 images in each camera. The dataset provides both manually labeled bounding boxes and DPM-detected bounding boxes. In this paper, experimental results on both the "labeled" and "detected" data are presented. We follow the new training protocol proposed in [75], which is similar to the partitioning of the Market-1501 dataset. The new protocol splits the dataset into training and testing sets, which consist of 767 and 700 identities, respectively.
• DukeMTMC-reID [39] is constructed from the multi-camera tracking dataset DukeMTMC. It contains 1812 identities. We follow the standard splitting protocol proposed in [76], where 702 identities are used as the training set and the remaining 1110 identities as the testing set.

Pruning methods
For our experiments, we compare five pruning methods in order to determine which technique gives the best compression ratio while maintaining good performance on the person ReID task. Our choice was based on the following criteria: the results reported in the corresponding articles, the coverage of most families of the taxonomy, and the complexity of the ranking and of the implementation. We selected L1 [33] and Entropy [34] because they prune the network only once and then fine-tune it. Although Taylor [32] relies on iterative pruning, we chose this method because of its theoretical grounding and because it requires a single compression ratio. We chose the Auto-Balanced algorithm [70] because pruning is done by adding regularization terms to the original loss function, which leaves the pruning process to the optimization. Finally, we included the Progressive Soft Pruning [36] method since it prunes progressively during training, starting from scratch, which makes it a suitable test on our target operational domain.

Implementation details
For the Triplet-based ReID method, images are resized to 256 × 128 for all the datasets. For PCB [43] architectures, images are resized to 384 × 128. Like many state-of-the-art ReID approaches [3, 5–8, 43], we use ResNet50 [11] as the backbone architecture, where the final layer is removed to obtain a 2048-dimensional feature representation. We apply all the pruning methods to the ResNet50 architecture. In order to compare the four algorithms more easily, we defined a pruning schedule that is similar for all the methods. First, we prune around 5% of the total number of channels at each iteration. For the layer-by-layer methods, we use a single compression rate for every layer in order to simplify our experiments and the comparison between the methods. For each pruning iteration, we use 1 epoch for ranking the channels and 4 epochs for retraining before moving to the next iteration.
This pruning schedule was used for every method on ImageNet to produce the pruned models used in the person ReID experiments. We discarded the pruning iterations where the accuracy was too low, since there was no advantage in using these networks for our task. Once pruning was done for every method, we retrained every model on ImageNet to recover the accuracy lost through pruning. Each pruned model was then fine-tuned on the ReID datasets. We also fine-tuned pretrained ResNet18 and ResNet34 models on these ReID datasets in order to compare pruned models against shallower networks.

Performance metrics
Following the common evaluation practice [3, 5–8], we use the rank-1 accuracy of the cumulative matching characteristics (CMC) and the mean average precision (mAP) to evaluate the ReID accuracy. The CMC represents the expectation of finding a correct match in the top n ranks. When multiple ground truth matches are available, the CMC cannot measure how well all the gallery images are ranked. Thus, we also report the mAP scores.
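The two accuracy measures can be illustrated with the simplified sketch below, which takes, for each query, the gallery sorted by decreasing similarity with a flag marking the correct matches. It is a didactic approximation: the filtering of gallery images from the same camera as the query, which standard ReID protocols usually apply, is omitted, and the function name and toy rankings are hypothetical.

```python
def rank1_and_map(rankings):
    """Rank-1 accuracy and mAP from per-query ranked match flags.

    rankings: one list of booleans per query; the gallery is sorted by
    decreasing similarity and True marks an image of the query identity.
    """
    rank1_hits = 0
    ap_sum = 0.0
    for matches in rankings:
        if matches[0]:
            rank1_hits += 1
        hits = 0
        precisions = []
        for position, is_match in enumerate(matches, start=1):
            if is_match:
                hits += 1
                precisions.append(hits / position)
        # Average precision of this query (0 if it has no match at all).
        ap_sum += sum(precisions) / len(precisions) if precisions else 0.0
    n = len(rankings)
    return rank1_hits / n, ap_sum / n


rankings = [
    [True, False, True],   # first match at rank 1, second at rank 3
    [False, True, False],  # only match at rank 2
]
rank1, mean_ap = rank1_and_map(rankings)
print(rank1, round(mean_ap, 4))  # 0.5 0.6667
```

The first query counts as a rank-1 hit although its second match is only ranked third; the average precision (5/6 for this query) is what captures that difference.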
As in state-of-the-art pruning methods [33–36], the number of FLOPS is used to measure the model's complexity in terms of computational operations. To compare the different models in our experiments, we calculate the number of FLOPS necessary to process one image through the model. We compare FLOPS rather than processing time since the latter depends on the hardware used. The number of FLOPS is also a better metric than the number of pruned channels: pruning a channel at the beginning of the network reduces the total number of FLOPS considerably more than pruning a channel in a later layer, since the spatial dimension of the feature maps is reduced throughout the network. We also use the number of parameters to compare models in terms of the memory needed to store the trained model. This metric is calculated by summing the number of weights throughout the model.

Results with pruned networks provide higher accuracy than the smaller ResNet18 while having similar computational complexity and memory consumption. For person ReID datasets, we attempt to preserve the same pruning compression ratio of 50% for comparison. This ratio is the highest compression level that keeps the difference between the results of the baseline and of the pruned models small. For layer-by-layer methods, we also prune the same number of filters per layer and use the same number of pruning and fine-tuning iterations. For across-layer methods, the number of filters pruned per layer is not the same: 5% of the filters are pruned at each iteration, and the stopping condition is reached when 50% of the filters have been pruned away. Table 5 reports the results for the Market-1501, DukeMTMC-reID, and CUHK03-NP datasets. The reported results are for all the scenarios. Taylor has a higher number of FLOPS and parameters than the other methods, which would probably lead to a slower model and a higher memory consumption. Out of the five methods, L1 seems to work best, obtaining the best or close to the best results on the three datasets.
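The remark that a channel pruned early in the network saves more operations than one pruned in a later layer can be checked with a simple count of multiply-accumulate operations (MACs) for a convolution layer. The sketch ignores biases and the knock-on saving in the following layer, and the two layer shapes are illustrative values in the style of a ResNet, not measurements from our models.

```python
def conv_macs(h_out, w_out, c_in, c_out, kernel):
    """Multiply-accumulate operations of one convolution layer."""
    return h_out * w_out * c_out * (kernel * kernel * c_in)


def saving_per_output_channel(h_out, w_out, c_in, c_out, kernel):
    """MACs saved in the layer itself when one output channel is removed."""
    return (conv_macs(h_out, w_out, c_in, c_out, kernel)
            - conv_macs(h_out, w_out, c_in, c_out - 1, kernel))


# Early layer: large feature map, few channels (hypothetical shape).
early = saving_per_output_channel(56, 56, 64, 64, 3)
# Late layer: small feature map, many channels (hypothetical shape).
late = saving_per_output_channel(7, 7, 512, 512, 3)
print(early, late)  # 1806336 225792
```

In this example, removing one channel from the early layer saves eight times more operations than removing one from the late layer, even though the late layer has eight times more input channels.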

Pruning on pretraining data
The pruned models also show a small drop in accuracy while considerably reducing the number of FLOPS and parameters. Pruned models are faster than the backbone ResNet50 network while having similar performance (within around 1%). In addition, the pruned models have a number of FLOPS and parameters similar to ResNet18 while obtaining better results on the three performance metrics. This suggests that pruning a larger model is more advantageous than using a shallower model like ResNet18.
To give a more global view of these results, the plots in Fig. 3 show visually which models are better, where the optimal placement is top right and the worst is bottom left. There are two plots for each dataset: the first presents mAP vs. FLOPS and the second presents Rank-1 vs. parameters.

Pruning on target application data with weak ReID baseline
The objective of this experiment is to analyze and compare the pruning techniques with a weak ReID baseline such as TriNet [3]. Table 6 reports the experimental evaluations of all the scenarios. For a fair comparison, we keep the compression ratio at 5% of the total number of channels. Our Scenario 2 results are produced using the L1 (HaoLi) [33] iteration 3 model as the model pruned on the pretraining dataset. Using the same model for all the considered techniques gives a better idea of which pruning technique is best when pruning directly on the task dataset.
As we can observe in Table 6, pruning directly on the target operational domain does not perform as well as pruning the same model with the large-scale pretraining dataset. We can make the following observations from these results: (1) pruning and fine-tuning should be done on the same domain, as in Scenario 1 and Scenario 3, whether it is the source or the target operational domain; (2) the lack of data in the target domain limits the ability of the pruned model to recover the information lost by pruning the weak channels; (3) with the large-scale source dataset and the L1 method, we were able to prune our model to the same number of FLOPS as Scenario 2 (2.09 GFLOPS), but our Rank-1 accuracy was 81.95% instead of 70.67%. The L1 method also seems better suited to pruning directly on the task dataset than Taylor and Entropy. This might be explained by the small number of samples per person: the Taylor and Entropy approaches use a subset of samples to determine which channels to prune, whereas the L1 method ranks the channels by their weights; (4) Scenario 4 is not viable since the performance of all methods drops drastically; (5) the Auto-Balanced and PSFP techniques seem to outperform the other methods. This could be explained by the fact that Auto-Balanced modifies the loss in order to transfer the information of the pruned channels to the remaining ones. This scheme seems to help considerably when the number of samples is limited. PSFP seems to be the algorithm best suited to pruning models on a limited dataset. This can probably be explained by the fact that the pruned channels are only set to zero, which preserves the model architecture and allows certain soft-pruned channels to recover during the fine-tuning phase. Our results with PSFP are also very similar to the ones obtained with the Scenario 1 scheme, where we prune our models on the large-scale source dataset and then fine-tune on our task domain dataset. The great advantage of this method is that we can prune and fine-tune our models in the same step. In addition, we skip the slow step of pruning on the very large ImageNet dataset.
To compare the scenarios further, we used two compression ratios, corresponding to around half (C1) and around one-third (C2) of the FLOPS of the original ResNet50 model. For the first compression ratio, the Scenario 2 model uses the second-iteration L1 model as the model pruned on ImageNet. For the second compression ratio, it uses the third iteration. The results of these experiments are reported in Table 7. Table 7 shows that Scenario 1 is the best one, since its results outperform the others for every method and every dataset. As for the comparison between Scenarios 2 and 3, it is hard to conclude which one is better, since Scenario 2 can be carried out with many configurations to reach a model similar to the one in Scenario 3: we could prune either more or less on the large-scale source dataset. Scenario 2 results are also affected by the choice of the model pruned on the large-scale source set. With the first compression ratio, Scenario 2 using the second iteration of the L1 method yields a lower recognition accuracy than pruning only on the target operational dataset (Scenario 3). With the second compression ratio, which uses the third iteration, our Scenario 2 results are better than our Scenario 3 results.

Pruning on target application data with strong ReID baseline
The results shown in Table 7 indicate that PSFP is so far the best performing pruning approach in most scenarios. Additionally, PSFP is well suited to deploying a compressed model since training can be done while the pruning is applied, so we apply this technique to a strong ReID baseline. The aim of this experiment is therefore to analyze the effectiveness of the pruning techniques using the strong ReID baseline PCB [43]. We present two experimental analyses with PSFP pruning of PCB architectures. The first experiment shows the effect on ReID accuracy of pruning only the backbone feature extractor (indicated as PCB (BFE)), while the second one also considers all the layers that follow the backbone architecture (i.e., local convolutional layers and fully connected layers), which are used for feature compression and for classification. In addition to the original ResNet [11], we also performed experiments with SE-ResNet [78] to assess the effectiveness of pruning methods on different backbone CNNs. Experimental results for the strong baseline PCB are reported in Table 8 for both the Market-1501 and DukeMTMC-reID datasets. Our results are consistent with the initial claims: the number of FLOPS and parameters required by PCB's ResNet and SE-ResNet architectures is reduced by half, while maintaining a comparable rank-1 accuracy for both ReID datasets. Results also suggest that PSFP pruning of the local convolutional layers and FC layers has little effect on ReID accuracy, as the margin between PSFP+PCB(BFE) and PSFP+PCB(BFE+LC+FC) is small. This analysis implies that it is worth pruning the backbone architecture together with the local convolutional and FC layers, since this allows a larger memory and parameter reduction. It is worth noting that the decline in mAP is larger than that in rank-1 accuracy for both backbones and on all ReID datasets.

Filter selection criteria
As part of the ablation study, this experiment aims to analyze the effect of magnitude-based filter selection criteria, such as the $l_p$-norm, on ReID accuracy. We conducted this experiment with PSFP+PCB(BFE) on ResNet50. Table 9 shows a comparative ReID performance analysis between the $l_1$-norm and the $l_2$-norm. It can be observed from Table 9 that the ReID performance of the $l_2$-norm criterion is marginally better than that of the $l_1$-norm criterion. This is likely due to the largest elements being dominant in the $l_2$-norm. As a consequence, the filters with the largest weights are preserved while pruning, and they provide more discriminative features for better recognition accuracy.
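The effect of the largest element on the two criteria can be seen on a toy example with two filters, one with a single large weight and one with several moderate weights. The filters and function names are hypothetical and only serve to show that the two norms can rank the same filters differently.

```python
def lp_norm(filt, p):
    """l_p norm of a flattened filter."""
    return sum(abs(w) ** p for w in filt) ** (1.0 / p)


def pruning_order(filters, p):
    """Filter indices sorted from first pruned (smallest norm) to last."""
    return sorted(range(len(filters)), key=lambda j: lp_norm(filters[j], p))


filters = [
    [0.9, 0.0, 0.0, 0.0],  # one dominant weight
    [0.3, 0.3, 0.3, 0.3],  # several moderate weights
]
print(pruning_order(filters, 1))  # [0, 1]: l1 prunes the peaked filter first
print(pruning_order(filters, 2))  # [1, 0]: l2 keeps it and prunes the flat one
```

Under the $l_1$-norm the peaked filter scores 0.9 against about 1.2, whereas under the $l_2$-norm it scores 0.9 against about 0.6, so the $l_2$ criterion preserves the filter holding the largest weight.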

Varying pruning rates
The objective of this experiment is to observe ReID performance when varying pruning rates. It was performed with PSFP+PCB(BFE). Figure 4a and b show the mAP and rank-1 accuracy obtained with varying pruning rates, respectively. With both measures, the accuracy of the pruned model drops at an increasing rate as the pruning rate grows. For pruning rates between 0 and 25%, the accuracy of the pruned model drops marginally. Pruning rates above 50% lead to a drastic decline in ReID performance. When pruning a larger number of filters, the loss of information affects accuracy considerably.

Table 9 Comparison of network accuracy (mAP and rank-1) and complexity (M: Parameters and T: GFLOPS) with different pruning criteria of the PSFP approach on all the person ReID datasets. BFE, backbone feature extractor

To further analyze pruning on the target operational domain, we apply the best fine-tuning practices proposed in [73], as presented in Section 3.4. We calculated their metrics for ImageNet and Market1501 and obtained 0.005 for the cosine distance and 2.45 for the MMD. With these metrics and the fact that Market1501 has fewer than 20 samples for each class, the authors of [73] recommend freezing the feature extractor during fine-tuning to avoid overfitting, since our task dataset is small and close to our large-scale pretraining dataset. Since the problem of a small dataset appeared during the retraining phase of the pruned network, we decided to prune one layer at a time and freeze the others during the retraining phase. The goal of this strategy is to force the pruned layers to relearn the lost information while keeping the other layers in the same optimal region as the baseline model. This method was tried for Scenario 2 with the L1 method. We pruned layer 5 of the ResNet50 while freezing the rest of the network. The model was pruned to 2.61 GFLOPS and the rank-1 accuracy was 76.10%. This experiment shows that we could limit the effects of pruning by using a layer-by-layer approach and freezing the other layers to regain the accuracy. The problem with this scheme is that it is not time-efficient, since it is long and tedious to prune and retrain each layer to the desired compression ratio instead of processing the whole model in one pass.
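The two dataset-similarity metrics used at the beginning of this analysis can be sketched as follows, under the description that the feature embeddings of each dataset are averaged and the metrics are computed between the two mean vectors. The MMD is assumed here to use a linear kernel, in which case it reduces to the Euclidean distance between the mean embeddings; this is an assumption of the example rather than a detail confirmed by [73], and the function names and toy embeddings are hypothetical.

```python
import math


def mean_embedding(embeddings):
    """Component-wise average of a list of feature vectors."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def linear_mmd(u, v):
    """MMD with a linear kernel: distance between the mean embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy embeddings standing in for features of the two datasets.
source = mean_embedding([[1.0, 0.0], [3.0, 0.0]])   # [2.0, 0.0]
target = mean_embedding([[2.0, 1.0], [2.0, -1.0]])  # [2.0, 0.0]
print(cosine_distance(source, target), linear_mmd(source, target))  # 0.0 0.0
```

Values close to zero for both metrics, as obtained for ImageNet and Market1501, indicate that the two datasets are close in the embedding space.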

Conclusions
In this paper, we investigate the prunability of deep architectures for person ReID by analyzing state-of-the-art pruning methods suitable for compressing them, in terms of the criteria used to select channels and of the strategies used to reduce channels. In addition, we propose different scenarios or pipelines for leveraging a pruning method during the deployment of a network for a target application. Experimental evaluations on multiple benchmark source and target datasets indicate that pruning can considerably reduce network complexity (number of FLOPS and parameters) while maintaining a high level of accuracy. They also suggest that pruning larger CNNs can provide a significantly better performance than fine-tuning smaller ones. One key observation of the scenario-based experimental evaluations is that pruning and fine-tuning should be performed in the same domain.
Future experiments could explore a reduction in pruning iterations in order to reduce the impact of pruning on knowledge corruption. Retraining of the pruned networks could also be improved by adding a learning rate decay. Using layer-by-layer methods with different compression ratios for each layer can improve the results, since some layers are more resilient to pruning than others. Techniques for freezing parts of the network can also improve accuracy, but they drastically increase the time complexity of the pruning and retraining phases. The soft pruning method could also benefit from better selection criteria, e.g., using a gradient-based approach instead of the norm of the channel weights. Finally, another interesting future experiment would be to avoid costly pruning on a large pretraining dataset and only use the progressive soft pruning scheme to see if it can achieve similar results with higher compression ratios. While pruning approaches have proven to be effective in person ReID, we observed that they assume pruning and fine-tuning within the same domain, which can limit their usage. In addition, work on pruning in unsupervised learning settings is still quite limited. Future work could extend such pruning methods to unsupervised domain adaptation in person ReID.