
Advanced fine-tuning procedures to enhance DNN robustness in visual coding for machines

Abstract

Video Coding for Machines (VCM) is gaining momentum in applications like autonomous driving, industrial manufacturing, and surveillance, where the robustness of machine learning algorithms against coding artifacts is one of the key success factors. This work complements the MPEG/JVET standardization efforts in improving the resilience of deep neural network (DNN)-based machine models against such coding artifacts by proposing the following three advanced fine-tuning procedures for their training: (1) the progressive increase of the distortion strength as the training proceeds; (2) the incorporation of a regularization term in the original loss function to minimize the distance between predictions on compressed and original content; and (3) a joint training procedure that combines the two proposed approaches. These proposals were evaluated against a conventional fine-tuning anchor on two different machine tasks and datasets: image classification on ImageNet and semantic segmentation on Cityscapes. Our joint training procedure is shown to reduce the training time in both cases and still obtain a 2.4% coding gain in image classification and 7.4% in semantic segmentation, whereas a slight increase in training time can bring up to 9.4% better coding efficiency for the segmentation. All these coding gains are obtained without any additional inference or encoding time. As these advanced fine-tuning procedures are standard-compliant, they offer the potential to have a significant impact on visual coding for machine applications.

1 Introduction

The digital age has given rise to an unprecedented growth of new media products and services that catalyze the surge of digital visual content in our multimedia-centric society. This trend has necessitated the development of various international image and video coding standards, such as Joint Photographic Experts Group 2000 (JPEG2000) [1], High Efficiency Image File Format (HEIF) [2], Advanced Video Coding (AVC) [3], High Efficiency Video Coding (HEVC) [4], and Versatile Video Coding (VVC) [5], which have become the cornerstones for efficient visual data transmission and storage for human consumption. In addition, visual data consumption is currently undergoing a transformative shift from human consumers towards a machine-to-machine paradigm, where automated content analysis is becoming predominantly driven by artificial intelligence. This shift takes place across various applications, including autonomous driving, intelligent transportation systems, smart manufacturing, and surveillance. To that end, the Joint Video Experts Team (JVET) of the Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG) seeks to develop a new video coding standard called video coding for machines (VCM) [6] that is tailored to enhance the efficiency of deep neural network (DNN) machine task algorithms [7].

Figure 1 presents an overview of the general coding for machine workflow that underpins this work. This process involves transforming pristine visual content through the encoder, resulting in a compressed bitstream. These compressed data, while optimized for storage and transmission, pass through the decoder to reconstruct the visual content, which is subsequently fed into a machine model. This model is designed to output a prediction that forms the basis of machine-driven decision-making processes. As highlighted in Fig. 1, MPEG/JVET standardization activities related to VCM [6] primarily concentrate on the encoder and decoder parts of this workflow.

At reduced bitrates, the decreased machine performance (i.e., the DNN prediction ability) is mainly due to the introduction of coding artifacts [8,9,10,11], which are a direct consequence of the lossy compression techniques necessary for bitrate reduction [12, 13]. Earlier attempts have shown that part of the DNN performance drop due to coding artifacts can be mitigated by employing a dedicated fine-tuning training procedure with compressed content [14, 15]. Nevertheless, existing solutions from the literature either suffer from a lack of machine performance enhancement [16] or a high training time requirement [14].

Our research addresses the challenge of enhancing the robustness of DNNs to coding distortions, which significantly impacts their predictive accuracy in compressed visual data. In this work, we propose and evaluate three advanced fine-tuning procedures designed to enhance machine resilience to coding artifacts:

  • Training procedure \(\mathcal {T}_{1}\) leverages similarities among coding artifacts by progressively increasing the distortion strength as the training advances [16].

  • Training procedure \(\mathcal {T}_{2}\) uses a regularization term designed to encourage the DNN to produce similar predictions for compressed images and their pristine counterparts [17].

  • Training procedure \(\mathcal {T}_{3}\) combines the approaches of the first two training procedures, integrating both a regularization term and progressive training to improve machine resilience.

The advantages of these proposed training procedures include: (i) ease of implementation; (ii) maintenance of online inference and encoding time; (iii) improved DNN performance at equivalent bitrates; (iv) adjustable offline re-training time, allowing for an optimal balance between DNN performance and bitrate; (v) standard-compliant nature; and (vi) complementarity with the existing standardization efforts undertaken by MPEG/JVET [6].

Another distinctive contribution of this article stems from its in-depth evaluation of training procedures. The transferability of results across various machine tasks and datasets is not guaranteed; hence, an extensive analysis is key. To address this, the proposed training procedures are evaluated and compared on two highly dissimilar machine tasks and datasets: image classification on ImageNet [18] and semantic segmentation on Cityscapes [19]. This in-depth evaluation framework not only strengthens the validity of our results but also provides unique insights. When compared against a conventional training procedure with compressed content, results indicate that bitrate savings are achieved for both machine tasks at equivalent machine performance, while necessitating a lower training time.

Fig. 1 Overview of the general coding for machine workflow, which aims to balance between minimizing the bitrate and maximizing the accuracy of subsequent machine-based predictions

The rest of this paper is structured as follows. Section 2 gives an overview of related works in the literature. The three training procedures \(\mathcal {T}_{1}\), \(\mathcal {T}_{2}\), and \(\mathcal {T}_{3}\) proposed to improve machines’ resilience to coding artifacts are introduced in Sect. 3. Section 4 outlines the experimental setup. The results of our study are presented in Sect. 5, and Sect. 6 discusses limitations and possible extensions. Finally, Sect. 7 concludes the paper.

2 Related works

This section first covers works related to the limitations of DNN resilience towards general image degradation. General image degradation refers to any kind of image quality deterioration that occurs to an image during signal acquisition, transmission, or storage. A non-exhaustive list of general image degradations includes blurring, noise, contrast reduction, and coding artifacts. This section then focuses solely on one sub-category of general image degradation, namely coding artifacts. In particular, existing solutions that mitigate the lack of DNN resilience to coding artifacts are explored.

2.1 Limits of DNN resilience to general image degradation

The resilience of DNN to general image degradation has received considerable attention in recent studies. Dodge et al. [8] investigated how different ImageNet [18] classifiers respond to blur, noise, JPEG coding artifacts, and contrast reduction. Subsequent research by Dodge et al. [20] revealed that despite the enhanced performance of popular classifiers on undistorted content, they still significantly lag behind human capabilities when classifying distorted images. Furthermore, Hendrycks et al. [11] conducted extensive benchmarks to evaluate the resilience of ImageNet [18] classifiers to general image degradations. They demonstrated that advancements in classifier architectures from AlexNet [21] to ResNets [22] have not necessarily translated into improved robustness against general image degradations, even as accuracy on pristine images has increased. These findings underscore the ongoing challenge of enhancing DNN robustness to general image degradations.

Numerous approaches have been proposed to enhance the generalization capabilities of DNNs in the presence of general image degradations. Sun et al. [23] introduced additional nonlinear layers to existing machine architectures, targeting resilience improvements against shifts in higher moment statistics, such as skewness and kurtosis. Zheng et al. [24] suggested employing a training procedure named stability training to enhance DNN robustness against minor image degradations. However, the effectiveness of stability training in improving robustness has been critically questioned by subsequent studies [11, 23] through various experimental evaluations. Employing a similar strategy to stability training, Kannan et al. [25] introduced a stability term to the loss function, designed to reduce the discrepancy between predictions on distorted and pristine images. Additionally, training on images subjected to random style transfer [26] has been proposed by Geirhos et al. [27] as a method to prevent DNNs from overfitting on texture information and to encourage them to rely on shape information instead.

2.2 Improving DNN resilience to coding artifacts

In the coding for machines domain, Fischer et al. [15] demonstrated remarkable enhancements in DNN resilience with the data augmentation (DA) training procedure that trains one DNN model on a mixture of coding distortions at once. However, it is worth noting that this approach exhibits limitations when confronted with a diverse set of coding distortions [16] or with coding artifacts from compression standards unseen at training. Alternatively, Löhdefink et al. [14] observed significant improvements in machine performance when employing a fine-tuning training procedure. This observation was made on a very limited subset of the coding distortions considered, which emphasizes the importance of effectively addressing the training time associated with these approaches. One aspect common to both the DA and fine-tuning training procedures is that they take advantage of compressed content at training time. Indeed, compared to not using any fine-tuning training procedure, it has been widely shown in the literature that re-training machine models on compressed data enables significant improvements in DNN prediction ability [28,29,30].

To evaluate the solutions proposed in this work, we define two established approaches as existing solutions: (1) the pre-trained baseline procedure and (2) the conventional training procedure \(\mathcal {T}_{0}\). These will serve as benchmarks against which the proposed methods are compared.

2.2.1 Pre-trained baseline procedure

Figure 2 illustrates the pipeline used in the pre-trained baseline procedure, in which a prediction \(g(\hat{I})\) is obtained from a machine model \(g\) where no re-training procedure has been involved. In a nutshell, compressed images \(\hat{I}\) are fed to a machine model \(g\) pre-trained on pristine images \(I\) to obtain predictions \(g(\hat{I})\). As the pre-trained machine model \(g\) has never encountered coding artifacts during its training process, the statistical shift between training images \(I\) and testing images \(\hat{I}\) makes the pre-trained baseline procedure inherently limited in terms of achievable machine performance.

Fig. 2 Pipeline for the pre-trained baseline procedure. The machine model \(g\) is a DNN model pre-trained on pristine images

2.2.2 Conventional training procedure \(\mathcal {T}_{0}\)

Figure 3 illustrates the pipeline of the conventional training procedure \(\mathcal {T}_{0}\), which alleviates the shortcomings of the pre-trained baseline procedure by taking advantage of compressed images \(\hat{I}\) with an additional fine-tuning step. Similar to the pre-trained baseline procedure, a machine model \(f\) pre-trained on pristine images \(I\) is employed. Rather than directly using the weights of the pre-trained machine model \(f\) to obtain predictions on compressed images \(\hat{I}\), the conventional training procedure \(\mathcal {T}_{0}\) utilizes the weights of the pre-trained machine model \(f\) as initialization weights for the re-training step. As already proposed in the coding for machines context [14, 15], the machine model \(f\) is then fine-tuned by comparing the predictions \(f(\hat{I})\) of the machine model \(f\) on compressed images \(\hat{I}\) against a ground truth through the original loss function \(\ell _{0}\). Ultimately, the weights of the machine model \(f\) are updated iteratively through backpropagation to provide increasingly accurate predictions.

Fig. 3 Pipeline for the conventional training procedure \(\mathcal {T}_{0}\), which fine-tunes the machine model \(f\) on compressed images \(\hat{I}\). Weights of the machine model \(f\) are initialized with the weights of a model pre-trained on pristine images

3 Proposed training procedure

DNNs solving machine tasks are mostly trained on high-quality visual content. This is because popular datasets tend not to incorporate any coding artifacts [19], or only slight ones introduced by high-quality JPEG compression [18, 31]. When trained on such data, the resulting DNNs tend to lack the ability to provide accurate predictions on lower-quality content. Therefore, it is desirable to enhance DNN resilience to coding artifacts to mitigate the bias identified above. To this end, this section proposes three advanced offline fine-tuning procedures that are easily applicable given the availability of pristine images and their associated ground truth labels.

3.1 Training procedure \(\mathcal {T}_{1}\): progressive training

Figure 4 illustrates the pipeline of the proposed training procedure \(\mathcal {T}_{1}\), which aims to enhance the resilience of machine models on content that incorporates coding distortions. This procedure, first introduced in our previous work [16], shares some similarities with curriculum learning training procedures [32, 33]. \(\mathcal {T}_{1}\) is based on the assumption that a machine model might have difficulties generalizing when a wide range of coding distortions and strengths come into consideration. To this end, the key idea behind \(\mathcal {T}_{1}\) is to obtain separate DNN weights for distinct coding distortions through a single training. The motivation to employ separate DNN weights for distinct coding distortions, instead of relying on a single set of DNN weights, stems from the observation that such a strategy has shown its potential to offer enhanced coding efficiency [16]. The use of separate DNN weights permits each DNN model to specialize on a predetermined coding distortion or quality range, as it is assumed that such a specialized model will not be employed to obtain predictions on images that contain vastly different coding artifacts. Nonetheless, a previous work [17] has shown that a DNN model specialized for a given coding distortion still offers some degree of generalization across other coding distortions.

This is done by increasing the distortion strength progressively as training advances, where the distortion strength parameter starts at \(Q _{0}\) and ends at \(Q _{\infty }\). This is illustrated on the left part of Fig. 4, where the progressive training function \(\mathcal {P}\) determines which coding distortion should be selected at a given epoch \(e_{i}\). As an example, the coding distortion is characterized by the parameter controlling the amount of quantization in an encoder, such as the quality for JPEG or the quantization parameter for AVC, HEVC, and VVC-based compression standards. At each epoch \(e_{i}\), the distortion strength parameter \(Q\) is equal to \(\mathcal {P}(e_{i})\), which is determined by the following equation:

Fig. 4 Pipeline for the proposed training procedure \(\mathcal {T}_{1}\), which increases the distortion strength progressively as the training advances

$$\begin{aligned} \mathcal {P}(e_{i}) = Q _{\infty } + \Delta Q \lfloor \frac{1}{\Delta Q } ( Q _{0}- Q _{\infty }) \exp (-s e_{i}) \rceil \text {,} \end{aligned}$$
(1)

where \(s \in \mathbb {R}^{+*}\) controls the speed at which the coding distortion strength \(Q\) converges towards \(Q _{\infty }\), and where \(\Delta Q \in \mathbb {N}^{*}\) refers to the step size between two consecutive coding distortion strengths \(Q\). The intuition behind the proposed training procedure \(\mathcal {T}_{1}\) is that achieving high accuracy on images with a coding distortion \(\mathcal {P}(e_{i})\) is easier if the DNN is already robust to images of slightly higher quality \(\mathcal {P}(e_{i-1})\), and so on. Given that converging to a minimum of the DNN loss function is increasingly harder as the coding distortion strength \(Q\) increases, an exponential decay function is used to decrease the pace at which image quality is reduced as the training progresses. The DNN weights of the machine model \(f\) are initialized with weights pre-trained on pristine images. As a consequence of the weights used in the initialization, it should be noted that the distortion found in the first epoch \(Q _{0}\) should cause very few artifacts, if any. That way, compressed images \(\hat{I}\) from the first epoch \(e_{0}\) are as similar as possible to pristine images. As is the case for many encoders, it is likely that compressed images \(\hat{I}\) from the first epoch \(e_{0}\) are not strictly lossless. Indeed, performing JPEG compression with a quality of \(100\) or video compression with a quantization parameter of \(0\) with an HEVC or VVC-based encoder may still induce artifacts.

To ensure that the proposed training procedure \(\mathcal {T}_{1}\) converges in a finite number of epochs towards the coding distortion with the lowest quality \(Q _{\infty }\), a floor operator \(\lfloor . \rfloor\) is applied in Eq. 1 when the coding distortion strength \(Q\) decreases as the training progresses (i.e., \(Q _{0} > Q _{\infty }\)). Conversely, a ceil operator \(\lceil . \rceil\) is applied when the coding distortion strength \(Q\) increases, such as in AVC, HEVC, and VVC-based compression standards. As floor and ceiling operators are used, a given coding distortion is used for multiple epochs. At evaluation time, DNN weights from epoch \(e_{i}\) are utilized by the machine model \(f\) to generate predictions on images that include artifacts from one particular coding distortion \(\mathcal {P}(e_{i}) = \mathcal {P}(e_{j})\), such that

$$\begin{aligned} \forall e_{j} \in \mathbb {N}, \exists ! e_{i} \in \mathbb {N} \quad \text {s.t.} \quad e_{j} < e_{i} \quad \text {and} \quad \mathcal {P}(e_{j}) = \mathcal {P}(e_{i}) \text {,} \end{aligned}$$
(2)

where epochs \(e_{i}\) and  \(e_{j}\) refer to two distinct epochs. Here, the epoch \(e_{i}\) refers to the very last epoch on which the machine model \(f\) was trained with the coding distortion \(\mathcal {P}(e_{i}) = \mathcal {P}(e_{j})\). As long as no overfitting occurred while training the machine model \(f\) on the coding distortion \(\mathcal {P}(e_{i}) = \mathcal {P}(e_{j})\), using at evaluation time DNN weights from the last epoch \(e_{i}\) ensures optimal machine performance on that coding distortion.
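To make the schedule concrete, the following minimal Python sketch implements Eq. 1 under our reading of it, together with the floor/ceil rule described above; the hyper-parameter values in the usage comment are purely illustrative assumptions and not the values of Table 2.

```python
import math

def progressive_quality(epoch: int, q0: int, q_inf: int, s: float, dq: int = 1) -> int:
    """Distortion strength P(e_i) of Eq. 1: the parameter decays exponentially
    from q0 towards q_inf as training advances. Rounding is taken towards q_inf,
    i.e., floor when the parameter decreases (JPEG quality) and ceil when it
    increases (AVC/HEVC/VVC quantization parameter)."""
    scaled = (q0 - q_inf) * math.exp(-s * epoch) / dq
    rounded = math.floor(scaled) if q0 > q_inf else math.ceil(scaled)
    return q_inf + dq * rounded

# Illustrative usage (assumed values): a JPEG schedule starting near lossless
# and converging towards the lowest quality used in the experiments, Q_inf = 5.
# [progressive_quality(e, q0=96, q_inf=5, s=0.05) for e in range(0, 100, 10)]
```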

3.2 Training procedure \(\mathcal {T}_{2}\): regularization term

The proposed training procedure \(\mathcal {T}_{2}\) uses a novel loss function in which a regularization term is added to the original loss function, as described in our previous work [17] and illustrated in Fig. 5. This training procedure takes advantage of both the compressed image \(\hat{I}\) and its pristine counterpart \(I\). Additionally, two distinct machine models denoted \(f\) and \(g\) are leveraged to obtain the predictions \(f(\hat{I})\) and \(g(I)\), respectively. Prediction \(f(\hat{I})\) is generated by simply using the compressed image \(\hat{I}\) as input to the machine model \(f\). Similarly, the pristine image \(I\) is fed to the machine model \(g\) to obtain the prediction \(g(I)\). Both machine models \(f\) and \(g\) are DNN models that share the exact same machine architecture. While the machine model \(f\) is the machine model being re-trained to have an enhanced resilience to coding artifacts, the weights \(\theta _{g}\) of the machine model \(g\) are fixed during the training process. Prior to the training process, the weights \(\theta _{f}\) and \(\theta _{g}\) of both machine models \(f\) and \(g\) are initialized with weights that were pre-trained on undistorted images. Since the machine model \(g\) is not re-trained and utilizes weights \(\theta _{g}\) that were pre-trained on undistorted content, the machine model \(g\) stays specialized on pristine images \(I\). Alternatively, the machine model \(f\) learns how to handle artifacts within compressed images \(\hat{I}\) as the training advances through back-propagation.

Fig. 5 Pipeline for the proposed training procedure \(\mathcal {T}_{2}\), which incorporates a regularization term into the original loss function

Similarly to the conventional training procedure \(\mathcal {T}_{0}\), the original loss \(\ell _{0}\) is computed by comparing the prediction \(f(\hat{I})\) of the DNN model \(f\) on compressed images \(\hat{I}\) with its ground truth. For the image classification machine task, the cross-entropy loss function is considered for the original loss function \(\ell _{0}\), as commonly done for image classifiers [22, 34]. The semantic segmentation machine task, which can be regarded as a classification problem on each pixel taken individually, also employs the cross-entropy as the original loss function \(\ell _{0}\). On top of the original loss \(\ell _{0}\), \(\mathcal {T}_{2}\) incorporates a regularization term \(\ell _{reg}\). The regularization term \(\ell _{reg}\) compares the prediction \(f(\hat{I})\) against the pristine prediction \(g(I)\) by employing a distance function \(D\). In practice, the KL-divergence is employed as the distance function \(D\). Ultimately, both the original loss function \(\ell _{0}\) and the added regularization term \(\ell _{reg}\) are summed together to obtain the final loss function \(\ell\). A weighting scalar value \(\alpha\) is employed to precisely control the balance between the original loss function \(\ell _{0}\) and the added regularization term \(\ell _{reg}\). The final loss function \(\ell\) is summarized with the following equations:

$$\begin{aligned}{} & {} \ell (I, \hat{I}; \theta _{f}) = \ell _{0}(\hat{I};\theta _{f}) + \alpha \ell _{reg}(I, \hat{I}; \theta _{f}) \text {, where} \end{aligned}$$
(3)
$$\begin{aligned}{} & {} \ell _{reg}(I, \hat{I}; \theta _{f})=D(g(I), f(\hat{I}))\text {.} \end{aligned}$$
(4)

Despite using two machine models \(f\) and \(g\), it must be noted that \(\mathcal {T}_{2}\) does not increase the required inference time compared to the conventional training procedure \(\mathcal {T}_{0}\). Indeed, once the machine model \(f\) has been re-trained to become resilient to coding artifacts, the machine model \(g\) and computations related to loss functions are discarded.
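As an illustration, a minimal PyTorch-style sketch of the \(\mathcal {T}_{2}\) objective of Eqs. 3 and 4 could look as follows; this is an assumed implementation rather than the authors' exact code, with the KL-divergence taken in the order \(D(g(I), f(\hat{I}))\).

```python
import torch
import torch.nn.functional as F

def t2_loss(f, g, x_pristine, x_compressed, target, alpha):
    """Regularized loss of Eq. 3: cross-entropy on the compressed input plus an
    alpha-weighted KL divergence between the frozen pristine-image model g and
    the re-trained model f (Eq. 4)."""
    logits_f = f(x_compressed)            # prediction f(I_hat), model being re-trained
    with torch.no_grad():
        logits_g = g(x_pristine)          # prediction g(I), fixed pristine-image model
    loss_0 = F.cross_entropy(logits_f, target)          # original task loss l_0
    loss_reg = F.kl_div(                                # l_reg = D(g(I), f(I_hat))
        F.log_softmax(logits_f, dim=1),
        F.softmax(logits_g, dim=1),
        reduction="batchmean",
    )
    return loss_0 + alpha * loss_reg
```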

The proposed training procedure \(\mathcal {T}_{2}\) shares some similarities with the stability training procedure proposed by Zheng et al. [24]. In a nutshell, stability training aims to ensure that a machine model provides consistent outputs for similar inputs. To this end, a new stability term \(\ell _{stability}\) is combined with the original loss function \(\ell _{0}\) to obtain the final loss function \(\ell\) defined as follows:

$$\begin{aligned}{} & {} \ell (I, I'; \theta _{f}) = \ell _{0}(I;\theta _{f}) + \alpha \ell _{stability}(I, I'; \theta _{f}) \text {, where} \end{aligned}$$
(5)
$$\begin{aligned}{} & {} \ell _{stability}(I, I'; \theta _{f})=D(f(I), f(I'))\text {.} \end{aligned}$$
(6)

Despite some notable similarities, there are inherent differences between \(\mathcal {T}_{2}\) and stability training [24]. The key differences between the two methods are threefold. First, stability training seeks to achieve general resilience to general image degradation, while \(\mathcal {T}_{2}\) focuses solely on one sub-category of general image degradation, namely coding artifacts. The final loss function \(\ell\) of the proposed training procedure \(\mathcal {T}_{2}\) in Eq. 3 makes use of the compressed image \(\hat{I}\). In contrast, stability training employs a pixel-wise additive white Gaussian noise as a general unbiased degradation to generate degraded images \(I'\):

Fig. 6 Pipeline for the proposed training procedure \(\mathcal {T}_{3}\), which jointly uses the progressive training procedure \(\mathcal {T}_{1}\) and the procedure \(\mathcal {T}_{2}\) that incorporates a regularization term into the original loss function

$$\begin{aligned} I'(i,j) = I(i,j) + \epsilon (i,j) , \epsilon (i,j) \sim \mathcal {N}(0,\,\sigma ^{2})\,, \end{aligned}$$
(7)

where \((i,j)\) represents a pixel coordinate, \(\epsilon\) the added noise, and \(\sigma ^{2}\) the variance of the Gaussian distribution \(\mathcal {N}\). As a consequence, stability training makes no use of any prior knowledge from the context of coding for machines, where it is arguable that the main source of general image degradation comes from the compression stage. Because stability training only leverages small image degradations \(I'\) of the pristine image \(I\), it is less likely to offer high resilience to one particular and severe degradation type, namely coding artifacts. Second, Eq. 5 shows that stability training aims to minimize the original loss function \(\ell _{0}\) on undistorted images \(I\), whereas \(\mathcal {T}_{2}\) minimizes it on compressed images \(\hat{I}\) (Eq. 3). Third, \(\mathcal {T}_{2}\) employs two separate machine models \(f\) and \(g\) within the regularization term \(\ell _{reg}\), while stability training employs a single machine model \(f\) to generate both the distorted prediction \(f(I')\) and the pristine prediction \(f(I)\). One implication arising from these last two key differences is that a machine model trained with the stability training procedure must keep good performance on both undistorted images \(I\) and degraded images \(I'\). Conversely, \(\mathcal {T}_{2}\) enables higher machine performance opportunities by narrowing the focus on coding artifacts from compressed images \(\hat{I}\).

3.3 Training procedure \(\mathcal {T}_{3}\): joint use of training procedures

The proposed training procedure \(\mathcal {T}_{3}\) combines both \(\mathcal {T}_{1}\), which progressively increases the distortion strength as the training advances, and \(\mathcal {T}_{2}\), which adds a regularization term to the original loss function. Figure 6 illustrates the pipeline of this proposed joint training procedure. As \(\mathcal {T}_{3}\) integrates both \(\mathcal {T}_{1}\) and \(\mathcal {T}_{2}\) in a unified fashion, the pipeline shares similarities with the ones from Figs. 4 and 5. Similar to \(\mathcal {T}_{2}\), the machine model \(g\) provides predictions \(g(I)\) on pristine images \(I\). As both the pristine images \(I\) and the DNN model \(g\) are fixed, the predictions \(g(I)\) stay the same throughout the whole training. In contrast, the predictions \(f(\hat{I})\) from the machine model \(f\) vary across training time for two distinct reasons. First, the weights of the DNN model \(f\) are updated through backpropagation. Second, the use of \(\mathcal {T}_{1}\) makes the distortion strength of compressed images \(\hat{I}\) increase progressively as the training advances. By taking advantage of both predictions and the ground truth, the final loss \(\ell\) is computed through a weighted sum of the original loss \(\ell _{0}\) and the regularization term \(\ell _{reg}\). Ultimately, the backpropagation takes advantage of the final loss \(\ell\) to increase the robustness of the machine model \(f\) to coding artifacts.
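For illustration only, the sketch below combines the two mechanisms in a single loop, reusing the progressive_quality and t2_loss sketches from Sects. 3.1 and 3.2; encode_batch and loader are hypothetical helpers (a JPEG round-trip applied to a batch at the current quality, and a data loader yielding pristine inputs with their labels), not part of the authors' implementation.

```python
def train_t3(f, g, loader, optimizer, epochs, q0, q_inf, s, alpha, encode_batch):
    """Joint T3 loop: the T1 schedule selects the distortion strength per epoch,
    while every optimization step minimizes the T2 regularized loss of Eq. 3."""
    g.eval()  # the pristine-image model g stays frozen throughout training
    for epoch in range(epochs):
        quality = progressive_quality(epoch, q0, q_inf, s)   # T1: current distortion
        for pristine, target in loader:
            compressed = encode_batch(pristine, quality)     # hypothetical JPEG round-trip
            loss = t2_loss(f, g, pristine, compressed, target, alpha)  # T2: Eq. 3
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return f
```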

4 Experimental setup

This section describes the overall framework used to conduct the experimental evaluations. The proposed training procedures \(\mathcal {T}_{1}\), \(\mathcal {T}_{2}\), and \(\mathcal {T}_{3}\) are evaluated against the conventional training procedure \(\mathcal {T}_{0}\), which acts as an anchor. These four training procedures all involve a re-training step using compressed data. Moreover, as most papers related to the coding for machines context do not involve any re-training on compressed images [10, 12, 13, 35,36,37,38,39,40,41], the pre-trained baseline procedure introduced in Sect. 2.2 is also considered. As opposed to \(\mathcal {T}_{0}\), \(\mathcal {T}_{1}\), \(\mathcal {T}_{2}\), and \(\mathcal {T}_{3}\), the pre-trained baseline procedure produces predictions by solely relying on a machine model pre-trained on pristine data. Additionally, the data augmentation (DA) training procedure, which utilizes a single set of model weights to generalize across every degradation type, is included in the experiments to compare against \(\mathcal {T}_{0}\), \(\mathcal {T}_{1}\), \(\mathcal {T}_{2}\), and \(\mathcal {T}_{3}\), which rely on separate model weights for each coding distortion.

4.1 Machine tasks and datasets

Experiments are conducted on two distinct machine tasks, namely image classification and semantic segmentation, using the ImageNet [18] and the Cityscapes [19] datasets, respectively. The rationale behind the choice of these machine tasks and datasets lies within their dissimilarities in numerous aspects, as highlighted in Table 1.

Table 1 Dissimilarities of considered datasets

Dissimilarities of the considered machine tasks and datasets are threefold. First, the ImageNet [18] and the Cityscapes [19] datasets are dissimilar in terms of image resolution, number of training images, and content type. Second, image classification and semantic segmentation can be regarded as low-level and high-level machine task algorithms, respectively. Indeed, image classification necessitates the prediction of a single class per image, while the semantic segmentation machine task is done with pixel-by-pixel labeling of the input image. Third, ImageNet is composed of JPEG-compressed images, whereas Cityscapes exclusively features pristine, uncompressed images. Such inherent differences in datasets will affect the generalization ability to coding artifacts of deep learning-based machine models.

4.2 Coding standard

To assess the effectiveness of the proposed training procedures, the JPEG coding standard is employed to generate coding distortions. The underlying motivation for this choice lies in the observation that JPEG is the most widespread image encoder even nowadays [42]. Indeed, related works [8, 9, 11] that evaluate the resilience of DNN to general image degradations only consider JPEG when it comes to image coding artifacts. A total of \(\# Q = 6\) coding distortions are considered; they are obtained using the JPEG implementation of the Pillow library version \(8.0\) and JPEG qualities \(Q = \{30, 20, 15, 10, 7, 5\}\). It is worth mentioning that similar results have been obtained on other coding standards than JPEG when both training procedures \(\mathcal {T}_{1}\) and \(\mathcal {T}_{2}\) have been evaluated in the previous works [16, 17]. Notably, these other coding standards include JPEG2000, Better Portable Graphics (BPG) [43] image encoders, as well as the all intra configuration of Joint Test Model (JM) [44], x265 [45], and Fraunhofer Versatile Video Encoder (VVenC) [46] video encoders.
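As a reference for how such distortions can be produced, the following is a minimal sketch (an assumption, not the authors' exact pipeline) of a Pillow JPEG round-trip at the qualities listed above.

```python
import io
from PIL import Image

QUALITIES = [30, 20, 15, 10, 7, 5]  # JPEG qualities considered in the experiments

def compress_jpeg(image: Image.Image, quality: int) -> Image.Image:
    """Round-trip an image through Pillow's JPEG encoder at the given quality."""
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

# Example: pre-compute the six distorted versions of one training image.
# distorted = {q: compress_jpeg(Image.open("example.png"), q) for q in QUALITIES}
```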

4.3 Comparison with alternative training procedures

The data augmentation (DA) training procedure is considered in the experiments. The DA training procedure utilizes a single set of model weights to generalize across every degradation type, while \(\mathcal {T}_{0}\), \(\mathcal {T}_{1}\), \(\mathcal {T}_{2}\), and \(\mathcal {T}_{3}\) rely on separate model weights for each coding distortion. The practical implementation of DA is similar to the training procedure  \(\mathcal {T}_{0}\), with the exception that each forward pass at training time is performed with a compressed image \(\hat{I}\) encoded with a random JPEG quality. The random JPEG quality for the DA training procedure is drawn from a uniform probability distribution in the range \(5 \le Q \le 70\). It is important to note that the JPEG quality range \(5 \le Q \le 70\) used for DA encompasses a wider quality range than the one depicted in Sect. 4.2. This choice can be understood from the observation that DA must generalize across every degradation type with a single set of model weights, while other training procedures rely on separate model weights for each coding distortion. As a consequence, higher JPEG qualities are incorporated in the DA training phase to ensure that DNN models trained with the DA training procedure remain reliable on higher-quality images.
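A hypothetical on-the-fly transform for DA, under the uniform quality sampling described above, could be sketched as follows; the class and parameter names are illustrative and not taken from the authors' code.

```python
import io
import random
from PIL import Image

class RandomJpeg:
    """Re-encode each image with a JPEG quality drawn uniformly from [q_min, q_max]."""
    def __init__(self, q_min: int = 5, q_max: int = 70):
        self.q_min, self.q_max = q_min, q_max

    def __call__(self, image: Image.Image) -> Image.Image:
        quality = random.randint(self.q_min, self.q_max)
        buf = io.BytesIO()
        image.convert("RGB").save(buf, format="JPEG", quality=quality)
        buf.seek(0)
        return Image.open(buf).convert("RGB")
```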

As mentioned in Sect. 4.1, the Cityscapes dataset [19] is exclusively composed of pristine images that do not incorporate any compression artifacts. Consequently, on top of the pre-trained baseline procedure, another procedure denoted as baseline* is incorporated in the results related to the semantic segmentation machine task. The baseline* procedure is similar to the baseline procedure, with the exception that baseline* fine-tunes the baseline DNN weights using compressed images \(\hat{I}\) encoded with a very high JPEG quality. As the ImageNet dataset [18] is mostly composed of JPEG images encoded with a quality of \(Q = 96\), the chosen JPEG quality for the baseline* procedure is fixed to \(Q = 96\) as well.

It is worth noting that the stability training procedure from Zheng et al. [24] is not included in the experiments despite its similarity with the training procedure \(\mathcal {T}_{2}\). This is motivated by preliminary work [17] where stability training [24] has shown a much worse performance in image classification accuracy compared to training procedures \(\mathcal {T}_{0}\) and \(\mathcal {T}_{2}\) which incorporate coding artifacts at training time.

4.4 Practical implementation

On top of conducting experiments on two distinct machine tasks and datasets, distinct machine architectures have to be employed for each machine task. For the image classification machine task on ImageNet [18], the ResNet50 [22] machine architecture is considered. Pre-trained DNN weights of the ResNet50 model from the PyTorch implementation, with a reported top-1 validation accuracy of \(76.13\%\), are utilized to initialize the DNN models \(f\) and \(g\) for baseline*, DA, \(\mathcal {T}_{0}\), \(\mathcal {T}_{1}\), \(\mathcal {T}_{2}\), and \(\mathcal {T}_{3}\). For the semantic segmentation machine task on Cityscapes [19], the DeepLabV3+ [47] machine architecture with a ResNet50 [22] backbone is employed for experiments. The MMSegmentation library [48] is used to gather pre-trained weights of the DeepLabV3+ machine model, with a reported mean intersection over union (mIoU) of \(0.8028\). Additionally, implementation, training, and evaluation are done by utilizing the MMSegmentation library for all considered pre-trained and training procedures.
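For the image classification experiment, the initialization of the two models \(f\) and \(g\) required by \(\mathcal {T}_{2}\) and \(\mathcal {T}_{3}\) could be sketched as follows; this is an assumed torchvision-based setup, not the authors' exact code.

```python
import torchvision

# Both models start from the same torchvision ResNet-50 checkpoint (reported
# 76.13% top-1 accuracy); g is kept frozen, f is the model being fine-tuned.
weights = torchvision.models.ResNet50_Weights.IMAGENET1K_V1
f = torchvision.models.resnet50(weights=weights)
g = torchvision.models.resnet50(weights=weights)
for p in g.parameters():
    p.requires_grad = False
g.eval()
```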

Table 2 Summary of the dataset, machine architecture, and hyper-parameters used for both the image classification and semantic segmentation experiment

Table 2 provides a summary of the dataset, the machine architecture, and the hyper-parameters used for both the image classification and semantic segmentation experiments. The JPEG starting quality \(Q _{0}\), the JPEG convergence quality \(Q _{\infty }\), the JPEG quality step \(\Delta Q\), and the convergence speed \(s\) refer to hyper-parameters from Eq. 1. Such hyper-parameters are employed by both \(\mathcal {T}_{1}\) and \(\mathcal {T}_{3}\), which utilize the progressive training procedure described in Sect. 3.1. The regularization term weight \(\alpha\) from Eq. 3 is employed by both \(\mathcal {T}_{2}\) and \(\mathcal {T}_{3}\), which rely on the addition of a regularization term to the original loss function, as described in Sect. 3.2. The adaptive moment estimation (ADAM) [49] and stochastic gradient descent (SGD) optimizers are employed for the image classification and semantic segmentation machine tasks, respectively. When multiple learning rates or regularization term weights \(\alpha\) are specified, the values considered in the experiments are the ones that achieved the highest machine performance among the specified values. This choice provides greater consistency across experiments by employing appropriate hyper-parameters for each training. For the sake of clarity, it is not specified which learning rate and regularization term weight \(\alpha\) were used for each individual training. Nevertheless, it is noteworthy that the experiments are fully reproducible by performing one training for each hyper-parameter combination.

4.5 Assessment metrics

In the coding for machines context, an appropriate trade-off between the machine performance and the bitrate is sought. Regarding the machine performance measure, the top-1 accuracy and the mIoU metrics are used to assess the machine performance of image classifiers and semantic segmentation machine models, respectively. As both machine tasks considered in the experiments involve still images, the bit per pixel (bpp) is used as a measure of the bitrate. Finally, coding efficiency is evaluated using the well-known Bjøntegaard delta rate (BD-rate) evaluation method [50], which computes average bitrate differences for the same machine performance. Similarly to the common test conditions (CTC) [51], the piecewise cubic Hermite interpolation method is employed to compute BD-rate scores.
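For reference, a minimal sketch of the BD-rate computation with PCHIP interpolation is given below; it assumes machine performance values sorted in strictly increasing order and is not the official CTC script.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def bd_rate(rate_anchor, perf_anchor, rate_test, perf_test):
    """Bjøntegaard delta rate (in %) of a test curve against an anchor curve,
    interpolating log-rate as a function of machine performance with PCHIP."""
    p_anchor = PchipInterpolator(perf_anchor, np.log10(rate_anchor))
    p_test = PchipInterpolator(perf_test, np.log10(rate_test))
    lo = max(min(perf_anchor), min(perf_test))   # overlapping performance range
    hi = min(max(perf_anchor), max(perf_test))
    avg_diff = (p_test.integrate(lo, hi) - p_anchor.integrate(lo, hi)) / (hi - lo)
    return (10 ** avg_diff - 1) * 100
```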

On top of measuring trade-offs between bitrate and machine task performance, the training time is also evaluated. The evaluated training time is expressed in epochs. One may observe that the required computation time to perform an entire epoch is equivalent across training procedures DA, \(\mathcal {T}_{0}\), \(\mathcal {T}_{1}\), \(\mathcal {T}_{2}\), and \(\mathcal {T}_{3}\). This is true with the assumption that computing the regularization term \(\ell _{reg}\) and the corresponding weighted sum from Eq. 3 is negligible in terms of computing time compared to the forward and backward pass of the machine model \(f\). Indeed, as both pristine images \(I\) and the machine model \(g\) are fixed, predictions \(g(I)\) are pre-computed and do not interfere with the training time of \(\mathcal {T}_{2}\) and \(\mathcal {T}_{3}\). Note that the training time is expressed in a measure of time that is inherent to the training in itself. Consequently, this allows measurement noise due to the execution of unrelated background programs or hardware-related problems to be circumvented.

For \(\mathcal {T}_{1}\) and \(\mathcal {T}_{3}\), which involve the progressive training procedure, training is stopped at the end of the last epoch with the lowest JPEG quality \(\text {min}( Q ) = 5\). The resulting number of epochs accounts for the training time of such training procedures. For \(\mathcal {T}_{0}\) and \(\mathcal {T}_{2}\), which do not involve the progressive training procedure, one separate training is run for each coding distortion. As a result, the final training time is obtained by adding up the number of epochs required until convergence for each of the \(\# Q = 6\) individual training runs. Similarly to \(\mathcal {T}_{0}\) and \(\mathcal {T}_{2}\), the final training time for DA is based on the number of epochs required for the unique training to converge.

As \(\mathcal {T}_{0}\), \(\mathcal {T}_{2}\), and DA are trained until convergence, reported BD-rate scores allow the measurement of the highest achievable coding efficiency for each training procedure. However, one disadvantage of such an experimental protocol lies in the increased variance for the measured training time due to the random nature of the training. The motivation to use such an experimental setup can be understood from the observation that improving coding efficiency is of far greater importance in the VCM context [51].

It is worth noting that the total energy consumption used to train one machine model shares a linear relationship with the number of epochs required to train such a model. Indeed, the total number of computations needed to perform one training epoch is equivalent across all training procedures \(\mathcal {T}_{0}\), \(\mathcal {T}_{1}\), \(\mathcal {T}_{2}\), and \(\mathcal {T}_{3}\). Once deployed in a real-world application, the power consumption of the machine model inference is equivalent for all training procedures. This can be understood from the observation that only the trained machine model \(f\) is kept at inference time, regardless of the training procedure used.

5 Experimental results

This section presents and discusses the experimental results of all the considered procedures, namely baseline, baseline* for the semantic segmentation machine task, DA, \(\mathcal {T}_{0}\), \(\mathcal {T}_{1}\), \(\mathcal {T}_{2}\), and \(\mathcal {T}_{3}\).

5.1 Results on image classification

Figure 7 presents the experimental results for the image classification machine task, where the relative training time and BD-rate scores are compared against the anchor \(\mathcal {T}_{0}\). BD-rate scores are computed with top-1 accuracy as the machine task performance measure. In addition, Fig. 8 illustrates the trade-offs between bpp and image classification top-1 accuracy; shown curves refer to the points surrounded with dashed circles in Fig. 7.

Fig. 7 Trade-off between relative training time and BD-rate compared to the anchor \(\mathcal {T}_{0}\) for the image classification machine task. Each point of \(\mathcal {T}_{1}\) and \(\mathcal {T}_{3}\) is obtained with convergence speed \(s \in \{0.05, 0.1\}\) and \(s \in \{0.03, 0.05, 0.075, 0.1\}\), respectively. Points surrounded with dashed circles are those whose curves between bitrate and top-1 accuracy are shown in Fig. 8

Fig. 8 Comparison of bpp and top-1 accuracy for the pre-trained baseline procedure, the anchor \(\mathcal {T}_{0}\), and the training procedure \(\mathcal {T}_{3}\) with convergence speed \(s=0.03\). Shown curves between bitrate and top-1 accuracy are those whose points are surrounded with dashed circles in Fig. 7

Results show that the pre-trained baseline procedure fails to compete in terms of coding efficiency, with a BD-rate increase of \(91.6\%\) compared to the anchor \(\mathcal {T}_{0}\). It is worth mentioning that this BD-rate value has very low reliability, as the overlap between the anchor \(\mathcal {T}_{0}\) and the baseline is almost non-existent. Nevertheless, the very high BD-rate value of \(91.6\%\) demonstrates the poor coding efficiency of the baseline procedure when compared against the anchor \(\mathcal {T}_{0}\). These results are well aligned with the literature [14, 15, 28,29,30, 52], where the importance of incorporating coding artifacts at training time has already been shown empirically. Moreover, \(\mathcal {T}_{1}\) offers a lower training time than the anchor \(\mathcal {T}_{0}\), but at the cost of an increase in BD-rate scores. Conversely, \(\mathcal {T}_{2}\) provides a \(-1.3\%\) BD-rate reduction over the anchor \(\mathcal {T}_{0}\), which is consistent with preliminary works [17]. Nevertheless, \(\mathcal {T}_{2}\) makes the considered DNN model converge at a slower pace. While the anchor \(\mathcal {T}_{0}\) requires a total of \(98\) epochs to reach convergence on the \(\# Q = 6\) separate trainings, \(\mathcal {T}_{2}\) requires a total of \(172\) epochs, accounting for \(175.5\%\) in terms of relative training time.

Interestingly, using both the regularization term and the progressive training procedures, \(\mathcal {T}_{3}\) offers BD-rate improvement with a lower training time than the anchor \(\mathcal {T}_{0}\). Results show that a \(-2.4\%\) BD-rate reduction is achieved at equivalent image classification top-1 accuracy. Furthermore, a BD-rate of \(-4.0\%\) is obtained with a relative training time of \(109.2\%\) using the slowest convergence speed \(s=0.03\). These results point out that tuning the convergence speed \(s\) in Eq. 1 allows different trade-offs between bitrate, machine performance, and training time to be reached. In addition, the fact that both \(\mathcal {T}_{2}\) and \(\mathcal {T}_{3}\) achieve better coding efficiency over the anchor \(\mathcal {T}_{0}\) further emphasizes the importance of the added regularization term for the image classification experiment.

On the image classification experiments, the DA training procedure achieves a \(6.4\%\) BD-rate increase over the anchor \(\mathcal {T}_{0}\). It is important to emphasize that DA is inherently disadvantaged over  \(\mathcal {T}_{0}\), \(\mathcal {T}_{1}\), \(\mathcal {T}_{2}\), and \(\mathcal {T}_{3}\) as it utilizes a single set of model weights to generalize across every degradation type. The observation that \(\mathcal {T}_{0}\) achieves better coding efficiency than DA is aligned with the results of a similar experiment conducted in a previous work [16].

Figure 8 shows the bpp and top-1 accuracy values for six curves: the pre-trained baseline procedure, the anchor \(\mathcal {T}_{0}\), DA, \(\mathcal {T}_{1}\) with the convergence speed \(s=0.05\), \(\mathcal {T}_{2}\), and the training procedure \(\mathcal {T}_{3}\) with the slowest convergence speed \(s=0.03\). These curves correspond to the points surrounded by dashed circles in Fig. 7. As introduced in Sect. 2.2 and Fig. 2, the baseline curve refers to the machine model pre-trained on the original dataset images without any further re-training steps. Foreseeably, the pre-trained baseline procedure model is largely unable to compete with the other training strategies that involve re-training using compressed data. On the one hand, the proposed training procedure \(\mathcal {T}_{3}\) increasingly outpaces the anchor \(\mathcal {T}_{0}\) performance as the bitrate diminishes. A \(1.3\%\) top-1 accuracy improvement is achieved over the anchor \(\mathcal {T}_{0}\) on the coding distortion with the lowest bitrate, while the machine performance is the same on the coding distortion with the highest bitrate. Indeed, using DNN weights pre-trained on pristine images in the anchor \(\mathcal {T}_{0}\) becomes increasingly less relevant as the distortion strength on which the DNN model is re-trained increases. On the other hand, the training procedure DA is only able to compete with the anchor \(\mathcal {T}_{0}\) at higher bitrates. At lower bitrates, the coding efficiency of the DA training procedure deteriorates at a much faster pace compared to the anchor \(\mathcal {T}_{0}\).

5.2 Results on semantic segmentation

Figure 9 presents the experimental results for the semantic segmentation machine task. These results compare the relative training time and BD-rate scores against the anchor \(\mathcal {T}_{0}\) using mIoU as the measure of machine performance. As there is no overlap between the anchor \(\mathcal {T}_{0}\) and the baseline procedure, the BD-rate of the baseline procedure cannot be computed. Consequently, the BD-rate of the baseline* procedure is reported instead. In addition, Fig. 10 illustrates the trade-offs between bpp and mIoU; shown curves refer to the points surrounded with dashed circles in Fig. 9.

On the one hand, some outcomes align with the findings from Sect. 5.1 regarding the image classification machine task. As an example, the use of the baseline* procedure resulted in an increase of \(98.0\%\) BD-rate. Similar to the BD-rate of the baseline procedure in the image classification experiment, the BD-rate value of \(98.0\%\) is not reliable because of the very low overlap between the anchor \(\mathcal {T}_{0}\) and the baseline* procedure. The training procedure \(\mathcal {T}_{3}\) also offers BD-rate reductions. Relative to the anchor \(\mathcal {T}_{0}\) and with a lower training time, \(\mathcal {T}_{3}\) achieves a \(-7.4\%\) BD-rate decrease at equivalent mIoU for the semantic segmentation machine task. Additionally, up to \(-9.4\%\) BD-rate reduction is achieved with a \(116.1\%\) relative training time. As observed for the image classification experiment in Fig. 8, a similar behavior for the semantic segmentation machine task is shown in Fig. 10, where the proposed training procedure \(\mathcal {T}_{3}\) increasingly outpaces the anchor \(\mathcal {T}_{0}\) as the bitrate diminishes.

On the other hand, four main differences are observed between the results on image classification and semantic segmentation, explained as follows:

  I. Observation: There is no statistically significant difference between \(\mathcal {T}_{1}\) and \(\mathcal {T}_{3}\), which use the progressive training procedure alone or jointly with the added regularization term, respectively. Explanation: DNNs trained on the Cityscapes dataset are more prone to overfitting than those trained on ImageNet, as Cityscapes does not incorporate any coding artifacts and is composed of several orders of magnitude fewer images. In such a context, the added regularization term in \(\mathcal {T}_{2}\) and \(\mathcal {T}_{3}\) might not be sufficient to prevent strong overfitting phenomena.

  II. Observation: The machine performance reached after convergence is subject to a higher variance across identical training runs compared to the experiment on image classification. One symptom of this observation is the noisy aspect of the curves associated with training procedures \(\mathcal {T}_{1}\) and \(\mathcal {T}_{3}\). Explanation: As Cityscapes is a smaller dataset in terms of the number of available training images, the DNN performance that is reached at convergence tends to be noisy from one training run to another.

  III. Observation: The DA training procedure achieves greater coding efficiency over the anchor \(\mathcal {T}_{0}\) on the semantic segmentation machine task, while the opposite was observed on the image classification machine task. Explanation: It is arguable that a dataset with fewer training samples will exhibit greater benefits from the DA training procedure. Indeed, by training one DNN on images encoded with random JPEG qualities, DA artificially augments the number of unique training samples encountered at training time. As DA achieves greater coding efficiency than the anchor \(\mathcal {T}_{0}\) on the semantic segmentation experiment, such benefits from DA appear to outweigh the limitation of utilizing a single set of model weights to generalize across every degradation type.

  IV. Observation: Although BD-rate scores between the two experiments cannot be directly compared as the employed machine performance measures are different, greater bitrate reduction seems to be obtained for the semantic segmentation machine task. Explanation: The Cityscapes dataset does not incorporate a single JPEG compressed image. As a consequence and compared to ImageNet, there is a higher statistical shift between images on which the baseline machine model has been pre-trained and compressed images on which machine models are re-trained through \(\mathcal {T}_{0}\), DA, \(\mathcal {T}_{1}\), \(\mathcal {T}_{2}\), and \(\mathcal {T}_{3}\). Therefore, it is arguable that DA, \(\mathcal {T}_{1}\), and \(\mathcal {T}_{3}\) provide greater coding efficiency for the semantic segmentation machine task as these training procedures artificially augment the number of unique samples encountered at training time by utilizing a variety of JPEG qualities throughout the training. This explanation is further supported by the observation that baseline* consistently outperforms the baseline procedure for all JPEG qualities \(Q = \{30, 20, 15, 10, 7, 5\}\). This demonstrates that, for DNN models, high JPEG quality images share more similarities with lower JPEG qualities than pristine images do.

Fig. 9 Trade-off between relative training time and BD-rate compared to the anchor \(\mathcal {T}_{0}\) for the semantic segmentation machine task. Each point of \(\mathcal {T}_{1}\) and \(\mathcal {T}_{3}\) is obtained with convergence speed \(s \in \{0.03, 0.05, 0.075, 0.1\}\). Points surrounded with dashed circles are those whose curves between bitrate and mIoU are shown in Fig. 10

Fig. 10 Comparison of bpp and mIoU for the pre-trained baseline procedure, the anchor \(\mathcal {T}_{0}\), and the training procedure \(\mathcal {T}_{3}\) with convergence speed \(s=0.05\). Shown curves between bitrate and mIoU are those whose points are surrounded with dashed circles in Fig. 9

6 Discussion

While the proposed training procedure \(\mathcal {T}_{3}\) offers BD-rate reduction over the anchor \(\mathcal {T}_{0}\) for both considered machine tasks, its non-adaptive nature should be noted. For both the image classification and semantic segmentation experiments, the slowest convergence speed \(s\) is set to \(0.03\). Consequently, the number of epochs required to reach the JPEG compression with the lowest quality is the same for both machine tasks. However, the training procedure \(\mathcal {T}_{3}\) with the slowest convergence speed \(s=0.03\) requires \(1.09\) and \(1.91\) times the training time of the anchor \(\mathcal {T}_{0}\) for the image classification and semantic segmentation machine tasks, respectively. This observation highlights that the required number of epochs for the anchor \(\mathcal {T}_{0}\) to converge is lower for the semantic segmentation machine task. Therefore, it might be more suitable for this experiment to increase the convergence speed towards quality \(Q _{\infty }\) in an automated and adaptive manner. Instead of using an arbitrary fixed exponential decay function as in Eq. 1 to lower image quality as the training progresses, one potential solution would be to continue training on a given image quality level until signs of overfitting emerge. Overfitting could be detected by assessing the machine performance on both the training and validation sets at each epoch. Similarly to \(\mathcal {T}_{1}\) and \(\mathcal {T}_{3}\), coding distortions would succeed one another from lower to higher distortion strengths as the training advances. Evaluating this training procedure against existing approaches would require a separate testing set, as the validation set is already employed during training to detect overfitting. Since ground truth labels for the testing set may not be provided for some datasets [18, 19], cross-validation could be employed to build separate training, validation, and testing sets.
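As a purely hypothetical sketch of the adaptive variant discussed above (an assumption, not an evaluated method), the quality step could be triggered by a patience-based overfitting check on the validation metric:

```python
def next_quality(current_q, qualities, val_history, patience=2):
    """qualities is ordered from mildest to strongest distortion; val_history
    holds the per-epoch validation metric measured while training at current_q.
    Move to the next, stronger distortion once the metric stops improving."""
    if len(val_history) > patience and max(val_history[-patience:]) < max(val_history[:-patience]):
        idx = qualities.index(current_q)
        return qualities[min(idx + 1, len(qualities) - 1)]
    return current_q
```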

Another limitation of \(\mathcal {T}_{1}\), \(\mathcal {T}_{2}\), and \(\mathcal {T}_{3}\) resides in the use of separate DNN weights for each coding distortion. For instance, the experiment detailed in Sect. 4 necessitated the utilization of \(\# Q = 6\) distinct DNN weights for these training procedures, as the experiment is based on the JPEG encoder with qualities \(Q = \{30, 20, 15, 10, 7, 5\}\). These DNN weights exhibit strong redundancies, as small variations in quality lead to artifacts that are similar in terms of type and strength. When comparing the re-trained ResNet50 [22] weights from the image classification experiment to those used at initialization, very small variations in the kernel convolution filters have been observed, where the order of magnitude of the differences is around \(10^{-4}\). Consequently, the Neural Network Compression and Representation (NNR) [53] standard developed by MPEG could be employed to represent re-trained DNN weights in a more compact manner by compressing the differences relative to the initialization weights. Furthermore, as shown in the previous work [17], deep learning-based machine algorithms tend to generalize better when the coding distortion during the evaluation phase shares similarities with the one used at training. This generalization capability could be exploited by employing a single DNN model for multiple coding distortions that share similarities, such as images encoded with similarly high quantization parameters or low quality levels. To ensure that a single DNN model delivers accurate predictions for a diverse set of reconstructed image qualities, each batch at training time could consist of images compressed by one specific coding distortion sampled from a predefined Gaussian probability distribution. These proposals can be regarded as a compromise between having separate DNN model weights for each coding distortion [16] and the data augmentation [15] training procedure that relies on a single DNN model but may result in BD-rate increase.

7 Conclusion

This work studied several advanced fine-tuning procedures to enhance the resilience of machine task algorithms to coding artifacts. It complemented existing standardization efforts of MPEG/JVET on VCM by solely addressing the re-training aspect of machine tasks. In particular, three training procedures were proposed: \(\mathcal {T}_{1}\), which increases the distortion strength progressively as the training advances; \(\mathcal {T}_{2}\), which incorporates an additional regularization term into the original loss function used in training; and \(\mathcal {T}_{3}\), which combines \(\mathcal {T}_{1}\) and \(\mathcal {T}_{2}\). As these procedures only involve the offline re-training process, they have no impact on online inference or encoding time once deployed in real-world applications. Moreover, they allow the training time to be traded off against the achieved coding gain.

The proposed training procedures were evaluated through extensive experiments on two highly dissimilar machine tasks and datasets: image classification on the ImageNet dataset and semantic segmentation on the Cityscapes dataset. The results show that our joint training procedure \(\mathcal {T}_{3}\) obtains a \(2.4\%\) BD-rate reduction in image classification and a \(7.4\%\) BD-rate reduction in semantic segmentation, with a reduced training time in both cases. A slight increase in training time can bring up to a \(9.4\%\) BD-rate reduction for semantic segmentation.

Future work on the proposed training procedures includes adaptively adjusting the convergence speed from slightly to highly distorted images. Additionally, the trained DNN models should be generalized across various coding conditions to ease the deployment of such training procedures in practical use cases.

Availability of data and materials

Not applicable.

References

  1. G.K. Wallace, The JPEG still picture compression standard. Commun. ACM 34(4), 30–44 (1991). https://doi.org/10.1145/103085.103089

  2. J. Lainema, M.M. Hannuksela, V.K.M. Vadakital, E.B. Aksu, HEVC still image coding and high efficiency image file format. In: 2016 IEEE International Conference on Image Processing (ICIP), 71–75 (2016). https://doi.org/10.1109/ICIP.2016.7532321

  3. T. Wiegand, G.J. Sullivan, G. Bjøntegaard, A. Luthra, Overview of the H.264/AVC video coding standard. IEEE Trans. Circuits Syst. Video Technol. 13(7), 560–576 (2003). https://doi.org/10.1109/TCSVT.2003.815165

  4. G.J. Sullivan, J. Ohm, W. Han, T. Wiegand, Overview of the High Efficiency Video Coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol. 22(12), 1649–1668 (2012). https://doi.org/10.1109/TCSVT.2012.2221191

  5. B. Bross, Y.-K. Wang, Y. Ye, S. Liu, J. Chen, G.J. Sullivan, J.-R. Ohm, Overview of the Versatile Video Coding (VVC) standard and its applications. IEEE Trans. Circuits Syst. Video Technol. 31(10), 3736–3764 (2021). https://doi.org/10.1109/TCSVT.2021.3101953

  6. Y. Zhang, P. Dong, MPEG-M49944: Report of the AhG on VCM. Moving Picture Experts Group (MPEG) of ISO/IEC JTC1/SC29/WG11, Geneva, Switzerland, Tech. Rep. (2019)

  7. L. Duan, J. Liu, W. Yang, T. Huang, W. Gao, Video coding for machines: a paradigm of collaborative compression and intelligent analytics. IEEE Trans. Image Process. 29, 8680–8695 (2020). https://doi.org/10.1109/TIP.2020.3016485

  8. S. Dodge, L. Karam, Understanding how image quality affects deep neural networks. In: 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX), 1–6 (2016). https://doi.org/10.1109/QoMEX.2016.7498955

  9. P. Roy, S. Ghosh, S. Bhattacharya, U. Pal, Effects of degradations on deep neural network architectures. CoRR abs/1807.10108 (2018). arXiv:1807.10108

  10. M. Dejean-Servières, K. Desnos, K. Abdelouahab, W. Hamidouche, L. Morin, M. Pelcat, Study of the impact of standard image compression techniques on performance of image classification with a convolutional neural network. PhD Thesis, INSA Rennes; Univ Rennes; IETR; Institut Pascal (2017)

  11. D. Hendrycks, T.G. Dietterich, Benchmarking neural network robustness to common corruptions and perturbations. CoRR abs/1903.12261 (2019). arXiv:1903.12261

  12. K. Fischer, C. Herglotz, A. Kaup, On intra video coding and in-loop filtering for neural object detection networks. In: 2020 IEEE International Conference on Image Processing (ICIP), 1147–1151 (2020). https://doi.org/10.1109/ICIP40778.2020.9191023

  13. K. Fischer, C. Forsch, C. Herglotz, A. Kaup, Analysis of neural image compression networks for machine-to-machine communication. In: 2021 IEEE International Conference on Image Processing (ICIP), 2079–2083 (2021). https://doi.org/10.1109/ICIP42928.2021.9506763

  14. J. Löhdefink, A. Bär, N.M. Schmidt, F. Hüger, P. Schlicht, T. Fingscheidt, On low-bitrate image compression for distributed automotive perception: higher peak SNR does not mean better semantic segmentation. In: 2019 IEEE Intelligent Vehicles Symposium (IV), 424–431 (2019). https://doi.org/10.1109/IVS.2019.8813813

  15. K. Fischer, C. Blum, C. Herglotz, A. Kaup, Robust deep neural object detection and segmentation for automotive driving scenario with compressed image data. In: 2021 IEEE International Symposium on Circuits and Systems (ISCAS), 1–5 (2021). https://doi.org/10.1109/ISCAS51556.2021.9401621

  16. A. Marie, K. Desnos, L. Morin, L. Zhang, Video coding for machines: large-scale evaluation of deep neural networks robustness to compression artifacts for semantic segmentation. In: 2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP), Shanghai, China (2022). https://doi.org/10.1109/MMSP55362.2022.9949999. https://hal.science/hal-03831514

  17. A. Marie, K. Desnos, L. Morin, L. Zhang, Expert training: enhancing AI resilience to image coding artifacts. In: Electronic Imaging, Image Processing: Algorithms and Systems XX, San Francisco, United States (2022). https://hal.archives-ouvertes.fr/hal-03555716

  18. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848

  19. M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

  20. S. Dodge, L. Karam, Human and DNN classification performance on images with quality distortions: a comparative study. ACM Trans. Appl. Percept. (TAP) 16(2), 1–17 (2019). (Publisher: ACM New York, NY, USA)

  21. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (NIPS'12), 1097–1105. Curran Associates Inc., Red Hook, NY, USA (2012)

  22. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

  23. Z. Sun, M. Ozay, Y. Zhang, X. Liu, T. Okatani, Feature quantization for defending against distortion of images. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7957–7966 (2018). https://doi.org/10.1109/CVPR.2018.00830

  24. S. Zheng, Y. Song, T. Leung, I. Goodfellow, Improving the robustness of deep neural networks via stability training. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4480–4488 (2016). https://doi.org/10.1109/CVPR.2016.485

  25. H. Kannan, A. Kurakin, I.J. Goodfellow, Adversarial logit pairing. CoRR abs/1803.06373 (2018). arXiv:1803.06373

  26. Y. Jing, Y. Yang, Z. Feng, J. Ye, Y. Yu, M. Song, Neural style transfer: a review. IEEE Trans. Vis. Comput. Gr. 26(11), 3365–3385 (2020). https://doi.org/10.1109/TVCG.2019.2921336

  27. R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F.A. Wichmann, W. Brendel, ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231 (2018)

  28. S. Ghosh, R. Shet, P. Amon, A. Hutter, A. Kaup, Robustness of deep convolutional neural networks for image degradations. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2916–2920 (2018). https://doi.org/10.1109/ICASSP.2018.8461907

  29. S.P. Kannojia, G. Jaiswal, Effects of varying resolution on performance of CNN based image classification: an experimental study. Int. J. Comput. Sci. Eng. 6(9), 451–456 (2018)

  30. S. Suvash, H. Christopher, C. Daniel, D. Matt, J.E. Ball, T. Bo, G. Chris, D. Lalitha, Performance analysis of semantic segmentation algorithms trained with JPEG compressed datasets. Proc. SPIE 11401 (2020). https://doi.org/10.1117/12.2557928

  31. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft COCO: Common objects in context, in Computer Vision – ECCV 2014. ed. by D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Springer, Cham, 2014), pp. 740–755

  32. Y. Bengio, J. Louradour, R. Collobert, J. Weston, Curriculum learning. In: Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), 41–48. Association for Computing Machinery, New York, NY, USA (2009). https://doi.org/10.1145/1553374.1553380

  33. P. Soviany, R.T. Ionescu, P. Rota, N. Sebe, Curriculum learning: a survey. Int. J. Comput. Vis. 130(6), 1526–1565 (2022). https://doi.org/10.1007/s11263-022-01611-x

  34. M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, Q.V. Le, MnasNet: platform-aware neural architecture search for mobile. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2815–2823 (2019). https://doi.org/10.1109/CVPR.2019.00293

  35. L. Kong, R. Dai, Object-detection-based video compression for wireless surveillance systems. IEEE MultiMedia 24(2), 76–85 (2017). https://doi.org/10.1109/MMUL.2017.29

  36. M. Aqqa, P. Mantini, S.K. Shah, Understanding how video quality affects object detection algorithms. In: VISIGRAPP (5: VISAPP), 96–104 (2019)

  37. K. Fischer, F. Brand, C. Herglotz, A. Kaup, Video coding for machines with feature-based rate-distortion optimization. In: 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP), 1–6 (2020). https://doi.org/10.1109/MMSP48831.2020.9287136

  38. B. Stabernack, F. Steinert, Architecture of a low latency H.264/AVC video codec for robust ML based image classification. In: Workshop on Design and Architectures for Signal and Image Processing (14th Edition), DASIP '21, 1–9. Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3441110.3441149

  39. N. Le, H. Zhang, F. Cricri, R. Ghaznavi-Youvalari, E. Rahtu, Image coding for machines: an end-to-end learned approach. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1590–1594 (2021). https://doi.org/10.1109/ICASSP39728.2021.9414465

  40. K. Fischer, F. Fleckenstein, C. Herglotz, A. Kaup, Saliency-driven versatile video coding for neural object detection. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1505–1509 (2021). https://doi.org/10.1109/ICASSP39728.2021.9415048

  41. K. Fischer, M. Hofbauer, C. Kuhn, E. Steinbach, A. Kaup, Evaluation of video coding for machines without ground truth. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1616–1620 (2022). https://doi.org/10.1109/ICASSP43922.2022.9747633

  42. G. Hudson, A. Léger, B. Niss, I. Sebestyén, J. Vaaben, JPEG-1 standard 25 years: past, present, and future reasons for a success. J. Electron. Imaging 27(4), 040901 (2018). https://doi.org/10.1117/1.JEI.27.4.040901. (Publisher: SPIE)

  43. F. Bellard, Better portable graphics (2014). https://bellard.org/bpg/

  44. K. Suehring, H.264/AVC Software coordination JM reference software (2003). https://avc.hhi.fraunhofer.de/

  45. VideoLAN Organisation, x265 software library (2013). https://www.videolan.org/developers/x265.html

  46. J. Brandenburg, A. Wieckowski, T. Hinz, B. Bross, VVenC Fraunhofer Versatile Video Encoder (2020)

  47. L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)

  48. MMSegmentation Contributors, MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark (2020). https://github.com/open-mmlab/mmsegmentation

  49. D.P. Kingma, J. Ba, Adam: a method for stochastic optimization. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015). http://arxiv.org/abs/1412.6980

  50. G. Bjøntegaard, Improvements of the BD-PSNR model. ITU-T SG16 Q.6, Doc. VCEG-AI11 (2008)

  51. S. Liu, C. Hollman, Common test conditions for video coding for machines. In: JVET-AF2031-v1, Rennes, France (2024)

  52. K. El Khoury, M. Fockedey, E. Brion, B. Macq, Improved 3D U-Net robustness against JPEG 2000 compression for male pelvic organ segmentation in radiotherapy. J. Med. Imaging 8(4), 1–20 (2021). https://doi.org/10.1117/1.JMI.8.4.041207

  53. H. Kirchhoffer, P. Haase, W. Samek, K. Müller, H. Rezazadegan-Tavakoli, F. Cricri, E.B. Aksu, M.M. Hannuksela, W. Jiang, W. Wang, S. Liu, S. Jain, S. Hamidi-Rad, F. Racapé, W. Bailer, Overview of the Neural Network Compression and Representation (NNR) standard. IEEE Trans. Circuits Syst. Video Technol. 32(5), 3203–3216 (2022). https://doi.org/10.1109/TCSVT.2021.3095970

Acknowledgements

Not applicable.

Funding

Open access funding provided by Tampere University (including Tampere University Hospital). This work was supported by the Research Council of Finland (Decision No. 349216).

Author information

Contributions

A. Marie reviewed the literature, conceived and implemented the proposed contributions, conducted the experiments, interpreted the results, and wrote the manuscript. K. Desnos supervised the research and helped with the writing of the manuscript. A. Mercat actively worked on the writing of the manuscript. J. Vanne helped with the writing of the manuscript. L. Morin and L. Zhang supervised the research.

Corresponding author

Correspondence to Alban Marie.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Marie, A., Desnos, K., Mercat, A. et al. Advanced fine-tuning procedures to enhance DNN robustness in visual coding for machines. J Image Video Proc. 2024, 31 (2024). https://doi.org/10.1186/s13640-024-00650-3

Keywords