RevNet and iRevNet architectures implement reversible transformations at the level of residual blocks. As we have seen in the previous section, the design of these reversible blocks creates a local memory bottleneck: all hidden activations within a reversible block must be computed before the gradients are backpropagated through the block. To circumvent this local bottleneck, we introduce layer-wise invertible operations in section 4.2. However, these invertible operations introduce numerical errors, which we characterize in the following subsections; in section 5, we show that these numerical errors lead to instabilities that degrade the model accuracy. Hence, in "Hybrid architecture", we propose a hybrid model combining layer-wise and residual block-wise reversible operations to stabilize training while resolving the local memory bottleneck, at the cost of a small computational overhead. Section 4.1 starts by motivating the need for, and the methodology of, our numerical error analysis.
Numerical error analysis
Invertible networks are defined as the composition of invertible operations. During the backward pass, each operation is supposed to reconstruct its input x given the value of its output y using its inverse function:
$$\begin{aligned} y&= f(x), \end{aligned}$$
(15a)
$$\begin{aligned} x&= f^{-1}(y). \end{aligned}$$
(15b)
In reality, however, the output of the network is only an approximation of its true analytical value due to the finite precision of floating-point arithmetic: \(\hat{y}=y+\epsilon _y\). Hence, the input \(\hat{x}\) reconstructed by the inverse operation carries a noise \(\epsilon _x\) induced by the noise \(\epsilon _y\) in the output, and this error propagates through the successive inverse computations.
$$\begin{aligned} \hat{x}&= f^{-1}(y+\epsilon _y), \end{aligned}$$
(16a)
$$\begin{aligned} \hat{x}&= (x+\epsilon _x), \end{aligned}$$
(16b)
$$\begin{aligned} \epsilon _x&= f^{-1}(y+\epsilon _y) - x. \end{aligned}$$
(16c)
The operation f may refer either to an individual layer, as in the layer-wise invertible architecture we propose in this paper, or to a full residual block, as in the reversible blocks of RevNet and iRevNet.
For each operation, we can compute the signal-to-noise ratio (SNR) of its output and input, respectively:
$$\begin{aligned} snr_o&= \frac{|y|^2}{|\epsilon _y|^2}, \end{aligned}$$
(17a)
$$\begin{aligned} snr_i&= \frac{|x|^2}{|\epsilon _x|^2}. \end{aligned}$$
(17b)
We are interested in characterizing the SNR reduction factor \(\alpha\) introduced by the inverse reconstruction:
$$\begin{aligned} \alpha = \frac{snr_i}{snr_o}. \end{aligned}$$
(18)
Indeed, given a layer i in a network, its input \(z_i\) is reconstructed from the noisy network output \(\hat{y}\) by composing the inverses of all layers between layer i and the output. Hence, the noise \(\epsilon _i\) in the reconstructed, noisy input \(\hat{z_i}\) can be computed as:
$$\begin{aligned} \hat{z_i}&= z_i + \epsilon _i, \end{aligned}$$
(19a)
$$\begin{aligned} \hat{z_i}&= f_i^{-1} \circ f_{i+1}^{-1} \circ \ldots \circ f_N^{-1}(\hat{y}), \end{aligned}$$
(19b)
$$\begin{aligned} | \epsilon _i |^2&= \frac{| \epsilon _y |^2 \times | z_i |^2}{| y |^2} \times \prod _{j=i}^{N} \frac{1}{\alpha _j}. \end{aligned}$$
(19c)
As \(z_i\) is used in the computation of layer i’s weight gradients according to Eq. 4, accumulated errors yield noisy gradients that prevent the network from converging once the SNR drops below a certain level. Hence, it is important to characterize the factor \(\alpha\) of the different invertible layers proposed below.
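In practice, the factor \(\alpha\) of a candidate invertible operation can also be estimated empirically by injecting a small perturbation at its output and comparing the SNRs before and after the inverse reconstruction. The sketch below illustrates such a measurement; the helper names and the noise scale are ours, and the Leaky ReLU used as an example follows the invertible form introduced below.

```python
import torch

def snr(signal, noise):
    """Signal-to-noise ratio |signal|^2 / |noise|^2."""
    return signal.pow(2).sum() / noise.pow(2).sum()

def estimate_alpha(f, f_inv, x, noise_scale=1e-4):
    """Empirical estimate of alpha = snr_i / snr_o for an invertible
    operation (f, f_inv) evaluated around the input x."""
    y = f(x)
    eps_y = noise_scale * y.abs().mean() * torch.randn_like(y)  # output noise
    x_hat = f_inv(y + eps_y)   # noisy reconstruction of the input
    eps_x = x_hat - x          # reconstruction error
    return (snr(x, eps_x) / snr(y, eps_y)).item()

# Example: invertible Leaky ReLU with negative slope parameter n = 2
n = 2.0
f = lambda x: torch.where(x > 0, x, x / n)
f_inv = lambda y: torch.where(y > 0, y, y * n)
print(estimate_alpha(f, f_inv, torch.randn(100_000)))  # close to 0.64 for this slope
```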
Layer-wise invertibility
In this section, we present invertible layers that act as drop-in replacements for convolution, batch normalization, pooling and non-linearity layers. We then characterize the numerical instabilities arising from the invertible batch normalization and non-linearities.
Invertible batch normalization
As batch normalization is not a bijective operation, it does not admit an analytical inverse. However, the inverse reconstruction of a batch normalization layer can be realized at minimal memory cost. Given the affine parameters \(\beta\) and \(\gamma\), which set the first- and second-order moments of the output, the forward f and inverse \(f^{-1}\) operations of an invertible batch normalization layer can be computed as follows:
$$\begin{aligned} y = f(x)&= \gamma \times \frac{x - \hat{x}}{\sqrt{\dot{x}} + \epsilon } + \beta , \end{aligned}$$
(20a)
$$\begin{aligned} x = f^{-1}(y, \hat{x}, \dot{x})&= (\sqrt{\dot{x}} + \epsilon ) \times \frac{y - \beta }{\gamma } + \hat{x}, \end{aligned}$$
(20b)
where \(\hat{x}\) and \(\dot{x}\) represent the mean and variance of x, respectively. Hence, the input activation x can be recovered from y through \(f^{-1}\) at the minimal memory cost of storing the input activation statistics \(\hat{x}\) and \(\dot{x}\).
The formula for the SNR reduction factor of the batch normalization is given below:
$$\begin{aligned} \alpha = \frac{\sum _i (\hat{x}_i^2 + \dot{x_i})}{\sum _i (\gamma _i^2 + \beta _i^2)} \times \frac{c}{\sum _i \frac{\sqrt{\dot{x_i}}+\epsilon }{\gamma _i}}, \end{aligned}$$
(21)
in which c represents the number of channels. The full proof of this formula is given in the Appendix. The only assumption made by this proof is that both the input x and the output noise \(\epsilon _y\) are identically distributed across all channels, which we have found to hold true in practice.
In essence, numerical instabilities in the inverse computation of the batch normalization layer arise from the fact that the signals in different channels i and j are amplified by different factors \(\gamma _i\) and \(\gamma _j\). While the signal amplifications in the forward and inverse paths cancel each other out (\(x=f^{-1}(f(x))\)), the noise introduced at the output only goes through the inverse pass and is thus amplified without cancellation, which degrades the reconstructed signal.
We verify the validity of Eq. (21) by empirically evaluating the \(\alpha\) ratios yielded by a toy parameterization of the batch normalization using only two channels with \(\gamma = [1, \rho ]\). This toy parameterization is also used in the proof given in the Appendix; the factor \(\rho\) represents the imbalance in the multiplicative factor between the two channels. Figure 5 shows the evolution of \(\alpha\) through our toy layer for different values of the factor \(\rho\), which we find to closely match the theoretical results we derived.
Finally, we propose the following modification, introducing the hyperparameter \(\epsilon _i\), to the invertible batch normalization layer:
$$\begin{aligned} y = f(x)&= |\gamma + \epsilon _i| \times \frac{x - \hat{x}}{\sqrt{\dot{x}} + \epsilon } + \beta , \end{aligned}$$
(22a)
$$\begin{aligned} x = f^{-1}(y)&= (\sqrt{\dot{x}} + \epsilon ) \times \frac{y - \beta }{|\gamma + \epsilon _i|} + \hat{x}. \end{aligned}$$
(22b)
The introduction of the \(\epsilon _i\) hyperparameter serves two purposes: first, it stabilizes the numerical errors described above by lower bounding the smallest effective \(\gamma\) factors, which limits the imbalance across channels. Second, it prevents the numerical instabilities that would otherwise arise in the inverse computation from dividing by \(\gamma\) parameters that tend towards zero.
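As an illustration, a minimal sketch of this modified invertible batch normalization is given below (training-mode statistics only; the class and attribute names, as well as the default value of \(\epsilon _i\), are ours):

```python
import torch

class InvertibleBatchNorm(torch.nn.Module):
    """Sketch of the modified invertible batch normalization of Eqs. (22a)-(22b)."""

    def __init__(self, num_channels, eps=1e-5, eps_i=0.1):
        super().__init__()
        self.gamma = torch.nn.Parameter(torch.ones(num_channels))
        self.beta = torch.nn.Parameter(torch.zeros(num_channels))
        self.eps = eps      # standard batch-norm epsilon
        self.eps_i = eps_i  # lower bound on the effective scale; value here is illustrative

    def _scale(self):
        return (self.gamma + self.eps_i).abs().view(1, -1, 1, 1)

    def forward(self, x):
        # Per-channel batch statistics over (batch, height, width); only these
        # two small tensors need to be kept for the inverse pass.
        self.mean = x.mean(dim=(0, 2, 3), keepdim=True)
        self.var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
        beta = self.beta.view(1, -1, 1, 1)
        return self._scale() * (x - self.mean) / (self.var.sqrt() + self.eps) + beta

    def inverse(self, y):
        beta = self.beta.view(1, -1, 1, 1)
        return (self.var.sqrt() + self.eps) * (y - beta) / self._scale() + self.mean
```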
Invertible activation function
A good invertible activation function must be bijective (to guarantee the existence of an inverse function) and non-saturating (for numerical stability). For these properties, we focus our attention on Leaky ReLUs whose forward f and inverse \(f^{-1}\) computations are defined, for a negative slope parameter n, as follows:
$$\begin{aligned} y&= f(x) = {\left\{ \begin{array}{ll} x, &{} \text {if}\ x>0 \\ x / n, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(23a)
$$\begin{aligned} x&= f^{-1}(y) = {\left\{ \begin{array}{ll} y, &{} \text {if}\ y>0 \\ y \times n, &{} \text {otherwise} \end{array}\right. }. \end{aligned}$$
(23b)
As derived in the Appendix, following a proof similar to that of the batch normalization, we obtain the following formula for the SNR reduction factor:
$$\begin{aligned} \alpha = \frac{4}{(1+\frac{1}{n^2}) \times (1 + n^2)}. \end{aligned}$$
(24)
Hence, numerical errors can be controlled by setting the value of the negative slope n. As n tends towards 1, \(\alpha\) converges to 1, yielding minimal signal degradation; however, the network then tends toward a linear behavior, which hurts the model expressivity. Figure 6 shows the evolution of the SNR degradation \(\alpha\) for different negative slopes n; in section 5, we investigate the impact of the negative slope parameter on the model accuracy.
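For instance, a negative slope of \(n=2\) already yields a noticeable per-layer degradation:
$$\begin{aligned} \alpha = \frac{4}{\left(1+\frac{1}{4}\right) \times (1 + 4)} = \frac{4}{6.25} = 0.64, \end{aligned}$$
i.e., each such invertible Leaky ReLU reduces the SNR of the reconstructed signal by roughly \(10\log _{10}(0.64) \approx -1.9\) dB.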
It should be noted that this equation only holds for the regime \(|y|^2 \gg |\epsilon _y|^2\). When the noise reaches an amplitude similar to or greater than the activation signal, this equation no longer holds. However, in this regime, the signal-to-noise ratio becomes too low for training to converge, as numerical errors prevent any useful weight update. We have thus left the problem of characterizing this regime open.
Invertible convolutions
Invertible convolution layers can be defined in several ways. The inverse operation of a convolution is often referred to as a deconvolution, and it only exists for a subset of the kernel weight space.
However, deconvolutions are computationally expensive and prone to numerical errors. Instead, we implement invertible convolutions using the channel partitioning scheme of the reversible block for its simplicity, numerical stability and computational efficiency. Hence, invertible convolutions, in our architecture, can be seen as minimal reversible blocks in which each of the two modules consists of a single convolution. Gomez et al. [6] found the numerical errors introduced by reversible blocks to have no impact on the model accuracy. Similarly, we found reversible blocks extremely stable, yielding negligible numerical errors compared to the invertible Batch Normalization and Leaky ReLU layers (\(\alpha _{BN} \ll \alpha _{Rev}\) and \(\alpha _{LReLU} \ll \alpha _{Rev}\), with \(\alpha _{Rev}\) close to 1).
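A minimal sketch of such an invertible convolution, written as an additive coupling in which each module is a single convolution, is given below (class and attribute names are ours):

```python
import torch

class InvertibleConv(torch.nn.Module):
    """Invertible convolution as a minimal reversible block: the input is split
    into two channel halves, and each half is updated with a convolution of the
    other half (additive coupling)."""

    def __init__(self, num_channels, kernel_size=3):
        super().__init__()
        half = num_channels // 2
        pad = kernel_size // 2
        self.f = torch.nn.Conv2d(half, half, kernel_size, padding=pad)
        self.g = torch.nn.Conv2d(half, half, kernel_size, padding=pad)

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return torch.cat([y1, y2], dim=1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=1)
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return torch.cat([x1, x2], dim=1)
```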
Pooling
In [7], the authors propose an invertible pooling operation that stacks the neighboring elements of each pooling region along the channel dimension. As noted in section 3.5, the resulting increase in channel size at each pooling level induces a quadratic increase in the number of parameters of the subsequent convolutions, which creates a new memory bottleneck.
To circumvent this quadratic increase in the memory cost of the weights, we propose a new pooling layer that stacks the elements of each pooling region along the batch dimension instead of the channel dimension. We refer to these two kinds of pooling as channel pooling \(\mathcal {P}_c\) and batch pooling \(\mathcal {P}_b\), respectively, according to the dimension along which activation features are stacked. Given a \(2 \times 2\) pooling region and an input activation tensor x of dimensions \(bs \times c \times h \times w\), where bs refers to the batch size, c to the number of channels and \(h \times w\) to the spatial resolution, the reshaping operation performed by both pooling layers can be formalized as follows:
$$\begin{aligned} \mathcal {P}_c :&x \rightarrow y \end{aligned}$$
(25a)
$$\begin{aligned} :&\mathbb {R}^{bs \times c \times h \times w} \rightarrow \mathbb {R}^{bs \times 4c \times \frac{h}{2} \times \frac{w}{2}} \end{aligned}$$
(25b)
$$\begin{aligned} \mathcal {P}_b :&x \rightarrow y \end{aligned}$$
(25c)
$$\begin{aligned} :&\mathbb {R}^{bs \times c \times h \times w} \rightarrow \mathbb {R}^{4bs \times c \times \frac{h}{2} \times \frac{w}{2}}. \end{aligned}$$
(25d)
Channel pooling performs a volume-preserving pooling operation while increasing the number of channels at a given level of the architecture, whereas batch pooling performs a volume-preserving pooling operation while keeping the number of channels constant. By alternating between channel and batch pooling, we can control the number of channels at each pooling level of the model’s architecture.
As this pooling operation only performs a reshaping between input and output, it does not induce any numerical error: \(\alpha _{Pool}=1.\)
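For a \(2 \times 2\) pooling region, both variants reduce to pure tensor reshapes; a minimal sketch is given below (function names are ours, and the inverses simply reverse the same reshaping steps):

```python
import torch

def channel_pool(x):
    """(bs, c, h, w) -> (bs, 4c, h/2, w/2): stack 2x2 neighbours along channels."""
    bs, c, h, w = x.shape
    x = x.view(bs, c, h // 2, 2, w // 2, 2)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(bs, 4 * c, h // 2, w // 2)

def batch_pool(x):
    """(bs, c, h, w) -> (4bs, c, h/2, w/2): stack 2x2 neighbours along the batch."""
    bs, c, h, w = x.shape
    x = x.view(bs, c, h // 2, 2, w // 2, 2)
    x = x.permute(3, 5, 0, 1, 2, 4).contiguous()
    return x.view(4 * bs, c, h // 2, w // 2)
```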
Layer-wise invertible architecture
Putting together the above building blocks, Fig. 7 illustrates a layer-wise invertible architecture. The peak memory usage for a training iteration of this architecture, as parameterized in Table 1, can be computed as follows:
$$\begin{aligned} \mathcal {M}&= M_{\theta } + M_{z} + M_{g} \end{aligned}$$
(26a)
$$\begin{aligned}&= M_{\theta } + (M_z^{\prime} + M_{g}^{\prime}) \times (h \times w \times bs) \end{aligned}$$
(26b)
$$\begin{aligned}&= 29.6 \times 10^6 + 320 \times (h \times w \times bs). \end{aligned}$$
(26c)
Training an iteration over a typical batch of 32 images with resolution \(240 \times 240\) would require \(\mathcal {M}=590\)MB of VRAM. As in the RevNet architecture, reconstructing the hidden activations by inverse transformations during the backward pass adds a computational cost roughly equivalent to a second forward pass.
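As a rough check of the 590 MB figure, and assuming the per-unit activation cost \(M_z^{\prime} + M_{g}^{\prime} = 320\) of Table 1 is expressed in bytes per spatial position and per sample, the activation term alone amounts to:
$$\begin{aligned} 320 \times (240 \times 240 \times 32) \approx 5.9 \times 10^{8} \text{ bytes} \approx 590 \text{ MB}. \end{aligned}$$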
As analyzed in the previous section, the numerical errors in this architecture are dominated by Batch Normalization and Leaky ReLU layers. Following equation 19, the numerical error associated with the activations at a given layer i in this architecture can thus be approximated by:
$$\begin{aligned} | \epsilon _i |^2 = \frac{| \epsilon _y |^2 \times | z_i |^2}{| y |^2} \times \prod _{j=1}^{N} \frac{1}{\alpha _{LReLU} \times \alpha _{BN}}, \end{aligned}$$
(27)
in which N represents the number of Batch Normalization and Leaky ReLU layers between the layer i and the output.
Hybrid architecture
In the previous subsections, we saw that layer-wise activation and normalization layers degrade the signal-to-noise ratio of the reconstructed activations. In "Impact of numerical stability", we will quantify the accumulation of numerical errors through long chains of layer-wise invertible operations and show that these errors negatively impact model accuracy.
To prevent these numerical instabilities, we introduce a hybrid architecture, illustrated in Fig. 8, combining reversible residual blocks with layer-wise invertible functions. Conceptually, the block-level reversible structure reconstructs the input activation of each residual block with minimal error, while the layer-wise invertible layers efficiently recompute the hidden activations within each reversible block as the gradient propagates backward, which circumvents the local memory bottleneck of the reversible module.
The backward pass through these hybrid reversible blocks is illustrated in Fig. 9 and proceeds as follows: first, the input x is recomputed from the output y through the analytical inverse of the reversible block, without storing the hidden activation values of its sub-modules. Second, the gradients of the activations are propagated backward through the modules of the reversible block. As each layer within these modules is invertible, the hidden activation values are recomputed using the layer-wise inverses alongside the gradient.
The analytical inverse of the residual-level reversible blocks is used to propagate hidden activations with minimal reconstruction error to the preceding blocks, while layer-wise inversion alleviates the local bottleneck of the reversible block by computing the hidden activation values together with the backward flow of the gradients. As layer-wise inverses are only used for hidden feature computations within the scope of a reversible block, and reversible blocks are made of relatively short chains of operations, numerical errors do not accumulate to a damaging degree.
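A simplified sketch of the layer-wise part of this backward pass is given below, for one sub-module made of layer-wise invertible layers. We assume each layer exposes a forward and an inverse method, and that any statistics needed by the inverse (e.g., for the batch normalization) are kept by the layer itself; the function name is ours.

```python
import torch

def module_backward_layerwise(layers, h_out, grad_out):
    """Backward pass through a chain of layer-wise invertible layers: hidden
    activations are rebuilt from the module output via the layer inverses at
    the same time as the gradient is propagated, so none of them need to be
    stored during the forward pass."""
    for layer in reversed(layers):
        with torch.no_grad():
            h_in = layer.inverse(h_out)   # rebuild this layer's input
        h_in = h_in.detach().requires_grad_(True)
        out = layer(h_in)                 # local forward for this layer only
        out.backward(grad_out)            # weight gradients + gradient w.r.t. h_in
        grad_out, h_out = h_in.grad, h_in.detach()
    return h_out, grad_out                # module input and its gradient
```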
The peak memory consumption of our proposed architecture, as illustrated in Fig. 8 and parameterized in Table 1, can be computed as:
$$\begin{aligned} \mathcal {M}&= M_{\theta } + M_{z} + M_{g} \end{aligned}$$
(28a)
$$\begin{aligned}&= M_{\theta } + (M_z^{\prime} + M_{g}^{\prime}) \times (h \times w \times bs) \end{aligned}$$
(28b)
$$\begin{aligned}&= 14.8 \times 10^6 + 352 \times (h \times w \times bs). \end{aligned}$$
(28c)
Training an iteration over a batch of 32 images of resolution \(240 \times 240\) would require \(\mathcal {M}=648\)MB of VRAM.
It should be noted, however, that this architecture adds an extra computational cost as both the reversible block inverse and layer-wise inverse need to be computed. Hence, instead of one additional forward pass, as in the RevNet and layer-wise architectures, our hybrid architecture comes with a computational cost equivalent to performing two additional forward passes during the backward pass.
Following equation 19, the numerical error associated with the activations at a given layer i in this architecture is given by:
$$\begin{aligned} | \epsilon _i |^2 = \frac{| \epsilon _y |^2 \times | z_i |^2}{| y |^2} \times \prod _{k=1}^{K} \frac{1}{\alpha _{Rev}}, \end{aligned}$$
(29)
in which K represents the number of reversible blocks between the layer i and the output.
Comparing this equation to equation 27, the stability of this architecture is due to the following two factors: first, the number of reversible blocks K is typically two to three times smaller than the number of layers N, as a reversible block is typically made of several convolutions: \(K<N\). Second, and most importantly, the SNR degradation is much smaller in reversible blocks than in Batch Normalization and Leaky ReLU layers: \(\alpha _{BN} \ll \alpha _{Rev}\) and \(\alpha _{LReLU} \ll \alpha _{Rev}\).