The generalization of artificial neural models refers to their ability to perform well on new, previously unseen data that come from the same distribution as the data used to train the model. It means transferring the knowledge acquired in the learning process to a new situation, represented by previously unseen test data, and thus combining the new experience with previous experiences that are similar in one or more respects.
Neural networks learn from examples of patterns that form the training database. In the learning phase, a network adapts its structure and parameters to respond properly to the input signals. From a statistical point of view, this corresponds to capturing the mechanism by which the learning data have been generated [1,2,3]. Such a mechanism can be significantly distorted by noise that interferes with the data. Therefore, it is useful to reduce the noise level and recover the denoised image. Such an approach, using information from multiple views via neural networks for image retrieval, has been presented, for example, in [4, 5].
However, the biggest problem is that the network may not be complex enough to learn the mechanism of data generation properly, or the population of learning data may be too scarce to represent the modeled process sufficiently well. The most important obstacle to good generalization of neural networks, especially deep structures, is therefore the limited size of the learning resources.
According to the theory of Vapnik and Chervonenkis [6], the population of learning samples should be sufficiently large relative to the number of fitted parameters to produce a well-generalizing neural model. In many cases, especially in deep learning, this condition is very difficult to satisfy [1]. Therefore, the test performance of a network may vary from one test set to another. To obtain the most objective measure of the generalization ability of the network, many repetitions of the learning and testing phases with different data are used, usually organized in K-fold cross-validation mode [7].
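As a reference for how this procedure is typically organized, a minimal K-fold loop is sketched below; the estimator constructor build_and_train and the arrays X, y are hypothetical placeholders, not the experimental code used in this work.

```python
# Illustrative K-fold cross-validation loop (hypothetical model and data).
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, y, build_and_train, K=5, seed=0):
    """Return the test score of each fold; build_and_train(X_tr, y_tr) is assumed
    to return a fitted model exposing a score(X_te, y_te) method."""
    kf = KFold(n_splits=K, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kf.split(X):
        model = build_and_train(X[train_idx], y[train_idx])   # learning phase
        scores.append(model.score(X[test_idx], y[test_idx]))  # testing phase
    return np.array(scores)  # mean and std give a more objective estimate
```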
The generalization ability strongly depends on the relation between the size of the learning data and the complexity of the network architecture. The higher this ratio, the higher the probability that the network will perform well on data not taking part in learning.
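This dependence is made precise by the classical Vapnik–Chervonenkis bound, quoted here in its standard textbook form only as background (N denotes the number of learning samples, h the VC dimension of the model, and η the confidence parameter): with probability at least 1 - η,

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  + \sqrt{\frac{h\left(\ln\frac{2N}{h}+1\right)-\ln\frac{\eta}{4}}{N}}
```

so the bound on the generalization gap shrinks as the ratio of N to h grows.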
Many different techniques have been elaborated to improve the generalization ability of deep neural networks [8,9,10]. One of them is increasing the population of learning samples through data augmentation. Augmentation artificially expands the size of a training dataset by creating modified versions of the data it contains. Different methods have been proposed: flips, translations, rotations, scaling, cropping, adding noise, non-negative matrix factorization, creating synthetic images using self-similarity, application of generative adversarial networks (GANs) or variational autoencoders, etc. [11,12,13,14,15,16]. However, in deep structures where the number of parameters is very high (millions of parameters), such techniques have limited efficiency.
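As an illustration of a typical augmentation pipeline (not the specific transforms used in this work), random geometric and photometric modifications can be composed with the torchvision library as follows:

```python
# Illustrative image-augmentation pipeline (assumed torchvision-based, not the
# exact transforms applied in this study).
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                # flips
    transforms.RandomRotation(degrees=15),                 # rotations
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # scaling + cropping
    transforms.ColorJitter(brightness=0.1, contrast=0.1),  # mild photometric change
    transforms.ToTensor(),
])
# Applying `augment` to each training image yields a different modified version
# in every epoch, artificially expanding the training set.
```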
A good way to increase generalization is regularization of the architecture, implemented both by modifying the structure and by using different methods of learning. It has been shown that explicit forms of regularization, such as weight decay, dropout, and even data augmentation, do not adequately explain the generalization ability of deep networks [17, 18]. Empirical observations have shown that explicit regularization may improve the generalization performance of the network, but it is neither necessary nor by itself sufficient for controlling the generalization error.
An important role is played by the implicit regularization built into the learning algorithms. For example, stochastic gradient descent converges to a solution with a small norm, which can be interpreted as implicit regularization. A similar role is played by early stopping and batch normalization in the learning procedure [19].
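Both kinds of regularization mentioned above can be indicated schematically in PyTorch; the layer sizes and hyperparameter values below are arbitrary illustrations, not the settings used in our experiments.

```python
# Schematic sketch of explicit (dropout, weight decay) and implicit
# (batch normalization, SGD, early stopping) regularization; all sizes and
# hyperparameters are illustrative only.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),              # batch normalization: implicit regularization
    nn.ReLU(),
    nn.Dropout(p=0.25),              # dropout: explicit regularization
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 2),                # binary classification head
)

# Weight decay adds an explicit penalty to the SGD update, while SGD itself
# acts as an implicit regularizer by favoring small-norm solutions.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=1e-4)

# Early stopping: monitor the validation loss during training and stop when it
# no longer improves for a fixed number of epochs (loop not shown here).
```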
An important method for increasing generalization capability is the modification of network structures. It is especially popular when forming an ensemble of networks [15]. Different, independent team members, each looking at the modeled process from a different point of view, form a so-called expert system, which makes it possible to generate a more objective decision.
Specific approaches have been proposed to increase the independence of the ensemble members. These include random selection of the learning data used to train particular units of the ensemble, application of randomly created mini-batches in the parameter adaptation process, diversification of the dropout ratio applied to the learning data, etc. Such techniques produce ensemble members that differ in operation, in the hope of obtaining a more accurate classification of the test data that did not participate in the learning phase [9, 10]. All of these approaches, i.e., explicit regularization, data augmentation, and modification of network structures, are usually combined to develop a better-generalizing system.
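A bagging-style realization of such a randomized ensemble can be sketched as follows; make_model and train are hypothetical placeholders, and only the random selection of learning data and the majority voting are the point of the example.

```python
# Illustrative ensemble built on randomly drawn training subsets (bagging-like);
# make_model() and train() are hypothetical placeholders, and class labels are
# assumed to be non-negative integers.
import numpy as np

def build_ensemble(X, y, make_model, train, n_members=5, subset_ratio=0.8, seed=0):
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        # Each member sees a different random subset of the learning data,
        # which increases the independence of the ensemble units.
        idx = rng.choice(len(X), size=int(subset_ratio * len(X)), replace=False)
        members.append(train(make_model(), X[idx], y[idx]))
    return members

def ensemble_predict(members, X):
    # Simple majority voting over the member decisions.
    votes = np.stack([m.predict(X) for m in members])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```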
In our work, we take implicit regularization of the deep structure a step further. Our approach combines the ensemble principle with random integration of the results at each level of signal processing. Two parallel structures are created and learned simultaneously. Their integration is based on introducing randomness in the formation of the subsequent layers of the CNN in both architectures. We show that such a method improves the generalization ability when the size of the learning data is limited.
At each stage of forming the final structure, we create two parallel layers that perform the same task. Both have a similar form (the same number of filters, kernel size, and padding parameters), but differ in parameter values and in the type of nonlinear activation function (here ReLU and softplus). When the final structure of the network is formed, only one of these two layers is chosen, and this choice is completely random. The random selection occurs at each level of signal processing, up to the final softmax classification layer.
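A simplified PyTorch sketch of this mechanism is given below; the filter counts, kernel sizes, and two-stage depth are placeholders rather than the configuration used in our experiments, and the simultaneous training of the two parallel branches is omitted.

```python
# Simplified sketch of one randomized stage: two parallel branches of identical
# shape (same filters, kernel size, padding) but different parameters and
# activation (ReLU vs. softplus), from which one is chosen completely at random
# when the final structure is formed.  All sizes are illustrative placeholders.
import random
import torch.nn as nn

class RandomizedStage(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        self.branch_relu = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size, padding=padding),
            nn.ReLU())
        self.branch_softplus = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size, padding=padding),
            nn.Softplus())
        self.selected = None              # fixed when the final structure is formed

    def select_branch(self):
        # Completely random choice of one of the two parallel layers.
        self.selected = random.choice([self.branch_relu, self.branch_softplus])
        return self.selected

    def forward(self, x):
        return self.selected(x)

# Assembling the final network: the random selection is repeated at every level
# of signal processing, up to the softmax classification layer.
stages = [RandomizedStage(3, 16), RandomizedStage(16, 32)]
final_net = nn.Sequential(*[s.select_branch() for s in stages],
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(32, 2), nn.Softmax(dim=1))
```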
The idea of this approach follows from the behavior of gradient methods in optimization when applied to problems with many local minima (a typical case in deep learning). A structure with fixed parameters tends toward the closest local solution, which is not necessarily the best one. Introducing a random choice at the level of each layer allows a wider range of possible solutions to be explored and a better result to be found.
Numerical experiments performed on medical data representing melanoma and non-melanoma cases have confirmed the superiority of this approach over a standard one that relies on the same type of activation function at each step of signal processing.