Joint multi-domain feature learning for image steganalysis based on CNN

In recent years, researchers have made great progress in steganalysis based on convolutional neural networks (CNNs). However, existing work largely ignores the contribution of nonlinear residuals and joint domain detection to steganalysis, and detecting adaptive steganographic algorithms at low embedding rates remains challenging. In this paper, we propose a CNN steganalysis model that uses a joint domain detection mechanism and a nonlinear detection mechanism. For the nonlinear detection mechanism, based on the spatial rich model (SRM), we introduce the maximum and minimum nonlinear residual feature acquisition method into the model to adapt to the nonlinear distribution of steganographic information. For the joint domain detection mechanism, we apply not only the high-pass filters from the SRM to obtain spatial residuals but also the patterns from the discrete cosine transform residual (DCTR) to capture transformation domain effects, so as to fully capture the traces that spatial steganography leaves in the transformation domain. We also apply a new transfer learning method to improve the model's performance: we initialize the model with low embedding rate steganography samples, on the assumption that this makes the network more sensitive than initializing it with high embedding rate samples. The simulation results confirm this assumption. With these improvements combined, the detection accuracy of the model for WOW and S-UNIWARD is higher than that of SRM+EC, Ye-Net, Xu-Net, Yedroudj-Net, and Zhu-Net, and about 4∼6% higher than that of the best of these, Zhu-Net. The results can provide a reference for steganalysis and image forensics tasks.

(2020) 2020: 28 Page 2 of 12 and UERD [9]. In the spatial domain, steganography algorithms are characterized by directly changing pixel values. Typical algorithms are least significant bit (LSB) [10,11], LSB matching [12], and pixel value differencing (PVD) [13]. There are also some steganography algorithms in the compression domain [14]. The algorithms above can be regarded as non-adaptive steganography algorithms. Compared with non-adaptive steganography algorithms, adaptive steganography algorithms have been shown to perform better. At present, the popular adaptive algorithms are edge adaptive image steganography (EA) [15], HUGO [16], HILL [17], MiPOD [18], and S-UNIWARD [6]. Because the security of steganographic algorithms keeps increasing [19][20][21], attempts to detect such data hiding methods face growing challenges. Traditional steganalysis methods are usually based on manually designed features. The most commonly adopted features include the gray level co-occurrence matrix (GLCM) [22], local binary patterns (LBP) [23], and the Gaussian Markov random field. In addition, some perceptual hashing techniques can be applied to steganalysis tasks. Perceptual hashing [24,25] can be used to extract information closely related to human perception of image visual quality, so these perceptual description models can also detect the tampering traces of steganography. Since the spatial rich model (SRM) [26] was proposed, much research has focused on it, and many improved algorithms have been proposed. Although these algorithms improve performance, they fail to solve the key shortcoming of the feature extraction method: traditional feature extraction relies on manually designed features, the design process depends on expert experience, and heuristic methods are usually applied.
This means that steganalysis algorithms of this kind struggle to meet the challenge posed by the rapid development of steganography algorithms.
Deep learning can effectively solve the problems caused by manual feature design and is widely used in image perception [27,28] and steganalysis. A deep network can automatically recognize and extract features, which frees steganalysis from its dependence on expert experience. The development of graphics processing units (GPUs) and parallel computing has accelerated this process. In 2014, Tan and Li [29] proposed the first steganalysis model that applied deep learning techniques. In 2015, Qian et al. [30] proposed the first convolutional neural network (CNN) model using supervised learning, whose steganalysis performance surpasses SRM. In 2016, Xu et al. [31] proposed a CNN model similar to Qian's model, the so-called Xu-Net. The difference is that Xu-Net employs an absolute value layer (ABS) and 1×1 convolution kernels. Qian et al. [32] then introduced transfer learning to improve steganalysis performance. The above models capture only spatial steganography features, so they are used to detect spatial steganography algorithms. Only in 2017 did results on detecting transformation domain steganography gradually appear. Zeng et al. [33,34] proposed a JPEG-based steganalysis model. Xu et al. [35], inspired by ResNet [36], proposed a new CNN steganalysis model consisting of 20 convolutional layers with batch normalization (BN). Ye et al. [37] proposed a spatial domain CNN steganalysis model and added a truncated linear unit (TLU) activation function to the preprocessing layer. The main trend in 2017 was to optimize the convolutional neural network architecture through ResNet and to draw on the feature extraction method of SRM. In 2018, Yedroudj et al. [38] proposed a spatial domain CNN steganalysis model consisting of five convolutional layers.
In addition to the traditional image dataset BOSSBase [39], they added the BOWS2 [40] image dataset. Tsang et al. [41] improved Ye-Net so that the model can perform steganalysis on high-resolution images. Zhang et al. [42] proposed a new CNN steganalysis model that uses depthwise separable convolution and spatial pyramid pooling (SPP) to obtain the channel correlation and adapt to images of different sizes. Deep steganalysis has made remarkable progress, but there is still much room for improvement. Existing deep steganalysis models work in a single domain: only spatial features are captured when detecting spatial steganography, and only transformation domain features are captured when detecting transformation domain steganography. However, steganography in the transformation domain also disturbs the spatial characteristics of the image, and vice versa. Joint domain detection can therefore better capture the traces of steganography.
In this paper, we propose a novel spatial domain steganalysis model called Wang-Net. It has the following characteristics:
(1) A joint domain detection concept is proposed. Joint domain detection captures steganography features in both the spatial and the transformation domain to complete the steganography detection task. At present, the typical spatial and transformation domain steganalysis models are Zhu-Net and Xu-Net, which extract only single domain features. However, steganography in one domain affects the other domains, so a joint domain detection method can capture more comprehensive steganography information. We simulate the SRM and discrete cosine transform residual (DCTR) feature extraction methods to detect steganography features in both the spatial and the transformation domain.
(2) A nonlinear feature detection mechanism is introduced. The nonlinear detection mechanism captures steganographic features through a nonlinear transformation. At present, the well-known Ye-Net and Zhu-Net both simulate the linear feature extraction method of SRM to complete the steganalysis task. However, the embedding of steganography information is nonlinear, so a nonlinear detection mechanism is necessary. We simulate the nonlinear feature extraction of SRM to design and implement the nonlinear detection mechanism.
(3) A new transfer learning method is applied. In Zhu-Net and other steganalysis models that use transfer learning, the authors initialize the model with high embedding rate samples to address the difficulty of getting the model to converge on low embedding rate samples during training. Compared with high embedding rate samples, low embedding rate samples carry less steganography information under the same steganography mode. In the transfer learning method of this paper, low embedding rate samples are used to initialize the model in order to enhance the sensitivity of the model to steganography information.

SRM
The feature extraction method of SRM steganalysis is shown in Fig. 1. First, the residual map sub-models are obtained with high-pass filters (HPFs); then the fourth-order co-occurrence matrix of each residual map sub-model is extracted after quantization, rounding, and truncation. Finally, the elements of these co-occurrence matrices are rearranged to form the steganalysis feature vector. SRM defines a variety of HPFs and uses them to generate the residual map sub-models. The original linear residual calculation formula is as follows:
$$R_{mn} = \mathrm{pred}(N_{mn}) - c\,I_{mn} \tag{1}$$
where $c$ is called the residual order, $m$ and $n$ represent the pixel coordinates, $N_{mn}$ is the set of pixels adjacent to pixel $I_{mn}$, $\mathrm{pred}(N_{mn})$ is the predictor of $c\,I_{mn}$, and $R_{mn}$ is the residual at pixel $I_{mn}$. Generally, the number of pixels in $N_{mn}$ is equal to $c$.
The residuals mainly comprise six types: first-order, second-order, third-order, SQUARE, EDGE3x3, and EDGE5x5. Each type is divided into linear filtering residuals and nonlinear filtering residuals. The typical residuals and high-pass filters are given in Table 1 and Eqs. (2), (3), and (4). As shown in Eq. (2), the residuals in the horizontal, vertical, diagonal, and minor-diagonal directions are denoted as $R_h$, $R_v$, $R_d$, and $R_m$. The max nonlinear filtering residual is denoted as $R_{\max}$, and the min nonlinear filtering residual is denoted as $R_{\min}$. In Eq. (3), the left side is the SQUARE3x3 high-pass filter, and the right side is the SQUARE5x5 high-pass filter. In Eq. (4), the left and right sides are the EDGE3x3 and EDGE5x5 high-pass filters, respectively.
For the linear residuals, Table 1 gives the calculation method of the first-order, second-order, and third-order linear residuals. The linear residuals of SQUARE, EDGE3x3, and EDGE5x5 simply use neighborhood pixels from more directions in the calculation. The SQUARE, EDGE3x3, and EDGE5x5 high-pass filters are shown in Eqs. (3) and (4). In fact, the linear residual calculation can be expressed as a convolution operation:
$$R_{ij} = \sum_{r,c} x^{r,c}_{i,j}\, k_{r,c}, \qquad R = I * K,$$
where $(i, j)$ is the pixel coordinate and $(r, c)$ is the index into the convolution kernel. $x^{r,c}_{i,j}$ denotes the pixel value at index $(r, c)$ of the fixed neighborhood window centered on pixel $(i, j)$. $k_{r,c}$ denotes the value at index $(r, c)$ of the convolution kernel, which has the same size as the fixed neighborhood window. $R_{ij}$ denotes the result of the convolution operation at pixel $(i, j)$. $I$ and $K$ denote the image and the convolution kernel, respectively. $R$ denotes the residual for the whole image, and $*$ denotes the convolution operation.
As shown in Eq. (2), the nonlinear residual is obtained by taking the maximum or minimum of several linear filtering residuals. Taking the first-order linear residual in Table 1 as the residual prototype, there are eight first-order linear residuals in total, one for each neighbor of the central pixel:
$$R^{(a,b)}_{ij} = I_{i+a,\,j+b} - I_{ij}, \qquad (a, b) \in \{-1, 0, 1\}^2 \setminus \{(0, 0)\}.$$
The first-order nonlinear residuals are then
$$R_{\max} = \max_{(a,b)} R^{(a,b)}, \qquad R_{\min} = \min_{(a,b)} R^{(a,b)},$$
where the maximum and minimum are taken at each pixel. The nonlinear residuals combine the statistical characteristics of linear residuals of the same kind, and thus reflect the changes that steganography causes in adjacent pixels of the image.
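The relation between linear residuals, convolution, and the max/min nonlinear residuals can be illustrated with a minimal NumPy sketch. It assumes the eight first-order kernels are 3×3 "neighbor minus center" filters and that the kernel is applied as a sliding-window product without flipping (as CNN layers do); the function names are hypothetical and this is not the authors' implementation.

```python
import numpy as np

def correlate_valid(image, kernel):
    """Slide `kernel` over `image` without padding and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for r in range(kh):
        for c in range(kw):
            out += kernel[r, c] * image[r:r + out_h, c:c + out_w]
    return out

def first_order_kernels():
    """Eight 3x3 kernels (assumed layout), each computing neighbor minus center."""
    kernels = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) == (0, 0):
                continue
            k = np.zeros((3, 3))
            k[1, 1] = -1.0           # minus the center pixel (residual order c = 1)
            k[1 + dr, 1 + dc] = 1.0  # plus the neighbor acting as the predictor
            kernels.append(k)
    return kernels

def first_order_residuals(image):
    """Return the 8 linear residual maps and their pixel-wise max and min."""
    image = np.asarray(image, dtype=np.float64)
    linear = np.stack([correlate_valid(image, k) for k in first_order_kernels()])
    return linear, linear.max(axis=0), linear.min(axis=0)

demo = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])
linear, r_max, r_min = first_order_residuals(demo)
```

For the 3×3 demo image there is a single valid position (the center pixel 5), so each of the eight maps holds one value and the max and min maps pick the largest and smallest neighbor differences.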

DCTR
In the transformation domain, DCTR [43] is a general steganalysis algorithm. Its feature processing steps are as follows:
(1) Compute the 64 8×8 DCT basis patterns, then obtain the feature maps by convolving the decompressed JPEG image with the DCT basis patterns.
(2) Obtain the sub-feature maps by quantizing and truncating the raw feature maps.
(3) Compress the sub-feature maps into an 8000-dimensional feature vector.
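The DCT basis patterns are $B^{(i,j)} = \big(B^{(i,j)}_{mn}\big)$, $0 \le m, n \le 7$, with
$$B^{(i,j)}_{mn} = \frac{w_i w_j}{4} \cos\frac{\pi i (2m+1)}{16} \cos\frac{\pi j (2n+1)}{16},$$
where $w_0 = 1/\sqrt{2}$, $w_i = 1$ for $i > 0$, $(i, j)$ is the frequency index, and $(m, n)$ is the pixel coordinates. The DCT is defined as the convolution of the image with the 64 DCT basis patterns $B^{(i,j)}$. To simplify the description, we set the length and width of all images to a multiple of 8. Given a grayscale image $I \in \mathbb{R}^{M \times N}$ of size $M \times N$ ($M$ and $N$ are multiples of 8):
$$U^{(i,j)} = I * B^{(i,j)},$$
where $U^{(i,j)} \in \mathbb{R}^{(M-7) \times (N-7)}$, and $*$ denotes a non-padded convolution operation.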
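Steps (1) and (2) of the DCTR processing can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the patterns follow the standard orthonormal 8×8 DCT-II basis, the patterns are applied as unflipped sliding windows (a strict convolution would flip each pattern), and the quantization step `q` and truncation threshold `T` are hypothetical defaults rather than values taken from this paper. Step (3), the histogram compression into 8000 dimensions, is not shown.

```python
import numpy as np

def dct_basis_patterns(size=8):
    """Return patterns of shape (size, size, size, size); patterns[i, j] is B^(i,j)."""
    idx = np.arange(size)
    w = np.ones(size)
    w[0] = 1.0 / np.sqrt(2.0)
    # cos_table[i, m] = w_i * cos(pi * i * (2m + 1) / (2 * size))
    cos_table = w[:, None] * np.cos(
        np.pi * idx[:, None] * (2 * idx[None, :] + 1) / (2 * size)
    )
    # B^(i,j)_{mn} = (2 / size) * cos_table[i, m] * cos_table[j, n]; 2/size = 1/4 for size 8
    return (2.0 / size) * np.einsum("im,jn->ijmn", cos_table, cos_table)

def dctr_feature_maps(image, q=4.0, T=4):
    """Filter `image` with all 64 patterns (no padding), then quantize and truncate.

    q and T are illustrative assumptions, not values stated in the text.
    """
    image = np.asarray(image, dtype=np.float64)
    patterns = dct_basis_patterns(8).reshape(64, 8, 8)
    out_h, out_w = image.shape[0] - 7, image.shape[1] - 7
    raw = np.zeros((64, out_h, out_w))
    for r in range(8):
        for c in range(8):
            raw += patterns[:, r, c][:, None, None] * image[r:r + out_h, c:c + out_w]
    quantized = np.minimum(np.round(np.abs(raw) / q), T)
    return raw, quantized

flat_image = np.full((16, 16), 10.0)
raw, quantized = dctr_feature_maps(flat_image)
```

On a constant image only the DC pattern responds, which is a quick sanity check that the 63 remaining patterns act as high-pass filters; each raw map has size (M−7)×(N−7), here 9×9.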

The proposed method
As shown in Fig. 2, our model consists of a preprocessing layer, feature extraction layers, and a classification layer. In the preprocessing layer, we simulate the SRM feature extraction method in the spatial domain and the DCTR feature extraction method in the transformation domain, and we add the nonlinear residual feature extraction method. For the general feature extraction stage, we design eight different convolution layers, followed by a fully connected layer as the tenth layer for steganographic detection. The nonlinear feature extraction method, the joint domain detection mechanism, and the detailed design are introduced in the following sections.

Nonlinear feature extraction method
At present, the linear feature extraction methods of steganalysis models cannot fully adapt to the nonlinear embedding of steganographic information, so we design a nonlinear feature extraction method. For the linear feature extraction method, like Zhu-Net, we apply six types of SRM HPFs. All HPFs of the same type consist of a basic "spam" filter and its rotated variants, so as to capture multi-directional, comprehensive residual information in the same neighborhood. The pixel residual information captured by different types of HPFs has different statistical characteristics, and higher-order HPFs capture pixel residual information from a larger neighborhood than lower-order HPFs. The six types of HPFs are first-order, second-order, third-order, SQUARE, EDGE3x3, and EDGE5x5, with 8, 4, 8, 2, 4, and 4 filters, respectively. We obtain 30 linear residual feature maps from these 30 high-pass filters. For the nonlinear feature extraction method, we use SRM's nonlinear feature statistics method to obtain nonlinear residual feature maps from these six types of HPFs. Specifically, SQUARE is divided into SQUARE3x3 and SQUARE5x5. SQUARE3x3 and EDGE3x3 belong to the same category, so together they yield two nonlinear residual feature maps. Likewise, SQUARE5x5 and EDGE5x5 together yield two nonlinear residual feature maps. With two maps from each of the resulting five groups, we obtain a total of 10 nonlinear residual feature maps.
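The bookkeeping behind the 30 linear and 10 nonlinear feature maps can be checked with a short NumPy sketch. The filter counts and the five groups come from the text; the channel ordering, the use of same-size (padded) maps for 3×3 and 5×5 filters, and the names `CHANNELS`, `GROUPS`, and `nonlinear_maps` are assumptions made for illustration.

```python
import numpy as np

# Filter counts per HPF type, as listed in the text (8 + 4 + 8 + 2 + 4 + 4 = 30).
FILTER_COUNTS = {"first": 8, "second": 4, "third": 8,
                 "SQUARE": 2, "EDGE3x3": 4, "EDGE5x5": 4}

# Assumed channel layout of the 30 linear residual maps (ordering is hypothetical).
CHANNELS = {
    "first": range(0, 8), "second": range(8, 12), "third": range(12, 20),
    "SQUARE3x3": range(20, 21), "SQUARE5x5": range(21, 22),
    "EDGE3x3": range(22, 26), "EDGE5x5": range(26, 30),
}

# Five groups; each contributes one max map and one min map.
GROUPS = [("first",), ("second",), ("third",),
          ("SQUARE3x3", "EDGE3x3"), ("SQUARE5x5", "EDGE5x5")]

def nonlinear_maps(linear):
    """linear: (30, H, W) residual maps -> (10, H, W) per-group max and min maps."""
    out = []
    for group in GROUPS:
        idx = [c for name in group for c in CHANNELS[name]]
        out.append(linear[idx].max(axis=0))  # pixel-wise maximum over the group
        out.append(linear[idx].min(axis=0))  # pixel-wise minimum over the group
    return np.stack(out)

rng = np.random.default_rng(0)
linear = rng.normal(size=(30, 4, 4))  # stand-in for real residual maps
nonlinear = nonlinear_maps(linear)
```

The sketch assumes all 30 linear maps share the same spatial size, which in practice requires padding the 3×3 and 5×5 filter outputs consistently.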
After simulating the linear and nonlinear feature extraction methods, we design two networks, called the single linear residual feature net (Linear Kernel-Net) and nonlinear residual feature net (Non-linear Kernel-Net), and carry out the steganography detection.
According to Table 2 [...]. Therefore, the adjacent pixel changes in the image caused by steganography are not comprehensively reflected; that is, the advantage of the nonlinear residual features is not fully exploited. However, the accuracy of Non-linear Kernel-Net is higher than that of the CNN steganalysis network Ye-Net, indicating that Non-linear Kernel-Net remains competitive and can enhance the feature representation. Therefore, we add both linear and nonlinear residual features to a network named All Kernel-Net and repeat the steganalysis.
According to Table 3, for WOW (0.2 bpp), WOW (0.4 bpp), S-UNIWARD (0.2 bpp), and S-UNIWARD (0.4 bpp), the accuracy of All Kernel-Net is 0.714, 0.844, 0.669, and 0.792, respectively, which is about 0.4∼6% higher than that of Linear Kernel-Net and Non-linear Kernel-Net. All Kernel-Net combines the advantages of linear and nonlinear residual features and achieves strong steganalysis performance.

Joint domain detection mechanism
At present, Zhu-Net and other steganalysis models capture steganographic features from a single domain only, without considering the impact of steganography on other domains, so the features they capture lack diversity. Therefore, we propose a joint domain detection mechanism based on All Kernel-Net and simulate the DCTR feature extraction method in the transformation domain. Small convolution kernels effectively reduce the number of parameters, and matrix operations take full advantage of parallel computing. Therefore, following Zhu-Net, we design the convolution kernel as a matrix with 94 channels and a size of 5×5, which is initialized with DCT patterns and HPFs. At this point, the calculation formula for the new DCT patterns is as follows: [...] where [...]. In this way, we add the matrix initialized by DCT patterns and HPFs to the preprocessing layer. The resulting network, called Wang-Net, combines the advantages of linear and nonlinear feature extraction in the spatial and transformation domains. We then carry out the steganalysis simulation.
According to the results in Table 4, the steganalysis accuracy of Wang-Net for WOW (0.2), WOW (0.4), S-UNIWARD (0.2), and S-UNIWARD (0.4) is 0.749, 0.860, 0.691, and 0.819, respectively, which is about 2∼3% higher than that of All Kernel-Net. This shows that the joint domain detection mechanism drives the model to learn richer and more comprehensive steganography features and to achieve better steganography detection performance.

Detailed design in network architecture
Our network receives an image of size 256×256 and outputs one of two class labels. Wang-Net consists of 10 network layers: a preprocessing layer, eight convolutional layers for feature extraction, and a fully connected layer for classification. For the preprocessing layer, we apply a convolution kernel with 94 channels and a size of 5×5, which is initialized by the SRM filters and DCT patterns. For feature extraction, 3×3 convolution kernels are applied in layers 2, 3, 4, 8, and 9, and 5×5 convolution kernels are applied in layers 5, 6, and 7. In each convolution layer, we add BN and the rectified linear unit (ReLU) and TLU nonlinear activation functions. We also add average pooling to layers 4, 5, and 6.
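The layer layout described above can be written down as a small configuration table, together with the two activation functions. This is a sketch only: the TLU is taken to be clipping to [−T, T] with a hypothetical threshold `threshold=3.0`, pooling is read as applying to network layers 4, 5, and 6, and details the text does not give (channel widths, strides, which layers use TLU versus ReLU) are deliberately left out.

```python
import numpy as np

def tlu(x, threshold=3.0):
    """Truncated linear unit: identity inside [-threshold, threshold], clipped outside.

    The threshold value is an illustrative assumption, not taken from the paper.
    """
    return np.clip(x, -threshold, threshold)

def relu(x):
    """Rectified linear unit."""
    return np.maximum(x, 0.0)

# Layer 1: preprocessing with a 94-channel 5x5 kernel (SRM filters + DCT patterns).
LAYERS = [{"layer": 1, "role": "preprocessing", "kernel": 5, "channels": 94}]

# Layers 2-9: feature extraction; kernel sizes and pooling follow the text.
for n in range(2, 10):
    LAYERS.append({
        "layer": n,
        "role": "feature extraction",
        "kernel": 3 if n in (2, 3, 4, 8, 9) else 5,
        "batch_norm": True,
        "avg_pool": n in (4, 5, 6),
    })

# Layer 10: fully connected classifier with two output labels.
LAYERS.append({"layer": 10, "role": "classification", "outputs": 2})

sample = np.array([-5.0, -1.0, 0.0, 2.0, 7.0])
clipped = tlu(sample)
rectified = relu(sample)
```

Keeping the specification as plain data makes it easy to check against the prose (ten layers, five 3×3 layers, three 5×5 layers, three pooled layers) before building it in any particular deep learning framework.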

Simulation configuration
We apply two well-known content-adaptive steganography algorithms, WOW and S-UNIWARD, to evaluate the performance of the CNN models. We use a random embedding key when applying the steganography algorithms, which is in line with practical steganography. The dataset used in the simulations is BOSSBase 1.01, which contains 10,000 natural grayscale cover images of size 512×512 taken directly from cameras. The images have varied texture features and are widely used in steganalysis. Due to the limitations of GPU computing resources, like Zhu-Net, we scale the BOSSBase 1.01 images to 256×256 (using "imresize()" in MATLAB with default parameters). We then apply the steganography algorithms to the cover images to generate 10,000 corresponding stego images. To prevent overfitting of our complex model, which has a strong learning ability, we allocate as much data as possible to training. Therefore, the ratio of the training, validation, and test datasets is 8:1:1. We train the network with AdaDelta [44], which accelerates convergence. Due to GPU memory limitations, we set the mini-batch size to 16. We apply exponential learning rate decay with a decay rate of 0.95, a decay step of 2000, and an initial learning rate of 0.4. We also apply Xavier initialization [45] to the weights and biases in all convolution layers.
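The data split and learning rate schedule stated above can be made concrete with a small sketch. It assumes the 8:1:1 ratio is applied to the 10,000 cover/stego pairs and that the decay is applied in discrete steps every 2000 iterations (`staircase=True`); the text does not say whether the decay is stepped or continuous, so both variants are provided. The function names are hypothetical.

```python
def learning_rate(step, initial=0.4, decay_rate=0.95, decay_steps=2000, staircase=True):
    """Exponentially decayed learning rate at a given training step."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial * decay_rate ** exponent

def split_counts(total=10_000, ratio=(8, 1, 1)):
    """Number of cover/stego pairs for training, validation, and test."""
    unit = total // sum(ratio)
    return tuple(unit * r for r in ratio)

train_pairs, val_pairs, test_pairs = split_counts()
lr_start = learning_rate(0)
lr_after_one_decay = learning_rate(2000)
```

With a mini-batch size of 16, one decay step of 2000 iterations corresponds to 32,000 training images seen between learning rate reductions.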

Results and discussions
In this section, we compare Wang-Net with existing spatial domain steganalysis models: SRM+EC, Xu-Net, Ye-Net, Yedroudj-Net, and Zhu-Net. We then apply the new transfer learning method to the model in order to improve the generalization of its steganography detection. The comparison results are shown in Table 5. Wang-Net performs significantly better than the traditional steganalysis algorithm SRM+EC and the deep learning steganalysis algorithms Xu-Net, Ye-Net, and Yedroudj-Net, but its detection accuracy is about 2∼3% lower than that of Zhu-Net. This shows that Wang-Net has good steganalysis performance. To improve the generalization ability of the model, inspired by transfer learning, we propose a novel transfer learning method: we train on datasets with lower embedding rates and compare the performance again.
According to the results in Table 6, for WOW (0.2), WOW (0.4), S-UNIWARD (0.2), and S-UNIWARD (0.4), the accuracy of Wang-Net is 0.812, 0.920, 0.777, and 0.888, respectively, which surpasses that of Zhu-Net. This shows that Wang-Net can capture key steganographic traces at multiple embedding rates and has a good ability to express features. With this method, our CNN model achieves the best detection performance among the compared models.
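The accuracy figures quoted in the text for All Kernel-Net (Table 3), Wang-Net (Table 4), and Wang-Net after transfer learning (Table 6) can be compared directly. The short sketch below only redoes that arithmetic in percentage points; the dictionary keys are labels chosen here for illustration, and no values beyond those quoted in the text are used.

```python
SETTINGS = ["WOW 0.2", "WOW 0.4", "S-UNIWARD 0.2", "S-UNIWARD 0.4"]

# Accuracies as quoted in the text for Tables 3, 4, and 6.
ACCURACY = {
    "All Kernel-Net": [0.714, 0.844, 0.669, 0.792],
    "Wang-Net": [0.749, 0.860, 0.691, 0.819],
    "Wang-Net + transfer": [0.812, 0.920, 0.777, 0.888],
}

def gain_points(before, after):
    """Accuracy gain in percentage points for each setting."""
    return [round(100 * (a - b), 1)
            for b, a in zip(ACCURACY[before], ACCURACY[after])]

joint_gain = gain_points("All Kernel-Net", "Wang-Net")
transfer_gain = gain_points("Wang-Net", "Wang-Net + transfer")

for name, j, t in zip(SETTINGS, joint_gain, transfer_gain):
    print(f"{name}: joint domain +{j} points, transfer learning +{t} points")
```

Under this reading, the joint domain mechanism adds between 1.6 and 3.5 points depending on the setting, and the transfer learning step adds between 6.0 and 8.6 points on top of that.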
To sum up, after applying the nonlinear detection mechanism, the joint domain detection mechanism, and the new transfer learning method, Wang-Net captures richer and more diverse steganography information and achieves better steganography detection performance.

Conclusion
Applying CNNs is of great significance in the field of steganalysis. In this paper, we propose a CNN steganalysis model with three main advantages.
(1) We propose a nonlinear feature detection mechanism that simulates the nonlinear feature extraction method of SRM. For WOW and S-UNIWARD, the accuracy of the model is 0.3∼6% higher than that of the basic model. This shows that the nonlinear detection mechanism drives the model to adapt to the nonlinear distribution of steganography.
(2) We introduce a joint domain detection mechanism that simulates the manual feature extraction methods of SRM in the spatial domain and DCTR in the transformation domain. For WOW and S-UNIWARD, the accuracy of the model increases by 2∼3%. This shows that the joint domain detection mechanism helps the model capture richer steganography features.
(3) We propose a transfer learning method that uses low embedding rate images to initialize the model. For WOW (0.2), WOW (0.4), S-UNIWARD (0.2), and S-UNIWARD (0.4), the accuracy of Wang-Net is 0.812, 0.920, 0.777, and 0.888, respectively, which is higher than that of Zhu-Net and the other steganalysis models. This shows that Wang-Net can capture steganography features at more levels, which benefits feature expression.
Our model is not yet suitable for steganalysis of color images. This issue will be addressed in further research.