Models
Here we describe our proposed model architecture for improved performance on printed and scanned images. We compare our model’s performance with the model proposed in [9], the inspiration for the constrained convolutional layer. We additionally compare our model with XceptionNet (Xception) [34], as it and our proposed model have a nearly identical number of parameters and similar architectures, so differences in performance cannot be attributed to increased network capacity.
Proposed model
Our proposed architecture consists of one constrained convolutional layer [9], one standard convolutional layer, 34 separable convolutional layers, 5 pooling layers (4 max pooling, 1 global average pooling), and a final fully connected layer (see Fig. 1). Each convolutional layer was followed by ReLU activation, and max pooling was performed with a stride of 2 × 2.
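To make the layer composition concrete, a minimal sketch of how such an architecture could be assembled in tf.keras is shown below; the filter counts, the grouping of separable convolutions between pooling layers, and the layer names are illustrative assumptions rather than the exact configuration.

```python
# Illustrative tf.keras sketch of the proposed architecture.
# Filter counts and the grouping of the 34 separable convolutions
# around the 4 max-pooling layers are assumptions for illustration only.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_proposed_model(input_shape=(299, 299, 3), num_classes=6, filters=64):
    inputs = tf.keras.Input(shape=input_shape)

    # Constrained convolutional layer (the constraint itself is enforced
    # during training; see the weight-projection sketch later in this section).
    x = layers.Conv2D(3, kernel_size=5, padding="same", activation="relu",
                      name="constrained_conv")(inputs)

    # One ordinary convolutional layer.
    x = layers.Conv2D(filters, kernel_size=3, padding="same", activation="relu")(x)

    # 34 separable convolutional layers interleaved with 4 max-pooling layers.
    sep_layers_per_stage = [8, 8, 9, 9]  # 8 + 8 + 9 + 9 = 34
    for n_sep in sep_layers_per_stage:
        for _ in range(n_sep):
            x = layers.SeparableConv2D(filters, kernel_size=3, padding="same",
                                       activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=2, strides=2)(x)

    # Global average pooling followed by the final fully connected layer.
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```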
In the constrained convolutional layer, a 5 × 5 filter is employed in which the sum of all the weights is constrained to be zero [9]. In effect, the center pixel is predicted from the surrounding pixels in the filter's field, and the output of the filter can be interpreted as the prediction error, as suggested by research in steganalysis [25]. Specifically, the weights in the filter are constrained such that:
$$\begin{cases} w(0,0) = -1 \\ \sum_{(l,m) \ne (0,0)} w(l,m) = 1 \end{cases}$$
where w(l, m) refers to the filter weight at coordinates (l, m), with (0, 0) denoting the central position of the filter.
The purpose of the constrained convolutional layer is to force the model to learn image manipulation fingerprints rather than image content and higher-order features, such as those useful for object recognition and classification tasks. The prediction-error fields are then used as low-level forensic trace features by the rest of the network to classify global image manipulations.
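In practice, this constraint is typically enforced by projecting the filter weights after each training step: the central weight is reset to −1 and the remaining weights are rescaled to sum to one. The sketch below is one possible implementation for the tf.keras layer used above; the layer name and the point at which the projection is applied are our assumptions.

```python
import tensorflow as tf

def apply_bayar_constraint(conv_layer):
    """Project a Conv2D layer's kernels onto the constrained-filter set:
    central weight fixed at -1, off-centre weights rescaled to sum to 1."""
    kernel = conv_layer.kernel          # shape: (5, 5, in_channels, out_channels)
    k = kernel.shape[0] // 2            # index of the central position

    w = kernel.numpy()
    w[k, k, :, :] = 0.0                           # exclude the centre from the sum
    sums = w.sum(axis=(0, 1), keepdims=True)      # per-filter sum of off-centre weights
    w = w / sums                                  # off-centre weights now sum to 1
    w[k, k, :, :] = -1.0                          # fix the central weight at -1
    kernel.assign(w)

# This projection would typically be applied after every optimizer step,
# e.g. from a custom training loop or a Keras callback's on_train_batch_end.
```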
For the separable convolutional layers, a spatial convolution is performed independently for each channel and is followed by a point-wise (1 × 1) convolution, as proposed in [34]. This factorization decreases the number of free parameters, allowing the deep network to learn effectively even with a small training set and making it particularly appropriate for our investigation.
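As a rough illustration of the savings: a standard k × k convolution with C_in input and C_out output channels uses k·k·C_in·C_out weights, whereas a depthwise separable convolution uses k·k·C_in + C_in·C_out. The short sketch below makes the difference concrete; the channel sizes are chosen only for illustration.

```python
# Parameter counts (ignoring biases) for a standard versus a depthwise
# separable convolution; the kernel and channel sizes below are illustrative.
def standard_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # depthwise k x k filter per input channel, then a 1 x 1 point-wise convolution
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 64
print(standard_conv_params(k, c_in, c_out))   # 36864
print(separable_conv_params(k, c_in, c_out))  # 4672
```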
With this approach, we hope to leverage both the SRM-like features produced by the constrained convolutional layer and the improved generalization ability provided by the added depth and separable layers.
Bayar2016
Proposed in 2016, the constrained-convolution method of image manipulation detection, hereafter referred to as Bayar2016, uses a three-layer CNN (including the initial constrained convolutional layer) with two max-pooling layers and three fully-connected layers [9]. This model demonstrates impressive results in distinguishing between the six manipulations investigated in this paper using the data set described in the next section, achieving 99.9% validation accuracy.
Xception
In addition to the shallow Bayar2016 network, we also compare against a deeper model, since recent work has demonstrated that increasing network depth can dramatically improve model generalization. To compare with a model of similar depth to ours that also uses separable convolutional layers, we experiment with XceptionNet, a deep network comprising 42 layers, including separable convolutional layers [34]. The network design builds upon the Inception architecture [35], with the innovation of separable filters. Similar to Bayar2016, this model also achieves nearly 99% accuracy on the data set described in [9] before printing and scanning. While a variety of popular deep learning models could be appropriate for comparison, we compare with Xception due to (1) its comparable architecture and number of parameters and (2) its demonstrated image classification performance, outperforming Inception V3 in top-1 accuracy on ImageNet [34, 36].
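For reference, the publicly available tf.keras implementation of Xception can be adapted to this six-class task roughly as follows; training from scratch (weights=None) and the 299 × 299 input size are our assumptions.

```python
import tensorflow as tf

# Xception backbone adapted to the six-class manipulation-detection task.
# Random initialization is an assumption; ImageNet weights could also be used.
base = tf.keras.applications.Xception(weights=None, include_top=False,
                                      input_shape=(299, 299, 3), pooling="avg")
outputs = tf.keras.layers.Dense(6, activation="softmax")(base.output)
model = tf.keras.Model(base.input, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```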
Data sets
For accurate comparison, we follow the procedure described in [9], using images from the first IEEE IFSTC Image Forensics Challenge as described by [37]. The portion of the data set used consists of 3334 images of size 1024 × 768, which were further split into training, validation, and testing sets. The images were captured with several different digital cameras and depict both indoor and outdoor scenes.
Printing and scanning
We used three different printers and one scanner to create a data set of printed and scanned images: one Dell S3845CDN Laser Multifunction Printer, one Xerox Altalink C8070 Multifunction Printer, and one Xerox WorkCentre 7970 Multifunction Printer, which we refer to as Dell, Xerox1, and Xerox2, respectively, hereafter. We printed 50 images of each manipulation type on each printer and used the Dell scanner to scan each image (see Fig. 2). After scanning and extracting the images from the resulting PDFs, each image measured 1700 × 2200 pixels and was center-cropped to 1536 × 1792 pixels to remove the white border added by the scanning process. Each image was then split into 42 blocks of 299 × 299 pixels (or 256 × 256 pixels for Bayar2016), resulting in 2142 image blocks of each class from each printer (see Fig. 3). We limited our data creation to 900 full-page color images because of both budget constraints and environmental concerns; creating a synthetic data set through printing-and-scanning simulation may be an avenue for future work.
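A hypothetical sketch of the center-crop and tiling step is shown below; the non-overlapping stride used here is a simplification (an overlapping stride would be needed to obtain 42 blocks per page), and Bayar2016 would use 256 × 256 blocks instead.

```python
from PIL import Image

def crop_and_tile(path, crop_size=(1536, 1792), block=299):
    """Center-crop a scanned page and split it into square blocks.
    The non-overlapping stride is a simplifying assumption; the exact
    tiling/overlap used to obtain 42 blocks per page may differ."""
    img = Image.open(path)
    w, h = img.size
    cw, ch = crop_size
    left, top = (w - cw) // 2, (h - ch) // 2
    img = img.crop((left, top, left + cw, top + ch))

    blocks = []
    for y in range(0, ch - block + 1, block):
        for x in range(0, cw - block + 1, block):
            blocks.append(img.crop((x, y, x + block, y + block)))
    return blocks
```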
Manipulations
Again following the procedure described in [9], we manipulated each image with each of six manipulation types: Additive White Gaussian Noise (AWGN), Gaussian blurring (GB), JPEG compression (JPEG), median filtering (MF), re-sampling (RS) and retaining the Pristine image (PR).
- Gaussian blurring blurs the image by convolving it with a Gaussian kernel of a given size.
- JPEG compression is a lossy compression method which compresses the image by converting the color space, down-sampling, and applying the Discrete Cosine Transform (DCT).
- Median filtering replaces each pixel with the median value of the neighboring pixels within a given kernel area.
- Bilinear resampling works similarly, resizing the image using the distance-weighted average of the neighboring pixels to estimate each new pixel value.
See Table 1 for manipulation parameter details.
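For illustration, these manipulations could be applied with OpenCV and NumPy along the following lines; the kernel sizes, noise standard deviation, JPEG quality, and resampling factor shown are placeholders, and the values actually used are those listed in Table 1.

```python
import cv2
import numpy as np

# Illustrative implementations of the manipulations; all parameter values
# below are placeholders -- see Table 1 for the values actually used.

def awgn(img, sigma=2.0):
    # Additive white Gaussian noise
    noise = np.random.normal(0.0, sigma, img.shape)
    return np.clip(img.astype(np.float64) + noise, 0, 255).astype(np.uint8)

def gaussian_blur(img, ksize=5):
    return cv2.GaussianBlur(img, (ksize, ksize), 0)

def jpeg_compress(img, quality=70):
    ok, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)

def median_filter(img, ksize=5):
    return cv2.medianBlur(img, ksize)

def resample(img, scale=1.5):
    h, w = img.shape[:2]
    return cv2.resize(img, (int(w * scale), int(h * scale)),
                      interpolation=cv2.INTER_LINEAR)

# The pristine (PR) class is simply the unmodified image.
```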