Printing and scanning investigation for image counter forensics

Examining the authenticity of images has become increasingly important as manipulation tools become more accessible and advanced. Recent work has shown that while CNN-based image manipulation detectors can successfully identify manipulations, they are also vulnerable to adversarial attacks, ranging from simple double JPEG compression to advanced pixel-based perturbation. In this paper we explore another method of highly plausible attack: printing and scanning. We demonstrate the vulnerability of two state-of-the-art models to this type of attack. We also propose a new machine learning model that performs comparably to these state-of-the-art models when trained and validated on printed and scanned images. Of the three models, our proposed model outperforms the others when trained and validated on images from a single printer. To facilitate this exploration, we create a data set of over 6000 printed and scanned image blocks. Further analysis suggests that variation between images produced from different printers is significant, large enough that good validation accuracy on images from one printer does not imply similar validation accuracy on identical images from a different printer.

In pixel-based attacks, an adversary with knowledge of the CNN model in deployment can craft an "attacked" image which appears visually identical to the original image, but is mislabeled by the CNN [13]. This problem is well known in computer vision and has been at the forefront of recent work in the field. However, this type of attack demands a certain level of expertise from the adversary, and is unlikely to be employed in a majority of cases in image forensics. Even for skilled adversaries, constructing pixel-based adversarial attacks is often labor-intensive, and recent work has cast doubt on the transferability of adversarial attacks in image forensics applications [14]. While pixel-based adversarial attacks require at least some knowledge of the model, a low-level adversarial manipulation such as double JPEG compression requires no such knowledge [11]. In this type of attack, the images are simply JPEG compressed after the manipulation has been applied, hampering the model's ability to correctly identify post-processing methods such as Additive White Gaussian Noise or median filtering [11]. For this reason, building models robust to low-level, simple adversarial manipulations such as double JPEG compression, to which several manipulation detection models have been found to be vulnerable [5,11,12], is particularly important. The goal of this paper is to investigate the vulnerability of state-of-the-art models to another kind of low-level adversarial manipulation: printing and scanning. To our knowledge, this is the first investigation into adversarial attacks on digital image manipulation detection through printing and scanning. In physical forgery, repeated printing and scanning can be used to obscure manipulations or watermarks. A document may be modified, usually non-digitally, and then repeatedly printed and scanned to disguise the manipulation artifacts.
While scanning a printed document is not always related to forgery, it is reasonable to expect that state-of-the-art models be impervious to this type of post-processing, as is noted in related work on double JPEG compression [5,11]. In addition, unlike complex pixel-based adversarial attacks, simply printing and scanning an image is low-cost and requires little expertise, similar to JPEG compression.
In this paper, we limit our investigation to globally-applied manipulations, such as Gaussian Blurring (GB), Additive White Gaussian Noise (AWGN), and median filtering (MF), rather than local manipulations such as copy-move or splicing, as in related work [9]. We construct printed and scanned data sets from three different printers and experiment with two state-of-the-art models, as well as our own model. Related to our work is research on identifying camera models [15]; we additionally report results for identifying the printer model. Our main contributions include the following:
• We conduct the first analysis into the vulnerability of image manipulation detectors to printing and scanning, demonstrating that at least two state-of-the-art models are vulnerable to this type of highly plausible and inexpensive attack.
• We propose a model architecture which performs comparably to the state-of-the-art models when trained and evaluated on printed and scanned images, including performing 5% better when trained on images from a single printer.
• We conduct an in-depth analysis on the relationship between CNN-based image manipulation detectors, including training on composite data sets, and plan to share our data set of over 6000 printed and scanned images with the community to facilitate further investigation.
The rest of the paper is organized as follows. In Sect. 2, we give context and background through related work. In Sect. 3.1, we describe our model architecture, as well as those of the models we used for comparison. In Sect. 3.5, we describe the data sets used for training and validation. In Sect. 4, we explain the experiments conducted, and in Sect. 5, we discuss the results of these experiments. The paper ends in Sect. 6, where we summarize our conclusions and suggest areas of future research.

Related work
As this paper primarily investigates manipulation detectors based on convolutional neural networks (CNNs), we provide background on CNN-based manipulation detectors. Similarly, we provide context on adversarial attacks on CNNs generally as well as specifically on CNN-based image manipulation detectors. Related to this work is work on detecting manipulations through inconsistencies in lighting [16] and despite various compression qualities [17]. Additionally, [4] contributes significantly to this problem area, though without examining models that leverage deep learning. [17] explores a similar problem, but without addressing specifically the problem of printing and scanning in relation to CNN-based detectors, and is thus complementary to this work.

Deep learning for image forensics
Recent image forensics techniques leveraging deep learning have reached impressive performance. In 2015, a CNN-based classifier was proposed for detecting median filtering in images [18]. Building on this work, [9] proposed a CNN-based model with the addition of a "constrained convolutional layer", a layer constrained to learn the high-pass features of an image by attempting to predict a central pixel from its neighbors. This serves to suppress the image content while learning the manipulation fingerprint, drawing inspiration from Steganalysis Rich Model (SRM) filters in steganalysis [19]. In recent years, procedural similarities between SRM filters and learned CNN layers have been noted [20], and SRM filters have been used as a foundation for additional steganalytic and forensic methods. Accordingly, CNN architectures have been specially designed to account for SRM-like features, including methods that leverage absolute-value functions and TanH activation to learn steganalysis-relevant features [21]. Additional methods include designing networks specifically for cases in which SRM filters yield a weak signal [22], as well as methods that alter pool and stride hyperparameters in these cases [23].
Additionally, recent work has shown that third order subtractive pixel analysis matrix (S3SPAM) features can be learned by a simple shallow CNN, and can employ transfer learning to achieve good performance on little training data [24]. In addition to directly detecting manipulations, a deep learning method for analyzing the image processing history as an important component for image forensics has been proposed, as the processing history pipeline can affect the accuracy of other forensic tools [25].
The performance achieved by constrained convolutional layers and by deep networks is particularly impressive. These techniques serve as inspiration for our proposed model, and we thus compare our proposed model with models that leverage these modifications.

Adversarial attacks on CNNs
The vulnerability of CNNs to adversarial attacks has been well documented [13,26]. Adversarial noise can be designed in such a way that, when added to the image, it retains visual quality while misleading the classifier. For example, the Fast Gradient Sign Method (FGSM) [13] leverages the differentiability of the loss function, assumed to be known to the adversary. The method proposes altering each pixel based on the gradient of the loss with respect to the original pixels in the input image. These changes are small enough that the resulting image is visually nearly identical to the original, but are large enough cumulatively to increase the loss and impair the classification. Similarly, Projected Gradient Descent (PGD) [27] seeks a perturbation that maximises the loss on a specific input while keeping the perturbation size smaller than a given epsilon. DeepFool [28] uses a local linearization of the classifier to approximate the decision boundary and alter the images accordingly. The Jacobian-based Saliency Map Attack (JSMA) [26] uses a greedy iterative procedure, altering only the pixels which contribute most to the correct classification as identified by a saliency map. Each of these pixel-based adversarial attacks, while effective, requires at least partial knowledge of the network used for image manipulation detection. In contrast, low-level adversarial attacks such as JPEG compression or printing and scanning, the subject of this paper, require no such knowledge.
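To make the FGSM description concrete, the following is a minimal sketch on a toy logistic-regression "classifier" rather than a CNN. The weights, bias, input values, and epsilon are hypothetical and chosen only for illustration; the sketch also omits clipping the result back to a valid pixel range.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(w, b, x, y):
    """Binary cross-entropy of a logistic model on a single example."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def loss_gradient_wrt_input(w, b, x, y):
    """Analytic gradient of the loss with respect to the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, epsilon):
    """One FGSM step: shift every input value by epsilon in the
    direction that increases the loss."""
    grad = loss_gradient_wrt_input(w, b, x, y)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy model and input (three "pixels" scaled to [0, 1]).
w, b = [2.0, -3.0, 1.0], 0.1
x, y = [0.5, 0.2, 0.9], 1

x_adv = fgsm(w, b, x, y, epsilon=0.1)
loss_clean = bce_loss(w, b, x, y)
loss_adv = bce_loss(w, b, x_adv, y)
print(loss_clean, loss_adv)
```

Each input value moves by at most epsilon, yet the loss on the perturbed input is higher than on the clean one. The sketch also shows why the attack needs the gradient, and therefore knowledge of the model.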

Adversarial attacks in image forensics
While CNN-based classifiers have achieved high performance on benchmark image forensic tasks, recent research in computer vision has demonstrated that CNN-based manipulation detectors, like CNNs more broadly, are highly vulnerable to adversarial attacks. For example, in [29], the authors demonstrate that a GAN-based architecture can conceal 3x3 median filtering manipulation, one of the manipulations we explore in this paper. This type of adversarial attack causes a detector to label the image as non-manipulated, including for the CNN-based detectors proposed in [9] and [18]. Additionally, a method of adversarial attack based on small pixel-based distortions has been proposed for fooling global image manipulation detectors [30]. However, [31] notes that unlike in most pattern recognition tasks, pixel-based adversarial attacks such as Fast Gradient Sign Method (FGSM) [13] and Jacobian-based Saliency Map Attack (JSMA) [26], are not for the most part transferable between manipulation detection models.
Recent work has explored the vulnerability of image manipulation detectors to low-resolution median filtering [32] and JPEG compression [11,12,33]. To our knowledge, ours is the first paper to examine model vulnerability to printing and scanning.

Models
Here we describe our proposed model architecture for improved performance on printed and scanned images. We compare our model's performance with the model proposed in [9], the inspiration for the constrained convolutional layer. We additionally compare our model with XceptionNet (Xception) [34], as it and our proposed model have a nearly identical number of parameters and a similar architecture, so the difference in performance cannot be attributed to increased network capacity.

Proposed model
Our proposed architecture consists of one constrained convolutional layer [9], one convolutional layer, 34 separable convolutional layers, 5 pooling layers (4 max pooling, 1 global average pooling), and a final fully connected layer (see Fig. 1). Each convolutional layer was followed by ReLU activation, and max pooling was performed with a stride of 2 × 2.
In the constrained convolutional layer, a 5 × 5 filter is employed in which the sum of all the weights is constrained to be zero [9]. The center pixel is predicted by the rest of the pixels in the field, and the output of the filter can be interpreted as the prediction error, as suggested by research in steganalysis [25]. Specifically, the weights in the filter are constrained such that:

$$w(0,0) = -1, \qquad \sum_{(l,m) \neq (0,0)} w(l,m) = 1$$

where w refers to the weight, and l and m refer to the coordinates in the filter, where (0, 0) is the central weight.
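As an illustration of the zero-sum constraint, the sketch below projects an arbitrary 5 × 5 filter onto the constrained form. It assumes the formulation of [9] as we understand it (central weight fixed to -1, remaining weights rescaled to sum to 1); the function name and the random initial weights are hypothetical.

```python
import random

def project_constrained_filter(weights):
    """Project a square filter onto the prediction-error constraint.

    After projection the central weight is -1 and the remaining weights
    sum to 1, so the whole filter sums to zero.
    """
    size = len(weights)
    c = size // 2  # index of the central weight, i.e. w(0, 0)
    projected = [row[:] for row in weights]
    projected[c][c] = 0.0
    off_center_sum = sum(sum(row) for row in projected)
    if abs(off_center_sum) < 1e-12:
        raise ValueError("off-center weights sum to zero; cannot normalize")
    projected = [[v / off_center_sum for v in row] for row in projected]
    projected[c][c] = -1.0
    return projected

rng = random.Random(0)
raw = [[rng.uniform(0.0, 1.0) for _ in range(5)] for _ in range(5)]
constrained = project_constrained_filter(raw)

# The filter sums to zero, so a perfectly flat patch gives zero response.
total = sum(sum(row) for row in constrained)
flat_response = sum(w * 100.0 for row in constrained for w in row)
print(total, flat_response)
```

In training, such a projection would typically be re-applied after each weight update. The zero response on a flat patch illustrates how the layer suppresses image content and keeps only the prediction error.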
The purpose of the constrained convolutional layer is to constrain the model to learn image manipulation fingerprints, rather than image content and higher order features, such as those useful for object recognition and classification tasks. The prediction error fields are then used as low-level forensic trace features by the rest of the network to assist in classifying global image manipulation detection.
For the separable convolutional layers, a spatial convolution is performed independently for each channel and is followed by a point-wise (1 × 1) convolution, as proposed in [34]. These components decrease the number of free parameters, allowing the deep network to learn effectively even with a small training set and making it particularly appropriate for our investigation.
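The parameter saving can be checked with simple arithmetic. The sketch below counts weights (ignoring biases) for a standard convolution and for a depthwise separable convolution. The kernel size and channel counts are hypothetical and are not taken from the proposed architecture.

```python
def standard_conv_params(kernel, in_ch, out_ch):
    """Weights in a standard convolution: one k x k filter per (input, output) pair."""
    return kernel * kernel * in_ch * out_ch

def separable_conv_params(kernel, in_ch, out_ch):
    """Weights in a depthwise separable convolution."""
    depthwise = kernel * kernel * in_ch  # one k x k spatial filter per input channel
    pointwise = in_ch * out_ch           # 1 x 1 convolution mixing the channels
    return depthwise + pointwise

# Hypothetical layer: 3 x 3 kernel, 128 input and 128 output channels.
kernel, in_ch, out_ch = 3, 128, 128
standard = standard_conv_params(kernel, in_ch, out_ch)
separable = separable_conv_params(kernel, in_ch, out_ch)
print(standard, separable, standard / separable)
```

For this hypothetical layer the separable form uses 17,536 weights instead of 147,456, roughly an eightfold reduction.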
In this approach, we hope to leverage both the SRM-like features produced by the convolutional layer as well as the improved generalization ability provided by the added depth and separable layers.

Bayar2016
Proposed in 2016, the constrained convolution method of image manipulation detection, hereafter referred to as Bayar2016, uses a CNN with three convolutional layers (including the initial constrained convolutional layer), two max-pooling layers, and three fully connected layers [9]. This model demonstrates impressive results in discerning between the six manipulations investigated in this paper using the data set described in the next section, achieving 99.9% validation accuracy.

Xception
In addition to the shallow Bayar2016 network, we consider a deeper model, as recent work has demonstrated that increasing network depth can dramatically improve model generalization. To compare with a model of similar depth that also uses separable convolutional layers, we experiment with XceptionNet, a deep network comprising 42 layers, including separable convolutional layers [34]. The network design is built upon the Inception architecture [35], with the innovation of separable filters. Similar to Bayar2016, this model also achieves near 99% accuracy on the data set described in [9] before printing and scanning. While a variety of popular deep learning models could be appropriate for comparison, we compare with Xception due to (1) its comparable architecture and number of parameters and (2) its demonstrated image classification performance, performing in the top 1% accuracy on ImageNet [34,36].

Data sets
For accurate comparison, we follow the procedure described in [9], using images from the first IEEE IFSTC Image Forensics Challenge as described by [37]. The portion of the data set used consists of 3334 images of size 1024 × 768, which were further split into training, validation and testing data. The images were captured with several different digital cameras and depict both indoor and outdoor scenes.

Printing and scanning
We used three different printers and one scanner to create a data set of printed and scanned images: one Dell S3845CDN Laser Multifunction Printer, one Xerox Altalink C8070 Multifunction Printer, and one Xerox WorkCentre 7970 Multifunction Printer, which we refer to hereafter as Dell, Xerox1, and Xerox2, respectively. We printed 50 images of each manipulation type on each printer and used the Dell scanner to scan each image (see Fig. 2). After scanning and extracting the images from the resulting PDFs, each image was 1700 × 2200 pixels; it was then center-cropped to 1536 × 1792 pixels to remove the white border added by the scanning process. Each image was then split into 42 blocks of 299 × 299 pixels (or 256 × 256 pixels for Bayar2016), resulting in 2142 image blocks of each class from each printer (see Fig. 3). We limited our data creation to 900 full-page color images because of both budget constraints and environmental concerns; creating a synthetic data set through printing and scanning simulation may be an avenue of future work.
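The cropping and block-splitting steps can be sketched with coordinate arithmetic alone. The example below assumes the scan is 1700 pixels wide and 2200 pixels tall and that the blocks are non-overlapping; it covers only the 256 × 256 case used for Bayar2016, because the text does not specify how the 299 × 299 blocks are laid out. The function names are hypothetical.

```python
def center_crop_box(width, height, crop_w, crop_h):
    """Return (left, top, right, bottom) of a centered crop."""
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h

def block_boxes(box, block):
    """Tile a crop box with non-overlapping square blocks, row by row."""
    left, top, right, bottom = box
    boxes = []
    for y in range(top, bottom - block + 1, block):
        for x in range(left, right - block + 1, block):
            boxes.append((x, y, x + block, y + block))
    return boxes

# Scan of 1700 x 2200 pixels, center-cropped to 1536 x 1792.
crop = center_crop_box(1700, 2200, 1536, 1792)
blocks_256 = block_boxes(crop, 256)
print(crop, len(blocks_256))
```

With these assumptions the crop removes an 82-pixel margin on the left and right and a 204-pixel margin on the top and bottom, and the 6 × 7 grid of 256 × 256 blocks gives the 42 blocks per image stated above.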

Manipulations
Figure caption (fragment): We note that there is significant variation between the two images, similar to that introduced by the global manipulation methods with which we experimented.

Again following the procedure described in [9], we manipulated each image with each of six manipulation types: Additive White Gaussian Noise (AWGN), Gaussian blurring (GB), JPEG compression (JPEG), median filtering (MF), bilinear resampling (RS), and pristine or no manipulation (PR). We note that, due to the algorithms employed, JPEG compression and resampling might be reasonably similar to the printing and scanning process. For this reason, we additionally train and evaluate the models on a restricted set of four classes only, excluding JPEG and bilinear resampling.
• Additive white Gaussian noise constructs a noise matrix of the same shape as the image according to a normal distribution with a given sigma value and adds this matrix to the original image. The result is then normalized to values between 0 and 255.
• Gaussian blurring blurs the image using a Gaussian filter by convolving the input image with a given kernel.
• JPEG compression is a lossy compression method which compresses the image through converting the color map, down-sampling and Discrete Cosine Transform (DCT).
• Median filtering replaces each pixel with the median value of the neighboring pixels using a given kernel area.
• Bilinear resampling works similarly, resizing the image using the distance-weighted average of the neighboring pixels to estimate the new pixel value.
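Two of the manipulations listed above, additive white Gaussian noise and median filtering, are simple enough to sketch on a small grayscale image stored as nested lists. The sigma, kernel size, and test image are hypothetical and are not the Table 1 parameters. Treating "normalized to values between 0 and 255" as clipping, and replicating edge pixels at the border, are both assumptions of this sketch.

```python
import random
import statistics

def add_awgn(image, sigma, rng):
    """Add zero-mean Gaussian noise, then clip to the 8-bit range (assumption)."""
    return [[min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
            for row in image]

def median_filter(image, kernel=3):
    """Replace each pixel with the median of its kernel x kernel neighborhood.

    Border pixels reuse the nearest valid pixel (edge replication, an assumption).
    """
    h, w = len(image), len(image[0])
    r = kernel // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [image[min(h - 1, max(0, y + dy))][min(w - 1, max(0, x + dx))]
                      for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
            row.append(statistics.median(window))
        out.append(row)
    return out

# Hypothetical 5 x 5 flat image with a single bright impulse in the center.
flat = [[100] * 5 for _ in range(5)]
flat[2][2] = 255

filtered = median_filter(flat, kernel=3)
noisy = add_awgn(flat, sigma=2.0, rng=random.Random(0))
print(filtered[2][2], noisy[0][:3])
```

The median filter removes the isolated impulse entirely, which is the kind of smoothing trace a detector learns to recognize. Production code would more likely rely on an image-processing library than on nested lists.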
See Table 1 for manipulation parameter details.

Table 1 Parameter specifications for each manipulation type
We used the same parameters as in [9]

Experiments
We trained each model (our proposed model, Bayar2016, and Xception) on a variety of training sets and evaluated each trained model on multiple validation data sets (see Table 2). We first investigated the extent to which our selected models can correctly classify the validation images after printing and scanning. We trained each model on the original data set (before printing and scanning) with all six classes: Additive White Gaussian Noise (AWGN), Gaussian Blurring (GB), JPEG compression (JPEG), Median Filtering (MF), Bilinear Resampling (RS) and Pristine or no manipulation (PR). For a more complete analysis, we removed the Bilinear Resampling (RS) and JPEG compression (JPEG) classes from the training and validation sets and retrained the models, as these two classes could intuitively be considered similar to changes introduced during the printing and scanning process (see Table 3).
Second, we explored countering this vulnerability by training on the printed and scanned image blocks [13]. We trained each model on the printed and scanned image blocks from a single printer. The data set (see Table 2, Xerox1) consists of 50 full images (1700 × 2200 pixels), which were then divided into 299 × 299 blocks for our proposed model and Xception, and 256 × 256 blocks for the Bayar2016 model. This resulted in 2142 image blocks for each data set, which were divided into training and validation sets of size 1722 and 420, respectively, using only the central images to avoid including border artifacts from the scanning process.
Third, we created composite data sets, one consisting of all printed and scanned image blocks (from all three printers), and the other consisting of all printed and scanned image blocks as well as a number of image blocks from the original data set (before printing and scanning), at a size equivalent to those from one of the three printers. The first composite data set, which we refer to as Composite Printers, consists of 6426 image blocks (printed and scanned only), while the second consists of 8568 image blocks (75% printed and scanned, 25% original). The goal of this experiment was to evaluate if the poor accuracy fitting the printed and scanned data could be mitigated by dramatically increasing the size of the training set.
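The data set sizes quoted in these experiments follow from the per-printer block count. The short check below uses only the figures stated in the text; the variable names are hypothetical.

```python
BLOCKS_PER_PRINTER = 2142          # image blocks per printer data set, as stated
PRINTERS = ["Dell", "Xerox1", "Xerox2"]

# Composite Printers: all printed and scanned blocks from the three printers.
composite_printers = BLOCKS_PER_PRINTER * len(PRINTERS)

# Second composite set: adds original (not printed) blocks, one printer's worth.
composite_full = composite_printers + BLOCKS_PER_PRINTER
printed_fraction = composite_printers / composite_full

# Single-printer split into training and validation blocks.
train_size, val_size = 1722, 420

print(composite_printers, composite_full, printed_fraction)
```

The totals of 6426 and 8568 blocks, the 75%/25% split between printed and original blocks, and the 1722 + 420 single-printer split are all consistent with 2142 blocks per printer.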
Finally, we evaluated the performance of each of the models on identifying the printer of printed and scanned images (see Table 2, Printer Identification).

Hyperparameters
For Bayar2016, we used a batch size of 64, an initial learning rate of 0.01, stochastic gradient descent (SGD) with momentum 0.95, weight decay 0.0005, gamma 0.7, and step size 6.
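One common reading of "gamma 0.7 and step size 6" is a step decay schedule, in which the learning rate is multiplied by gamma every step_size steps. The sketch below assumes that reading and assumes the step unit is epochs; neither is confirmed by the text.

```python
def step_decay_lr(base_lr, gamma, step_size, step):
    """Learning rate after `step` steps under step decay (assumed schedule)."""
    return base_lr * gamma ** (step // step_size)

# Bayar2016 settings from the text: initial rate 0.01, gamma 0.7, step size 6.
schedule = [step_decay_lr(0.01, 0.7, 6, epoch) for epoch in (0, 5, 6, 12)]
print(schedule)
```

Under these assumptions the rate stays at 0.01 for the first six epochs, drops to about 0.007, and then to about 0.0049.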
We used similar hyperparameters for Xception and our proposed model. Specifically, for both models, we used the pre-trained weights from the network as trained on ImageNet. We again used SGD and set the batch size and learning rate at training time based on the number of GPUs, using batch_size = 4 × num_gpus for the batch size and 0.01 for the initial learning rate. We used momentum 0.9 and weight decay 0.0005. For learning rate decay, we used polynomial decay as described in [38]. For each model, we trained until the validation accuracy plateaued or began to fall.

Following the original methodology for Bayar2016, we retained only the green color layer of each image and divided it into 256 × 256 non-overlapping blocks, retaining nine central blocks. For our proposed model and for Xception, we retained all three color channels and split the images into 299 × 299 non-overlapping blocks, according to the input size of the original architecture (Figs. 4, 5).

Figure caption (fragment; see Table 2, Xerox1): We note that despite the high reported validation accuracy on the original data set, the model struggles to distinguish between the classes after printing and scanning.

Figure caption (fragment; see Table 2): We investigate the model's performance after removing bilinear resampling (RS) and JPEG compression, but find that it still performs little better than random.
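The batch-size rule and polynomial learning-rate decay mentioned for Xception and the proposed model can be sketched as follows. The exact schedule of [38] is not reproduced in the text, so the form used here (base rate times one minus training progress, raised to a power) and the power of 0.9 are assumptions, as is the total step count.

```python
def batch_size_for(num_gpus, per_gpu=4):
    """Batch size rule from the text: 4 x num_gpus."""
    return per_gpu * num_gpus

def poly_decay_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial decay (assumed form): base_lr * (1 - step / max_steps) ** power."""
    progress = min(step, max_steps) / max_steps
    return base_lr * (1.0 - progress) ** power

# Hypothetical run: 2 GPUs and 1000 training steps.
batch_size = batch_size_for(2)
lrs = [poly_decay_lr(0.01, s, 1000) for s in (0, 500, 1000)]
print(batch_size, lrs)
```

The rate starts at 0.01 and decays smoothly to zero at the final step, in contrast to a schedule that drops in discrete steps.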

Print-scan manipulation
To evaluate the general vulnerability of each of the models to printed and scanned images, we trained on the original IFSTC data set (before printing and scanning) and evaluated each model on validation sets from each of the three printers. When we evaluated the models on the printed and scanned validation sets, we found that each model performed only slightly better than random.
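For reference, "random" here can be read as the accuracy of uniform guessing, which depends only on the number of classes when the classes are balanced. The sketch below assumes balanced classes, which the text suggests but does not state for every validation set.

```python
def chance_accuracy(num_classes):
    """Expected accuracy of uniform random guessing on balanced classes."""
    return 1.0 / num_classes

six_class_chance = chance_accuracy(6)   # all six classes
four_class_chance = chance_accuracy(4)  # restricted set without JPEG and RS
print(round(six_class_chance, 3), four_class_chance)
```

Under this assumption the chance level is about 16.7% for six classes and 25% for the restricted four-class set.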
We additionally removed the bilinear resampling (RS) and JPEG compression classes, and found that the resulting models are similarly unable to correctly classify the remaining four manipulations, still performing at or below random. We additionally note that the models perform worse on the printed and scanned validation images than on the validation images after JPEG compression, a known vulnerability of these types of models, indicating that printing and scanning may be more effective at masking the manipulations [17] (see Table 3).

Table 3 Validation accuracy for various validation sets after training on IFSTC data set
We note that although all three models perform exceptionally well on the original IFSTC data set, each performs little better than random when evaluated on images from any of the three printers. Because JPEG compression and Bilinear Resampling (RS) could be reasonably inferred to be similar to printing and scanning, we remove these classes and train and evaluate on a restricted set of four classes (4c) (see Sect. 4). Despite this restricted set of manipulations, however, the models perform no better than random.

Table 4 Validation accuracy for various validation sets after training on Xerox1 data set (see Table 2, Xerox1)
Bold values refer to models that perform better than the rest. We trained each model on images from only the Xerox1 data set, that is, images after being printed and scanned on the first Xerox printer. We find that while no model is able to perfectly fit the printed and scanned data set, our proposed model significantly outperforms the current state-of-the-art models. We also note that transferability to other printers remains weak, indicating significant variance between the printers.

Cross-training on printed and scanned examples
We additionally trained each model on printed and scanned images from an individual printer (Xerox1) (see Sect. 3.6). We note that Bayar2016 and Xception achieve accuracies of 66.6% and 70.4%, respectively, while our proposed model is able to achieve an accuracy of 75.3%. It also appears that training on one printer does not lead to similar validation accuracy on examples from another printer, even one of the same make (see Table 4).

Composite training
To compensate for the small size of the data set for each printer alone, we created a composite data set, consisting of all of the printed and scanned examples (total size 6426 blocks), which we refer to as Composite Printers. However, we found that training on this composite data set did not improve validation performance on any single printer compared with training on images from that printer alone. While this is possibly due to a still insufficiently large training data set, it also likely provides further evidence that the difference between printers and scanners may be significant enough to preclude fitting a general printed and scanned data set (see Table 5). For completeness, we additionally created another composite data set, which we refer to as Composite Full, which consists of the same composition as Composite Printers plus an equivalent number of examples from the original data set (total size 8568), and found similar results (see Table 6).

Table 5 Validation accuracy for various validation sets after training on the composite printers data set
Bold values refer to models that perform better than the rest. One possible explanation for the poor validation accuracy on a single printer could be the small size of the data set. To investigate this, we combine the images from all three printers for training, but note that performance on a single printer does not improve. Here 4c indicates that we used the restricted set of manipulations (AWGN, GB, MF, and PR).

Printer identification
For comparison with work on camera model identification, we additionally experimented with printer identification on each of the three printers using the discussed models, and found that the models could distinguish between images from the printers with up to 95% accuracy. This is particularly impressive considering the accuracies were achieved using a relatively small set of training data (2410 image blocks) and without any additional metadata (see Table 7), indicating significant variance between the artifacts introduced by each printer [15].

Conclusions
We investigated the robustness of current state-of-the-art image manipulation detection models in the context of printing and scanning, and found that these models perform poorly on printed and scanned image data. We proposed a model architecture which performs 5% better than the state-of-the-art models when trained and evaluated on images from a single printer. We constructed a data set of over 6000 printed and scanned image blocks which we plan to release to the community for further investigation.
That current state-of-the-art models are vulnerable to printing and scanning is an important finding given the availability and ease of printing and scanning images versus constructing complex adversarial examples.
Further analysis suggests that the variability between images produced by each printer is large, significant enough for the models to easily distinguish between printers and for models trained on a single printer to generalize poorly to images from another printer. This conclusion may create additional challenges in designing models robust to printing and scanning, and sets it apart from work on creating models robust to more uniform and predictable JPEG compression. Future work may include developing methods to simulate printing and scanning in order to create larger data sets for training the models.
Table 7 Validation accuracy for printer identification by model
We investigate the variation of the images between printers by training each model to discern between printers. The high accuracy indicates that the images produced by each printer vary significantly.

Abbreviations
AWGN: Additive White Gaussian Noise; Bayar2016: Model described in [9]; CNN: Convolutional neural network; Dell: Dell S3845CDN laser multi-function printer; FGSM: Fast gradient sign method; GAN: Generative adversarial network; GB: Gaussian blurring; GPU: Graphics processing unit; IFSTC: IEEE IFSTC image forensics challenge data set as described in