Assessment Framework for Deepfake Detection in Real-world Situations

Detecting digital face manipulation in images and video has attracted extensive attention due to the potential risk to public trust. To counteract the malicious usage of such techniques, deep learning-based deepfake detection methods have been employed and have exhibited remarkable performance. However, the performance of such detectors is often assessed on related benchmarks that hardly reflect real-world situations. For example, the impact of various image and video processing operations and typical workflow distortions on detection accuracy has not been systematically measured. In this paper, a more reliable assessment framework is proposed to evaluate the performance of learning-based deepfake detectors in more realistic settings. To the best of our knowledge, it is the first systematic assessment approach for deepfake detectors that not only reports the general performance under real-world conditions but also quantitatively measures their robustness toward different processing operations. To demonstrate the effectiveness and usage of the framework, extensive experiments and detailed analysis of three popular deepfake detection methods are further presented in this paper. In addition, a stochastic degradation-based data augmentation method driven by realistic processing operations is designed, which significantly improves the robustness of deepfake detectors.


Introduction
In recent years, the rapid development of deep convolutional neural networks (DCNNs) and ease of access to large-scale datasets have led to significant progress on a broad range of computer vision tasks and have created a surge of new applications. For example, the recent advancement of generative adversarial networks (GANs) [1][2][3] has made it possible to generate realistic forged content that is difficult for humans to distinguish from its authentic counterpart. In particular, current deep learning-based face manipulation techniques [4][5][6][7] are capable of changing the expression, attributes, and even identity of a human face image, the outcome of which is popularly referred to as a 'deepfake'. The recent development of such technologies and the wide availability of open-source software have simplified the creation of deepfakes, increasingly damaging our trust in online media and raising serious public concerns. To counteract the misuse of these deepfake techniques and malicious attacks, detecting manipulations in facial images and video has become a hot topic in the media forensics community and has received increasing attention from both academia and businesses.
Nowadays, multiple grand challenges, competitions, and public benchmarks [8][9][10] are organized to assist the progress of deepfake detection. At the same time, with advanced deep learning techniques and large-scale datasets, numerous detection methods [4][11][12][13][14][15][16] have been published and have reported promising results on different datasets. However, some studies [17,18] have shown that the detection performance drops significantly in the cross-dataset scenario, where the fake samples are forged by other unknown manipulation methods. Therefore, cross-dataset evaluation has become an important step in recent studies to better show the advantages of deepfake detection methods, encouraging researchers [19][20][21] to propose detection methods with better generalization ability to different types of manipulations.
Nevertheless, another scenario that commonly exists in the real world has received little attention from researchers. In fact, it has long been shown that DCNN-based methods are vulnerable to real-world perturbations and processing operations [22][23][24] in different vision tasks. In more realistic conditions, images and video can face unpredictable distortions from the extrinsic environment, such as noise and poor illumination conditions, or constantly undergo various processing operations to ease their distribution. In the context of this paper, a deployed deepfake detector could mistakenly block a pristine yet heavily compressed image. On the other hand, a malicious agent could also fool the detector by simply adding imperceptible noise to fake media content. To the best of our knowledge, most of the current deep learning-based deepfake detection methods are developed based on constrained and less realistic face manipulation datasets, and therefore, they are not robust enough in real-world situations. Similarly, the conventional assessment approach, which exists in various benchmarks, often directly samples test data from the same distribution as training data and can hardly reflect model performance in more complex situations. In fact, most of the existing deepfake detection methods only report their performance on some well-known benchmarks in the community.
Therefore, a more reliable and systematic approach is first needed to assess the performance of a deepfake detector in more realistic scenarios and to further motivate researchers to develop robust detection methods. In this paper, a comprehensive assessment framework for deepfake detection in real-world conditions has been conceived for both image and video deepfakes. Notably, the realistic situations are simulated by applying common image and video processing operations to the test data. The performance of multiple deepfake detectors is measured under the impact of various real-world processing operations. At the same time, a generic approach to improve the robustness of the detectors has been proposed.
In summary, the following contributions have been made.
• A realistic assessment framework is proposed to evaluate and benchmark the performance of learning-based deepfake detection systems. To the best of our knowledge, this is the first framework that systematically evaluates deepfake detectors in realistic situations.
• The performance of several popular deepfake detection methods has been evaluated and analyzed with the proposed performance evaluation framework. The extensive results demonstrate the necessity and effectiveness of the assessment approach.
• Inspired by the real-world data degradation process, a stochastic degradation-based augmentation (SDAug) method driven by typical image and video processing operations is designed for deepfake detection tasks. It brings remarkable improvement in the robustness of different detectors.
• A flexible Python toolbox is developed and the source code of the proposed assessment framework is released to facilitate relevant research activities.
This article is an extended version of our recent publication [25]. The additional contents of this paper are summarized as follows.
• More recent deepfake detection methods have been summarized and introduced in the related work section.
• The proposed assessment framework has been extended to support the evaluation of video deepfake detectors.
• The performance of two current state-of-the-art deepfake detection methods has been additionally evaluated using the assessment framework.
• More substantial experimental results have been presented to better demonstrate the necessity and usage of the assessment framework. The performance and characteristics of four popular deepfake detection methods are analyzed in depth based on the assessment results.
• The impact of different image compression operations on the performance of deepfake detectors is additionally studied in detail.
• More experiments, comparisons, and cross-manipulation evaluations have been conducted for the proposed stochastic degradation-based augmentation method. Its effectiveness and limitations are further analyzed.

Deepfake detection
Deepfake detection is often treated as a binary classification problem in computer vision. Early on, solutions based on facial expressions [26], head movements [27] and eye blinking [28] were proposed to address such detection problems. In recent years, the primary solution to this problem has been to leverage advanced neural network architectures. Zhou et al. [29] proposed to detect deepfakes with a two-stream neural network. Rössler et al. [4] retrained an XceptionNet [30] on a manipulated face dataset, which performs strongly on their proposed benchmark. Nguyen et al. [11] combined traditional CNN and Capsule networks [31], which require fewer parameters. Some video deepfake detectors [32][33][34] leveraged recurrent convolutional neural networks to track forgery clues from the temporal sequences. Other creative attempts in network architectures include, but are not limited to, multi-task autoencoders [35,36], efficient networks [21,37] and vision transformers [38,39]. In addition, the attention mechanism, a well-known technique to highlight the informative regions, has also been applied to further improve the training process of the detection system.
Dang et al. [40] proposed a detection system based on an attention mechanism. Zhao et al. [12] designed multi-attention heads to predict multiple spatial attention maps. Their proposed attention map can be easily implemented and inserted into existing backbone networks. Besides focusing on the spatial domain, recent work [13][14][15][16][41] attempts to resolve the problem in the frequency domain. The theory behind them is based on the fact that current popular GAN-based image manipulation methods often introduce low-frequency clues due to the built-in up-sampling operation. These methods transform the image to the frequency domain via the DCT and separate information according to different frequency bands. As a result, the forgery traces are more effectively captured.
To tackle the generalization problem, one important branch of work directly trains models with fully synthetic data, which forces the models to learn more generic representations for deepfake detection. For example, the Xray [42] and SBIs [21] methods manually generate blended faces during the training process as fake samples, which reproduce the blending artifacts existing in real-world GAN-synthesized deepfakes. Both methods have achieved remarkable performance and notable generalization ability to certain types of manipulation methods. However, as explained by the authors, these methods are susceptible to many common perturbations, such as low resolution and heavy compression. In this paper, four different types of deepfake detectors [4,11,21,39] are adopted for experiments.

Deepfake detection competitions review
To assist in faster progress and better advancement of deepfake detection tasks, numerous large-scale benchmarks, competitions, and challenges [4][8][9][10] have been organized, the results of which have been made publicly available. Meta partnered with some academic experts and industry leaders and created the Deepfake Detection Challenge (DFDC) [8] in 2019. The competition provided a large incentive, i.e. 1 million USD, for experts in computer vision and deepfake detection to dedicate time and computational resources to train models for benchmarking. More recently, the Trusted Media Challenge (TMC) [10] was organized by AI Singapore with a total prize pool of up to 500k USD to explore how artificial intelligence technologies could be leveraged to combat fake media. Nevertheless, after a thorough investigation of the benchmarking results, a new question emerges: can the assessment approach adopted by the competitions reflect the performance of the detectors in realistic scenarios? Although both challenges tried to simulate real-world conditions by preprocessing part of the testing data with some common video processing techniques, they do not really differentiate the detectors. As shown in Table 1, the final results of the top-5 prize winners from DFDC [8] are extremely close and the ranking seems to be easily affected by some random noise, for example by simply taking out a few fake samples or adding a slightly more severe blurriness effect.
The current ranking approach in these competitions is not reliable. A more rigorous framework is introduced in this work, which is able to differentiate the detectors in multiple dimensions, i.e. general performance, general robustness in realistic conditions, and robustness to specific impacting factors.

Robustness benchmark
In recent years, research has been conducted to explore the robustness of CNN-based methods toward real-world image corruption. Dodge and Karam [22] measured the performance of image classification models with data disturbed by noise, blurring, and contrast changes. In [48], Hendrycks et al. presented a corrupted version of ImageNet [49] to benchmark the robustness of image recognition models against common image manipulations. The works in [50][51][52] focused on a safety-critical task, autonomous driving, and provided a robustness benchmark for various relevant vision tasks, such as object detection and semantic segmentation. Similar work has been done for face recognition tasks: [23,24,53,54] analyzed the robustness of CNN-based face recognition models toward face variations caused by illumination change, occlusion, and standard image processing operations. In the media forensics community, StirMark [55] tested the robustness of image watermarking algorithms. The ALASKA#2 dataset [56] was created following a careful evaluation of ISO parameters, JPEG compression, and noise level on Flickr images, etc., to help researchers design more general and robust steganographic and steganalysis methods. It is worth noting that two popular deepfake detection benchmarks, DFDC [8] and DeeperForensics-1.0 [9], also applied standard processing operations to part of the testing data. They randomly applied distortions to a small portion of test data and considered only one severity level for each processing operation. However, the way they evaluate a detector's robustness is not systematic enough. The assessment results cannot rigorously show to which extent the detector is affected by the distorted data, nor help identify which factors show more significant influence on the detector's performance. There is a lack of a fair and flexible methodology that systematically compares the performance of deepfake detectors in realistic situations. In this work, a new assessment framework is introduced to solve this problem.

Proposed assessment framework
Nowadays, deepfakes are distributed on the internet in both image and video formats. Some of the detection methods are targeted at both cases, while others are specially designed for one type of deepfake. The proposed assessment framework is designed in a way that the performance of a deepfake detector can be evaluated under either image or video scenarios.
In the context of this paper, the main difference between the two scenarios resides in the real-world processing operations applied to the test data. Specifically, in image scenarios, we first extract frames from video and treat them as image deepfakes. The image processing operations are then applied to the forgery images. In video scenarios, we treat them as video deepfakes and directly apply video processing operations to the fake video. In this section, the common realistic influencing factors and processing operations for image and video deepfakes are first introduced respectively. Then, the proposed assessment framework is described to provide a fair comparison for deepfake detectors under more realistic situations.

Table 1 Deepfake Detection Challenge (DFDC) [8] top-5 prize winners and their corresponding results

Team name                   Overall log loss
Selim Seferbekov [43]       0.4279
WM [44]                     0.4284
NTechLab [45]               0.4345
Eighteen Years Old [46]     0.4347
The Medics [47]             0.4371

Realistic influencing factors for image deepfakes
In a real-world situation, images are often processed by various digital image processing operations before being distributed. In more adverse cases, malicious deepfakes can be slightly corrupted to fool the detector while maintaining good perceptual quality. It is still unknown to which extent the popular deepfake detectors are able to make correct predictions. In this context, the most prominent factors have been considered in the assessment framework.
In general, the framework contains six categories of image processing operations or corruptions with more than ten minor types. Each type consists of multiple severity levels. The details of all operations used in the evaluations are described below.
Noise: Noise is a typical distortion, especially when images are captured in a low-illumination condition. To simulate the noise, Additive White Gaussian Noise (AWGN) is applied to the data and the pixel values are clipped to [0, 255]. In this paper, the variance value σ is selected in a range from 5 to 50. In addition, Poissonian-Gaussian noise [57] is also included to better reflect the realistic noise levels, whose parameters are learned from a group of real noisy pictures.
Resizing: Resizing is one of the most commonly used image processing operations. It refers to changing the dimensions of the media content to fit the display or other purposes. However, the resizing operation, more specifically the down-sampling operation, can significantly reduce the performance of modern deep learning-based detectors [58,59] due to a loss of discriminative information. This is often the case for earlier image content that is of poor quality. In this framework, the impact of the resizing operation is simulated by first downscaling the images and then upscaling them back using bicubic interpolation.
Image compression: Lossy compression refers to the class of data encoding methods that remove unnecessary or less important information and only use partial data to represent the content. These techniques are used to reduce data size for efficient storage and transmission of content and are widely applied in image processing. In this framework, JPEG compression artifacts are applied and the impact of different quality factors, i.e. from 30 to 95, on the deepfake detection system is evaluated. As deep learning-based compression techniques are becoming increasingly popular in this community, two AI-based image compression techniques [60,61] are also considered in this framework with multiple compression qualities to choose from.
Denoising: A typical way to reduce noise is by smoothing, which is a low-pass filtering applied to the image. The denoising operation is often applied to image content after it has been acquired by the camera, but at the same time, it tends to blur the media content and results in a reduction of details, which is harmful to the detection system. To measure the impact of the denoising operation, the blurriness effect is simulated in our framework by applying Gaussian filters with kernel size σ ranging from 3 to 11. Meanwhile, learning-based denoising techniques are gradually being deployed in practice. They recover a noisy image with higher quality but often bring unpredictable artifacts. The impact of applying the DnCNN technique [62] is assessed in the framework.
Enhancement: In realistic conditions, the image data captured in the wild can suffer from poor illumination. Image enhancement is frequently used to adjust the media content for better display. In this assessment framework, the contrast and brightness of the test data are modified by both linear and nonlinear adjustments. The former simply adds or subtracts a constant pixel value while the latter applies gamma correction.
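The two kinds of adjustment described above can be illustrated with a minimal sketch. It assumes 8-bit grayscale pixels stored as nested lists and the common gamma convention out = 255 · (in / 255)^γ; the function names and parameter values are hypothetical and do not come from the released toolbox.

```python
def adjust_linear(pixels, offset):
    """Linear adjustment: add a constant to every pixel, then clip to [0, 255]."""
    return [[min(255, max(0, p + offset)) for p in row] for row in pixels]


def adjust_gamma(pixels, gamma):
    """Nonlinear adjustment (assumed convention): out = 255 * (in / 255) ** gamma."""
    return [[round(255 * (p / 255) ** gamma) for p in row] for row in pixels]


# Tiny 2x3 grayscale image with illustrative values.
image = [[0, 64, 128], [192, 255, 32]]
brighter = adjust_linear(image, 40)   # constant offset, saturates at 255
darker = adjust_gamma(image, 2.0)     # gamma > 1 darkens the mid-tones
```

With this convention the linear offset shifts all values equally until they saturate, while gamma correction leaves pure black and pure white unchanged and only moves the mid-tones.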
Combinations: It is even more common that the media content suffers from multiple types of distortions and processing operations. Therefore, a mixture of two or three of the operations above is also considered, such as combining JPEG compression and Gaussian noise, making the test data better reflect more complex real-world scenarios.

Realistic influencing factors for video deepfakes
Face forgeries by deepfake technology are spread over the Internet not only in the form of images but also as video. Processing operations and various video effects are very common on different social media, smartphone applications, and streaming platforms. Their impact on the accuracy of detection methods should not be neglected.
The framework includes seven categories of video processing operations with commonly used parameters. An illustrative example of the testing data is shown in Fig. 2. The factors are described in detail as follows.
Video compression: Similar to images, uncompressed raw video requires a large amount of storage space. Although lossless video compression codecs can perform at a compression factor of 5 to 12, a typical lossy compressed video can achieve a much lower data rate while maintaining high visual quality. In fact, compression technologies for video provide the basis for the distribution of video worldwide. A deepfake video may propagate among social networks after being compressed several times. However, the possible side effect of lossy compression artifacts on deep learning-based detectors has not been sufficiently studied. It is necessary to test the robustness of a deepfake detector on compressed authentic and deepfake video. In this context, the proposed assessment framework includes test data compressed by the H.264 codec using the FFMPEG toolbox with two constant rate factors, namely 23 and 40.
Flip: Flipping a video horizontally describes the creation of a mirror video of the original footage. It is a very common video editing method that prevents video cuts from disorienting the viewer. But whether and to which extent the flipping operation can affect a deepfake detector has not been evaluated before. On the other hand, the vertical flipping operation is one of the easiest ways to fool a detector. In fact, most current detectors will not adjust or correct the face pose during the preprocessing step. Hence, one can simply upload a flipped video to avoid being detected while it is still readable to a human.
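As a hedged illustration of how such test copies could be produced, the sketch below only builds FFmpeg command lines for H.264 re-encoding at a given constant rate factor and for horizontal or vertical flipping. It relies on the standard `libx264` encoder with `-crf` and on the `hflip` and `vflip` filters; the helper names and file paths are hypothetical, and the exact commands used by the authors are not given in the text.

```python
def h264_command(src, dst, crf):
    """Build an FFmpeg call that re-encodes `src` with H.264 at the given CRF."""
    return ["ffmpeg", "-y", "-i", src, "-c:v", "libx264", "-crf", str(crf), dst]


def flip_command(src, dst, vertical=False):
    """Build an FFmpeg call that mirrors the video horizontally or vertically."""
    video_filter = "vflip" if vertical else "hflip"
    return ["ffmpeg", "-y", "-i", src, "-vf", video_filter, dst]


# Hypothetical file names; the two CRF values follow the text above.
commands = [h264_command("test.mp4", f"test_c{crf}.mp4", crf) for crf in (23, 40)]
commands.append(flip_command("test.mp4", "test_hflip.mp4"))
commands.append(flip_command("test.mp4", "test_vflip.mp4", vertical=True))

# Each list could be passed to subprocess.run(cmd, check=True)
# on a machine where FFmpeg is installed.
```

Building argument lists rather than shell strings avoids quoting problems with file names and keeps the severity level (here the CRF) an explicit parameter.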
Video filters: In recent years, video filters have become popular on social media. They are preset treatments included in many video editing apps, software, and social media platforms, providing easy access for users to alter the look of a video clip. Some common types of video filters include color filters, beauty filters, stylization filters, etc. The overall color palette of a deepfake video can be changed by a video filter on social media, making it an out-of-distribution sample from common deepfake databases. In the proposed assessment framework, two typical filters, 'Vintage' and 'Grayscale', are considered.
Brightness: Brightness is a measure of the overall lightness or darkness of a video. Adjusting the brightness of a video can affect the way that colors are perceived, as well as the visibility of details and textures. For example, increasing the brightness can make it easier to see details in shadows, while decreasing the brightness can obscure details in highlights. In real-world conditions, the brightness of a video is often adjusted to give the video a different sense of style. The assessment framework takes this situation into consideration and measures the performance of a detector under different brightness conditions. More specifically, the 'Lighten' and 'Darken' commands in the FFMPEG toolbox are applied to the testing video, respectively.
Contrast: Contrast refers to the difference between the lightest and darkest areas of a video. Similar to brightness, adjusting contrast is one of the most common operations to change the visual appearance of a video. The 'Contrast' command in the FFMPEG toolbox is employed to increase the contrast of the testing video.
Noise: Similar to images, video noise is a common problem in video clips shot in low-light conditions or with small-sensor devices such as mobile phones. It often appears as annoying grains and artifacts in the video. Gaussian noise with a temporal variance but fixed strength is applied to the video data.
Resolution: Resolution refers to the number of pixels in a video. There is an important trade-off between the resolution and the file size. Decreasing the resolution of a video will generally result in a low-quality video with fewer details to be displayed on the screen. But it can also reduce the file size, which makes the video easier to store and share. In addition, a resolution change can also affect the ratio of width to height of the video. The performance of the deepfake detector when facing low-resolution or stretched video is evaluated by the proposed framework.

Assessment methodology
Current deepfake detection algorithms are based on deep learning and rely heavily on the distribution of the training data. These methods are typically evaluated using a test dataset that is similar to the training sets. Some benchmarks, such as [8,9], attempt to measure the performance of deepfake detectors under more realistic conditions by adding random perturbations to partial test data and mixing them with the rest. However, there is no standard approach for determining the proportion or strength of these perturbations, which makes the results of these benchmarks more stochastic and less reliable.
The assessment methodology proposed in this paper aims to more thoroughly measure the impact of various influencing factors, at different severity levels, on the performance of deepfake detection algorithms. In this section, the principle and usage of our assessment framework are introduced in detail. First, the deepfake detector is trained on its original target datasets, such as FaceForensics++ [4]. The processing operations and corruptions in the framework are not applied to the training data. Then, as illustrated in Fig. 3, multiple copies of the test set are created, and each type of distortion at one specific severity level is applied to the copies independently. The standard test data together with the different distorted data are fed to the deepfake detector respectively. Finally, the detector generates "real or fake" predictions. During the entire evaluation, the true positive rate (TPR) and false positive rate (FPR) are measured by comparing the detector's predictions with the binary ground-truth labels. The ROC curve is plotted and the Area Under the Curve (AUC) score is reported as the final metric. An overall evaluation score can be obtained by averaging the scores from each distortion style and strength level to report the general performance of a tested detector. Besides, the computed metrics can also be grouped by each operation category to further analyze the robustness of one deepfake detector to a specific processing operation.
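The scoring and aggregation steps of this methodology can be sketched as follows. This is a minimal illustration and not the released toolbox: the AUC is computed through its rank-statistic form (the probability that a randomly chosen fake sample receives a higher score than a randomly chosen real one, with ties counted as one half), which equals the area under the ROC curve. The function names, the key layout `(category, operation, level)`, and the toy numbers are assumptions made for the example.

```python
from collections import defaultdict


def auc_score(labels, scores):
    """Area under the ROC curve; label 1 = fake, label 0 = real."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute the AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(results):
    """Average AUCs overall and per operation category.

    `results` maps (category, operation, severity_level) to an AUC value.
    """
    overall = sum(results.values()) / len(results)
    grouped = defaultdict(list)
    for (category, _operation, _level), auc in results.items():
        grouped[category].append(auc)
    per_category = {c: sum(v) / len(v) for c, v in grouped.items()}
    return overall, per_category


# Toy example: same ground truth, detector scores on one clean and one distorted copy.
labels = [1, 1, 0, 0]
clean_auc = auc_score(labels, [0.9, 0.8, 0.2, 0.1])
noisy_auc = auc_score(labels, [0.6, 0.4, 0.5, 0.1])

results = {
    ("noise", "awgn", 1): clean_auc,
    ("noise", "awgn", 2): noisy_auc,
    ("compression", "jpeg", 1): 0.75,
}
overall, per_category = summarize(results)
```

Keeping one AUC per (operation, level) pair is what allows the same numbers to be reported either as a single overall score or broken down by category.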
In addition, to relieve the storage burden caused by the multiple copies of the test set, a Python toolbox is developed that applies the distortions online. It hard-codes the digital processing operations and exposes the strength level as a parameter. It follows the same format as the transforms module of the torchvision library and can be easily integrated into the evaluation process.
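A transform of this kind might look like the sketch below, which mimics the callable-object pattern of torchvision transforms without importing the library. The class names, the mapping from severity level to σ, and the nested-list image format are all assumptions for illustration and do not reproduce the actual toolbox API; σ is treated here as a standard deviation.

```python
import random


class GaussianNoise:
    """Callable distortion with a severity level (hypothetical, for illustration)."""

    # Illustrative level-to-sigma mapping spanning the 5 to 50 range from the text.
    SIGMAS = {1: 5, 2: 10, 3: 20, 4: 30, 5: 50}

    def __init__(self, level, seed=None):
        self.sigma = self.SIGMAS[level]
        self.rng = random.Random(seed)

    def __call__(self, pixels):
        # Add zero-mean Gaussian noise per pixel, then clip to [0, 255].
        return [
            [min(255, max(0, round(p + self.rng.gauss(0, self.sigma)))) for p in row]
            for row in pixels
        ]


class Compose:
    """Apply a list of callables in order, like a transforms pipeline."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, x):
        for transform in self.transforms:
            x = transform(x)
        return x


image = [[0, 64, 128], [192, 255, 32]]
pipeline = Compose([GaussianNoise(level=3, seed=0)])
noisy = pipeline(image)
```

Because the distortion is applied when a sample is loaded, only the original test set needs to be stored, and a new severity level is a constructor argument rather than a new copy of the data.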

Stochastic degradation-based augmentation
To improve the ability of deepfake detection methods to handle realistic distortions and pre-processing operations, an effective data augmentation approach is proposed which leads to a robustness improvement.
Standard data augmentation methods often introduce geometric and color space transformations to enrich training data and improve the model generalization ability. But according to our experiments, this type of augmentation technique is less effective for deepfake detection under realistic conditions.
Motivated by a typical data acquisition and transmission pipeline in the real world, the stochastic degradation-based augmentation (SDAug) method is proposed. The main novelty of the proposed augmentation technique resides in the fact that it is driven by the typical operations that images and video are subject to in realistic conditions. Based on the observation of the data degradation process, a carefully designed augmentation chain is conceived, which allows the training data to better resemble real-world conditions and further boosts the performance of deepfake detection methods.
Generally, the brightness and contrast of the input image x are first modified by the image enhancement operator enh. Afterward, the image is convolved with an image blurring kernel f, followed by additive Gaussian noise n. In the end, JPEG compression is applied to obtain the augmented training data x_aug. The augmentation chain is described by the following formula, where ⊛ denotes convolution and JPEG(·) denotes JPEG compression.
$$x_{aug} = \mathrm{JPEG}\big( (\mathrm{enh}(x) \circledast f) + n \big) \tag{1}$$
In addition, unlike the common data augmentation process, the SDAug method is implemented in a stochastic manner. The term 'stochastic' can be interpreted in the following two aspects. First, each aforementioned augmentation operation will occur with a certain probability in the augmentation chain. Second, each operation will use a random severity level for every frame. The realistic scenario is rather complex and does not necessarily consist of multiple types of distortions and processing operations. A random mixture of several distortions and severity levels can create more diversity in the augmented training data. Moreover, stochastic augmentation helps preserve more information from the original training data and therefore prevents accuracy loss on high-quality data. In detail, the augmentation operations are explained in sequence as follows.
Enhancement: The augmentation chain begins with an image enhancement operation. With a probability of 50%, either a brightness or a contrast operation is applied to the training data, which are then non-linearly modified by a factor randomly selected from [0.5, 1.5].
Smoothing: An image blurring operation is then applied with a selected probability of 50%. Either a Gaussian blur or an average blur filter is used with a kernel size varying in the range [3, 15].
Additive Gaussian noise: For each batch of training data, Gaussian noise is added with a probability of 30%. The standard deviation of the Gaussian noise varies randomly in the interval [0, 50].
JPEG compression: Finally, JPEG compression is applied with a selected probability of 70%. The quality factor corresponding to the compression is randomly chosen in the range [10, 95].
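The stochastic behavior of the chain can be sketched by sampling, for each training sample, which operations fire and with which parameters. The probabilities and ranges below follow the text; everything else is an assumption for illustration: the function names, the restriction to odd kernel sizes, the uniform sampling within each range, and the `ops` mapping that would hold the actual image operations from an image library.

```python
import random


def sample_sdaug_plan(rng):
    """Sample one augmentation plan: an ordered list of (operation, parameters)."""
    plan = []
    if rng.random() < 0.5:  # enhancement, probability 50%
        plan.append(("enhance", {
            "kind": rng.choice(["brightness", "contrast"]),
            "factor": rng.uniform(0.5, 1.5),
        }))
    if rng.random() < 0.5:  # smoothing, probability 50%
        plan.append(("blur", {
            "kind": rng.choice(["gaussian", "average"]),
            "kernel_size": rng.randrange(3, 16, 2),  # odd sizes in [3, 15] (assumed)
        }))
    if rng.random() < 0.3:  # additive Gaussian noise, probability 30%
        plan.append(("noise", {"sigma": rng.uniform(0, 50)}))
    if rng.random() < 0.7:  # JPEG compression, probability 70%
        plan.append(("jpeg", {"quality": rng.randint(10, 95)}))
    return plan


def apply_plan(x, plan, ops):
    """Apply the sampled operations in order; `ops` maps names to callables."""
    for name, params in plan:
        x = ops[name](x, **params)
    return x


rng = random.Random(0)
plans = [sample_sdaug_plan(rng) for _ in range(1000)]

# Placeholder operations that return the input unchanged; real ones would
# come from an image processing library.
identity_ops = {name: (lambda x, **params: x) for name in ("enhance", "blur", "noise", "jpeg")}
augmented = apply_plan([[0, 128, 255]], plans[0], identity_ops)
```

Since every step is skipped with some probability, a fraction of the training samples passes through the chain unchanged, which is consistent with the stated goal of preserving accuracy on high-quality data.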

Experimental results
In this work, numerous experiments have been conducted to demonstrate the effectiveness and usage of the proposed assessment framework. The experimental setup is described at the beginning of this section, followed by the substantial assessment results and analysis for both image and video scenarios. Then, the impact of three image compression technologies on deepfake detectors is further discussed as an example of the multiple applications of the framework. In the end, the effectiveness of the proposed augmentation technique is reported and analyzed.

Datasets
Two widely used face manipulation datasets are selected in this paper for extensive experimentation. For both datasets, a strict split suggested by the dataset provider is followed, and the video used for training does not appear in the validation and testing stages.
FaceForensics++ [4], denoted by FFpp, contains 1000 pristine and 4000 manipulated video generated by four different deepfake creation algorithms. In addition, raw video contents are compressed with two quality parameters using the AVC/H.264 codec, denoted as C23 and C40. In the experiments, the training set is denoted as FFpp-Raw, FFpp-C23, and FFpp-C40 when the model is trained on single-quality-level data, while it is denoted as FFpp-Full when data of all three quality levels are involved in training. In contrast, to provide a fair baseline, only uncompressed data are used for the final assessment.
Celeb-DFv2 [63] is another high-quality dataset, with 590 pristine celebrity video and 5639 fake video. The test data are selected as recommended by [63] while the rest are left for training purposes and are split into training and validation sets of 80% and 20%, respectively.

Detection methods
Experiments have been conducted with the following learning-based deepfake detectors, all of which have reported excellent performance on popular benchmarks.
Capsule-Forensics is a deepfake detection method based on a combination of capsule networks and CNNs. The capsule network was initially proposed by [31] to address some limitations of CNNs, and it uses considerably fewer parameters than a traditional CNN when training very deep neural networks. The work in [11] employed the capsule network as a component in a deepfake detection pipeline for detecting manipulated images and video. At the time, this method achieved the best performance on the FaceForensics++ dataset among competing methods.
XceptionNet [30] is a popular CNN architecture in many computer vision tasks and has been used as a classification network to detect face manipulations. Rössler et al. [4] first adopted it as a baseline in the FaceForensics++ benchmark along with three other approaches. The detection system based on the XceptionNet architecture was first pre-trained on the ImageNet database [49] and then re-trained on a specific dataset for the deepfake detection task. It achieved excellent performance in the FaceForensics++ benchmark on both compressed and uncompressed contents and has become a popular baseline for recent deepfake detection approaches.
SBIs [21] refers to a data synthesis method, Self-blended Images, which is specially designed for deepfake detection tasks. This method aims to generate hard-to-recognize fake samples that contain common face forgery traces, to encourage the model to learn more general and robust representations for face forgery detection. The overall detection system is based on a pre-trained deep classification network, EfficientNet-b4 [64]. After retraining with the SBIs technique, the detector demonstrates an impressive generalization ability to different unseen face manipulations and achieves the current state-of-the-art in cross-dataset settings. However, its robustness to common image and video processing operations has not been measured.
UIA-VIT [39] detects face forgery using a vision transformer. This approach jointly trains an end-to-end pipeline that both classifies deepfake images and estimates the modified areas in an unsupervised manner. Overall, the UIA-VIT method focuses on intra-frame inconsistency without pixel-level annotations and achieves state-of-the-art performance regarding generalization ability.

Training details
The Capsule-Forensics, XceptionNet, and UIA-VIT methods are trained with the Adam optimizer with β₁ = 0.9 and β₂ = 0.999. Following the hyper-parameters suggested in the original papers, the Capsule-Forensics model is trained from scratch for 25 epochs with a learning rate of 5 × 10⁻⁴, the XceptionNet model is trained for 10 epochs with a learning rate of 1 × 10⁻³, and the UIA-VIT model is trained for 8 epochs with a learning rate of 3 × 10⁻⁵. During training, 100 frames are randomly sampled from each video in the training set. For evaluation and testing, 32 frames are extracted from each video in the validation and test sets. Extracted frames are pre-processed and cropped around the face regions using the dlib toolbox [65]. The face regions are finally resized to 300 × 300 pixels before being fed to the network.
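The frame selection described above can be sketched as follows. The text does not say how the 32 evaluation frames are chosen, so the evenly spaced selection below is an assumption, as are the helper names and the behavior for videos shorter than the requested number of frames.

```python
import random


def sample_train_frames(num_frames, k=100, rng=None):
    """Randomly pick up to k distinct frame indices from a training video."""
    rng = rng or random.Random()
    k = min(k, num_frames)               # short videos contribute all their frames
    return sorted(rng.sample(range(num_frames), k))


def pick_eval_frames(num_frames, k=32):
    """Pick up to k evenly spaced frame indices (assumed strategy, see lead-in)."""
    k = min(k, num_frames)
    return [i * num_frames // k for i in range(k)]


train_idx = sample_train_frames(500, rng=random.Random(0))
eval_idx = pick_eval_frames(320)
print(len(train_idx), eval_idx[:4])
```

A deterministic selection for evaluation keeps the reported scores reproducible across runs, whereas random sampling during training adds variety between experiments.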
The SBIs method has a different experimental setting from the previous three methods. It is retrained with the SAM [66] optimizer for 100 epochs. The batch size and learning rate are set to 32 and 1 × 10⁻³, respectively. During the training phase, only authentic high-quality video is used, and the corresponding fake samples are created by the self-blending method proposed in [21].

Performance metrics
During evaluation, the Area Under the Receiver Operating Characteristic Curve (AUC) is used as the metric in all experiments.
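For reference, the AUC can be computed without tracing the ROC curve, using its equivalent pairwise formulation: the probability that a randomly chosen fake sample receives a higher score than a randomly chosen real one, with ties counted as one half. The sketch below is a plain-Python illustration. The label convention (1 for fake, 0 for real) and the function name `roc_auc` are assumptions, and the quadratic loop is only suitable for small test sets.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (fake, real) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]   # scores of fake samples
    neg = [s for y, s in zip(labels, scores) if y == 0]   # scores of real samples
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75: three of the four (fake, real) pairs are ordered correctly
```

Because the AUC depends only on the ranking of the scores, it does not require choosing a decision threshold, and a value of 0.5 corresponds to random guessing.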

Assessment results on realistic image deepfakes
In this section, the performance of the Capsule-Forensics, XceptionNet, and UIA-VIT methods is measured when facing more realistic image deepfakes produced by the assessment framework. The three deepfake detectors are trained on the original unaltered training sets of FFpp and Celeb-DFv2, respectively. The assessment framework then evaluates the performance of these detectors and summarizes the results, as shown in Table 2 and Fig. 4.
In general, the findings lead to the following conclusions. First, even mild real-world processing operations can have a noticeable negative impact on detection accuracy. The first two detectors present exceptional performance on unaltered FFpp and Celeb-DFv2 testing data, as expected, but show severe performance deterioration on all kinds of modified data from the assessment framework, which indicates a lack of robustness. Although UIA-VIT is known for its outstanding generalization ability, it also suffers from performance degradation when facing processing operations.
Second, the Capsule-Forensics and XceptionNet methods are susceptible to different types of perturbation. When trained on the same high-quality dataset, the Capsule-Forensics method is generally more robust toward JPEG compression, synthetic noise, and gamma correction, while XceptionNet occasionally shows slightly better results, which may be due to statistical variation. The results from the assessment framework thus provide valuable guidance for improving a specific deepfake detector. Moreover, among the considered influencing factors, noise and blur have the most prominent effect on deepfake detectors. The performance of both detectors deteriorates rapidly as the severity levels of these two distortions increase.
Finally, the impact of quality variants of training data on learning-based detectors has been analyzed based on the assessment results. When trained only with very high-quality data (FFpp-Raw), both the Capsule-Forensics and XceptionNet models are extremely sensitive to nearly all kinds of realistic processing operations. In contrast, training the model with relatively low-quality data slightly improves the robustness toward low-intensity processing operations and distortions, but at a cost on the original high-quality testing set. For example, both models trained with compressed data (FFpp-C23, FFpp-Full) show a higher AUC score on our realistic benchmark, but their performance on original unaltered data decreases by 0.5-1%. However, although training with compressed data slightly improves the robustness of UIA-VIT against compression and noise, it has a more negative impact when facing other processing operations.

Assessment results on realistic video deepfakes
In addition to images, the framework provides a comprehensive evaluation of the four detection methods, i.e., Capsule-Forensics, XceptionNet, SBIs, and UIA-VIT, on video deepfakes under real-world conditions. Table 3 summarizes the performance of the four deepfake detection methods using the proposed realistic benchmark.
When trained with high-quality data, both the Capsule-Forensics and XceptionNet methods show a trend similar to that in the previous image deepfake detection benchmark and perform poorly when facing pre-processed video deepfakes. The SBIs and UIA-VIT methods outperform the other two detectors and present relatively stable scores under most video processing operations, particularly for artifacts introduced by changing brightness or applying video filters.
However, when the first two methods are trained directly on compressed data, they are more robust toward multiple processing operations and even outperform the SBIs method, whose overall score instead decreases by 0.66%. On the other hand, none of the three methods can properly classify video deepfakes processed by heavy compression, resolution reduction, or video noise.
In addition to benchmarking overall performance, the assessment framework also provides the means to analyze the behavior of a method under one specific realistic situation and helps reveal the mechanism behind it. For instance, it is interesting to observe that, regardless of the training data, the SBIs method is more robust to geometric transformation than the other two and retains a good ability to accurately classify a vertically flipped video. This is because the SBIs method is based on local forgery traces instead of the global inconsistency on the face.
While the generalization problem is well-explored by synthetic data-based methods, how to improve robustness toward processing operations and distortions that exist in the real world is still an open question. This paper provides a systematic benchmarking approach that helps reveal the drawbacks of general deepfake detectors. For instance, although the SBIs method demonstrates a good generalization ability in the cross-dataset experiments of the original paper [21], our assessment framework shows that it is susceptible to some common perturbations in the real world, such as video compression, video noise, and low resolution.
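The per-operation evaluation behind such a benchmark can be sketched as a simple loop in which each processing operation is applied to the test data separately and the detector is scored on every variant. Everything in the sketch below is a toy stand-in introduced for illustration: a "sample" is a single number, the "detector" is a threshold rule, and a plain accuracy replaces the AUC used in the paper so that the block stays short.

```python
def assess(detector, samples, labels, distortions, metric):
    """Score a detector on the unaltered test set and on each distorted variant."""
    report = {"unaltered": metric(labels, [detector(x) for x in samples])}
    for name, distort in distortions.items():
        # Each operation is applied to the test data on its own, not chained.
        scores = [detector(distort(x)) for x in samples]
        report[name] = metric(labels, scores)
    return report


def toy_detector(x):
    """Hypothetical detector: outputs a 'fake' score of 1.0 above a fixed threshold."""
    return 1.0 if x > 0.5 else 0.0


def accuracy(labels, scores):
    """Stand-in metric for this sketch; the paper reports AUC instead."""
    return sum(int(s >= 0.5) == y for y, s in zip(labels, scores)) / len(labels)


samples, labels = [0.2, 0.4, 0.6, 0.9], [0, 0, 1, 1]
distortions = {"darken": lambda x: x * 0.5, "identity": lambda x: x}
report = assess(toy_detector, samples, labels, distortions, accuracy)
print(report)
```

Keeping the operations in a dictionary makes it straightforward to add a new distortion or severity level without touching the evaluation loop, which is the property that allows one specific situation to be studied in isolation.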

Impact of different image coding algorithms
The assessment framework additionally provides means to measure the impact of a specific type of processing operation on the performance of a deepfake detector. For instance, image compression is almost inevitable during the distribution of a fake image. Meanwhile, AI-based compression technologies have become increasingly popular and are often capable of obtaining relatively smaller bitstreams. However, it is unknown to what extent learning-based compression algorithms affect deepfake detection methods compared to conventional JPEG compression.
In this section, a detailed comparison is made between JPEG compression and two popular AI-based image compression methods, denoted by bmshj [60] and hific [61], respectively. In detail, the Capsule-Forensics and XceptionNet methods are first trained on uncompressed data. Afterward, their performance on differently compressed data is evaluated using the framework and reported in Fig. 5. Image compression generally has a more negative impact on XceptionNet than on the Capsule-Forensics method. The latter obtains relatively high AUC scores when the test data are compressed by JPEG with high compression factors. Although the bmshj-based compression method is capable of achieving lower bitrates than JPEG, it has a significant negative impact on both detectors, whose predictions are close to random guessing regardless of the selected compression factor. In contrast, both tested detectors are more robust to test data compressed using the hific codec than using JPEG or the bmshj codec, even at extremely low bitrates. The results reported in this section imply that the hific codec introduces fewer of the adversarial artifacts that can disrupt the functionality of other AI-based detectors.

Experimental results with augmentation
Table 4 shows the evaluation results of the Capsule-Forensics and XceptionNet methods trained on the unaltered FFpp dataset together with the proposed augmentation strategy. The models trained with the proposed stochastic degradation augmentation method are denoted as +SDAug.
In comparison, it is evident that training with the stochastic degradation-based augmentation technique on the same dataset remarkably improves the performance on nearly all kinds of processed data, even at intense severity. For example, previous experiments show that the detectors are more vulnerable to synthetic noise and blur. The sub-figures in Figs. 6 and 7 further illustrate the impact of increasing the severity of these distortions on the two detection methods. The data augmentation scheme significantly improves the robustness while maintaining high performance on original unaltered data.
It is worth noting that the performance improves not only on the four types of processing operations that appear during data augmentation but also on other kinds of distortions. As shown in Table 4 and the last two sub-figures in Figs. 6 and 7, both detectors are much more robust toward learning-based compression, low-resolution effects, and other mixed distortions. A similar observation is obtained from the video deepfake assessment framework, see Tables 5 and 6. Although these video processing operations are not present in the proposed augmentation chain, the SDAug technique brings performance improvement to the Capsule-Forensics and XceptionNet methods on nearly all kinds of processed video deepfakes.
To compare with conventional augmentation methods based on geometric and color space transformation, the well-known Augmix [67] augmentation technique is evaluated under the same realistic assessment framework. This method generates multiple augmentation chains that work in parallel by randomly applying transformations to the training data. Augmix brings limited improvements to the robustness of the detector compared to SDAug, see Table 4. Its overall performance is even worse than simply training with low-quality data, which implies that the traditional data augmentation method is less practical when facing real-world distortions.
To show the effectiveness of the stochastic mechanism, an extra model has been trained using the same degradation-based augmentation chain but without randomness, which means that the input data are processed by all the augmentation operations with a fixed strength level. The corresponding experimental results are also reported in Table 4 and Figs. 6 and 7, denoted as +DAug. The models trained with DAug improve the performance on multiple kinds of processed data, but their AUC scores degrade heavily on the original unmodified data. In comparison, the model trained with SDAug shows a more significant robustness improvement while maintaining high performance on original high-quality data.
Finally, cross-dataset evaluations have been conducted for the Capsule-Forensics and XceptionNet methods to evaluate the generalization ability of the models trained with the proposed augmentation technique. First, the two detectors are trained on the FFpp dataset but tested on the Celeb-DFv1 and Celeb-DFv2 test sets for frame-level AUC scores. The two methods obtain very low scores on the new datasets. In comparison, the proposed augmentation scheme brings a noticeable performance improvement for both detectors on the new datasets, showing its capability to improve the generalization ability on unseen face forgery content. Moreover, we conduct further cross-manipulation experiments on FaceForensics++, which consists of four types of manipulations, namely DeepFakes, Face2Face, FaceSwap, and NeuralTextures. Specifically, the XceptionNet model is trained on one type of manipulation and tested on the remaining three. The results shown in Fig. 8 indicate that the model trained with SDAug consistently achieves superior generalization performance.
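The cross-manipulation protocol amounts to training once per manipulation type and testing on the other three, which yields twelve train/test pairs. A minimal sketch is given below; `train_fn` and `eval_fn` are hypothetical callbacks standing in for the actual training and AUC evaluation routines, and the stub implementations exist only to make the example executable.

```python
MANIPULATIONS = ["DeepFakes", "Face2Face", "FaceSwap", "NeuralTextures"]


def run_cross_manipulation(manipulations, train_fn, eval_fn):
    """Train on one manipulation type and evaluate on each of the remaining ones."""
    results = {}
    for train_type in manipulations:
        model = train_fn(train_type)          # one model per training manipulation
        for test_type in manipulations:
            if test_type != train_type:
                results[(train_type, test_type)] = eval_fn(model, test_type)
    return results


# Stub callbacks (hypothetical): a "model" is just the name of its training set,
# and the "score" is a placeholder string instead of a real AUC value.
trained = []


def stub_train(name):
    trained.append(name)
    return name


def stub_eval(model, test_type):
    return f"{model}->{test_type}"


results = run_cross_manipulation(MANIPULATIONS, stub_train, stub_eval)
print(len(results))  # 12 train/test pairs
```

Training once in the outer loop and reusing the model for all three test sets avoids retraining for every pair, which matters when each training run is expensive.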

Limitations and Future Work
The experiments carried out in this paper are mainly limited to video deepfakes or standard-quality image deepfakes. The detection of HD single-image deepfakes created by completely different methods, such as GANs, has not been evaluated with the proposed assessment framework. Although preliminary explorations have been done in previous work [68], more advanced techniques have recently emerged to create HD single-image deepfakes, based not only on GANs but also on Diffusion Models, together with corresponding detection methods. It would be interesting to extend the assessment framework to study the robustness of state-of-the-art HD image deepfake detectors.
On the other hand, although the proposed augmentation technique is in general very helpful in improving the robustness of deepfake detectors when facing various real-world image and video processing operations, some limitations have been observed in the previously reported results. First, the augmentation chain is hand-designed and the selection of hyperparameters might not be optimal. The proposed augmentation chain could be improved by conducting a parameter search with AutoML technology. Second, according to Table 5, the augmentation method generally provides limited help for the SBIs method, because SBIs is entirely based on synthetic data and the augmentation can possibly corrupt the manually designed forgery traces. It could be promising to incorporate the proposed augmentation operations into the forgery data synthesis process to further improve the robustness of detectors based on synthetic forgery data.

Conclusion
Most current deepfake detection methods are designed to perform as well as possible on specific benchmarks. However, it has been shown that the current assessment and ranking approaches employed in related benchmarks are less reliable and insightful. In this work, a more systematic performance assessment approach is proposed for deepfake detectors in realistic situations. To show the necessity and usage of the assessment framework, extensive experiments have been performed, in which the robustness of four popular deepfake detectors is reported and analyzed. Furthermore, motivated by the assessment results, a new data augmentation chain based on a natural data degradation process has been conceived and shown to significantly improve the model's robustness.

Fig. 1
Fig. 1 Example of a typical image in the FFpp test set after applying various image processing operations.Some notations are explained as follows.DL-Comp: Deep learning-based compression.GB: Gaussian blur.GN: Gaussian noise.Po-Gau-Noise: Poissonian-Gaussian noise.GammaCorr: Gamma correction.Resize: Reduce resolution.+ : Combination of two operations

Fig. 2
Fig. 2 Example of a typical video frame in the FFpp test set after applying different video processing operations.Some notations are explained as follows.C23 and C40: Video compression using H.264 codec with factors of 23 and 40.Light and Dark: Increase and decrease brightness.Resolution: Reduce video resolution.Hflip and Vflip: Horizontal and Vertical flip

Fig. 3
Fig. 3 Workflow of the proposed assessment framework. Distortions caused by processing operations are first applied to test data separately. The corresponding predictions by the deepfake detector are compared with the ground-truth label ("real or fake")

Fig. 4
Fig. 4 Assessment results of two models trained on the FFpp dataset. The suffixes of the legends refer to the qualities of the training data. Full means using all available quality data for training

Fig. 5
Fig. 5 Detection performance on data compressed by conventional and AI-based coding algorithms

Fig. 6, Fig. 7
Fig. 6 Performance comparison between models trained on FFpp-Raw only and trained with the proposed augmentation method

Fig. 8
Fig. 8 Cross-manipulation experiments on the FaceForensics++ (Raw) dataset, with XceptionNet trained separately on each of four different types of manipulated data, namely DeepFakes, Face2Face, FaceSwap, and NeuralTextures. AUC (%) scores are compared between the XceptionNet model trained with and without the SDAug technique
Lu and Ebrahimi EURASIP Journal on Image and Video Processing (2024) 2024:6

Table 2
AUC (%) scores of the Capsule-Forensics (denoted as CapsuleNet), XceptionNet, and UIA-VIT methods tested on unaltered and distorted variants of the FFpp and Celeb-DFv2 test sets, respectively. Raw, C23, and Full refer to different quality settings of FFpp. DL-Comp refers to deep learning-based compression [60] and 'High' refers to the high-quality compressed image

Table 3
AUC (%) scores of four selected deepfake detection methods on the distorted variants of the FFpp test set that are subject to different video processing operations. The notations C23 and C40 here refer to the two different compression rates using the AVC/H.264 codec. The notation Resolution refers to reducing video resolution by a specific scale

Table 4
AUC (%) scores of the Capsule-Forensics (denoted as CapsuleNet) and XceptionNet methods tested on unaltered and distorted variants of FFpp. The suffix +DAug denotes that the model is trained with the proposed augmentation chain but without randomness. The suffix +SDAug denotes that the model is trained with the stochastic degradation-based augmentation technique. In this table, bold font denotes the highest score

Table 5
AUC (%) scores of three selected deepfake detection methods trained with the SDAug augmentation method on the distorted variants of the FFpp test set