Multi-attention-based approach for deepfake face and expression swap detection and localization

Advancements in facial manipulation technology have resulted in highly realistic and indistinguishable face and expression swap videos. However, this has also raised concerns regarding the security risks associated with deepfakes. In the field of multimedia forensics, the detection and precise localization of image forgery have become essential tasks. Current deepfake detectors perform well with high-quality faces within specific datasets, but often struggle to maintain their performance when evaluated across different datasets. To this end, we propose an attention-based multi-task approach to improve feature maps for classification and localization tasks. The encoder and the attention-based decoder of our network generate localized maps that highlight regions with information about the type of manipulation. These localized features are shared with the classification network, improving its performance. Instead of using encoded spatial features, attention-based localized features from the decoder's first layer are combined with frequency domain features to create a discriminative representation for deepfake detection. Through extensive experiments on face and expression swap datasets, we demonstrate that our method achieves competitive performance in comparison to state-of-the-art deepfake detection approaches in both in-dataset and cross-dataset scenarios. Code is available at https://github.com/saimawaseem/Multi-Attention-Based-Approach-for-Deepfake-Face-and-Expression-Swap-Detection-and-Localization.


Introduction
Deepfake techniques have recently achieved significant success due to advances in generative models [1][2][3][4][5]. These techniques allow individuals to manipulate facial features within an image, resulting in the creation of forged faces. Current approaches can generate high-quality fake content that appears indistinguishable from real media to the human eye. Deepfakes have been exploited in numerous instances, particularly in politics and pornography [6,7]. This misinformation has caused people to worry about fraud and credibility issues in society. Face (identity) and expression swap are two well-known forms of deepfake face manipulation. Expression swap or re-enactment techniques enable the transfer of expressions from one person to another while keeping the original subject's identity unchanged. In contrast, identity or face swap involves replacing the face of one person with the face of another individual [8]. A well-designed facial expression can effectively convince others to agree with someone's perspective without any verbal communication, and with a deepfake face swap, it becomes possible to portray an individual's physical presence in a particular location where they were not actually present. To effectively combat these deepfakes, the development of robust and reliable face forgery forensics is important to ensure the integrity and ethical standards of multimedia content.
Existing deepfake detection techniques typically frame deepfake detection as a binary classification task. These approaches heavily rely on deep neural networks (DNN) [9][10][11][12][13][14][15][16][17]. Nevertheless, some researchers have explored alternative techniques [18][19][20][21] utilizing hand-crafted features for deepfake detection. However, with the rapid development of deepfake synthesis techniques [4,22,23], the performance of hand-crafted approaches is not satisfactory [8]. A common approach among DNN methods involves extracting video frames and using a convolutional neural network (CNN) with a fully connected layer for classification. However, these methods overlook correlations between distant positions by focusing on information within each receptive field. As a result, they rely on superficial correlations to differentiate between real and manipulated images. Because the training and test splits are drawn independently from the same distribution, these simplified patterns may happen to be effective on held-out test sets, which makes such models susceptible to overfitting. Consequently, their effectiveness is limited to the manipulation methods they were explicitly trained on, and these approaches exhibit a significant performance decline when detecting unseen face manipulations. To address this limitation, recent deepfake detection algorithms have incorporated the attention mechanism into CNNs [24] to enhance both within-dataset and cross-dataset performance by expanding the areas of local image features.
Different manipulation methods, such as face swap and expression swap, have unique characteristics and patterns of manipulation, as shown in Fig. 1. These variations in forgery patterns pose a challenge in maintaining similarity among the manipulation methods, which may result in overfitting and a decrease in overall performance [25]. Recent deepfake generation techniques, like GANs, often employ encoder-decoder architectures in their generators. The decoder incorporates an upsampling design to enlarge the feature maps generated by the encoder, resulting in a color image. However, this upsampling process hinders GAN models from accurately reproducing the spectral distributions of real training data [13,26]. Consequently, fake images exhibit distinct artifacts in their frequency spectrum, which can be exploited to differentiate them from real images [13]. These frequency-related artifacts are commonly observed in various deepfake manipulations, especially in scenarios that involve compression where spatial information is significantly degraded [27].
We hypothesize that by appropriately assessing an image's spatial and spectral information, the network can effectively focus on critical regions for decision-making. Here, we propose an attention-based multi-task learning technique that effectively integrates spatial and spectral information to classify facial images as real or fake, while simultaneously localizing modified regions within the face, specifically in the deepfake facial manipulation subcategories of face swap and expression swap (face re-enactment), depicted in Fig. 2. Accurate localization of manipulated regions is vital in multimedia forensics for a comprehensive understanding of deepfake forgeries, as high-resolution localization maps provide valuable insights into the specific type of manipulation employed. To address this, we introduce a simple attention-based learning technique to localize potential areas of manipulation. Explicitly localizing these manipulated regions through an attentional mechanism provides two benefits: first, it suppresses irrelevant information and directs the network's attention to manipulated areas, thereby avoiding disruptions; second, it improves the network's understanding of modified regions.
Our experimental results demonstrate that the proposed attention-based manipulation localization and detection technique significantly improves performance in within-dataset and cross-dataset evaluations. Experimental results on popular deepfake datasets, such as FaceForensics++ [28], CelebDF [8], and DFDC-P [29], demonstrate the competitive performance of our approach compared to state-of-the-art methods. Our contributions can be summarized as follows:
• We present an image feature learning scheme at local and global levels using a dual attention mechanism (spatial and channel) that jointly integrates convolutional encoder and decoder features to localize pixel-level image forgeries.
• Our proposed model demonstrates robustness in both cross-dataset and within-dataset evaluations by effectively combining frequency and localized spatial features.
Fig. 2 Overview of deepfake detection problem

Related work
This section provides a concise overview of prior research relevant to detecting and localizing deepfakes.

Manipulation detection
One common deepfake detection approach is to treat a video as a sequence of still images and perform operations on them. Various techniques have been explored, such as capturing unique low-level camera features to detect fake faces [30], estimating inconsistencies in head pose [19], and utilizing flaws in eye-blinking patterns and other facial features for deepfake classification [18,31]. However, these methods are not effective in detecting advanced deepfake manipulation techniques. Several deep neural network-based solutions have been developed to differentiate between real and fake faces. These include MesoNet [12], Capsule Network [10], XceptionNet [32], EfficientNet [33], F3Net [16], and GocNet [17]. Various features, such as spatial, steganographic, and temporal features [14,15,34,35], as well as frequency-dependent cues [36], the multiscale Laplacian of Gaussian (LoG) operator [11], and motion features with a fine-grained weighting of inter-class distances [37], have been investigated for deepfake detection. Despite these efforts, challenges persist in detecting realistic deepfakes. Sun et al. [38] introduced the Dual Contrastive Learning (DCL) approach to analyze real and fake paired data for deepfake detection. Multi-attention Deepfake Detection (MaDD) [24] presented a framework that captures artifacts using multiple attention maps. However, it lacks strong supervision and struggles to identify minor forgery traces in quality-degraded videos. Wodajo et al. [39] combined vision transformers with CNNs (CViT) to capture local and global features from face images, but at the cost of increased computational complexity due to a high number of parameters. Hua et al. [40] proposed an interpretable model for fake face detection by establishing a patch-channel correspondence that provides evidence for fake face detection. However, this approach faces limitations in quantifying the degree of interpretability and optimizing the patch-channel correspondence because of strong channel correlation and computational complexity.

Forgery localization
In addition to classification, certain techniques are specifically designed to focus on localizing the manipulated regions. Nguyen et al. [41] utilized a multi-task learning strategy with a Y-shaped architecture to simultaneously locate modified video regions and detect manipulation. Li et al. [42] presented an X-ray approach for faces to detect boundaries around manipulated face regions. However, this method relies on external training data and has lower performance when image quality varies, such as under compression or blurring, which can affect the detection of boundary traces. Liu et al. [43] introduced an automated machine-learning approach for deepfake detection and localization, reducing the need for manual network design. Dang et al. [44] proposed supervised and weakly supervised strategies for estimating image-specific attention maps to localize manipulated regions in face images. However, this approach is sensitive to compression. Therefore, it is crucial to prioritize robust localization of the manipulated regions to address the impact of compression. Our goal is to achieve consistent forgery localization even when image quality is compromised at different levels. Using pixel-level localization, our approach aims to improve the generalization performance of deepfake detection. Unlike previous methods focusing on spatial features, our approach simultaneously learns spatial features and frequency-related patterns. By effectively incorporating spatial and spectral information, we developed a multi-task learning approach to classify facial images as real or fake and to localize modified regions in the face. For manipulation localization, we introduce an attention-based encoder-decoder architecture that integrates semantic information extraction. Our approach uses an attention-based U-Net architecture with frequency features in the detection stream, resulting in improved classification performance. To the best of our knowledge, this is the first study to use U-Net with a spatial and channel-specific attention mechanism for detecting and localizing face manipulations.

Proposed method
In contrast to single-objective approaches, our method utilizes attention-based localization and classification networks to generate the probability of an input image being forged or real and simultaneously provide localized maps highlighting manipulated regions within each input video frame, as shown in Fig. 3. The proposed model operates on a tuple dataset denoted as $\mathcal{H} = \{(A_i, B_i, D_i, y_i)\}_{i=1}^{N}$, where $A_i \in \mathbb{R}^{H \times W \times 3}$ represents a 2D image of a face and $B_i \in \mathbb{R}^{H \times W \times 1}$ corresponds to the reference input mask for each fake face type, which includes the face swap mask that covers the entire face area and the expression swap mask, representing the facial structure, as shown in Fig. 6. $D_i$ is the spectrum coefficient in the frequency domain, and $y_i \in \{0, 1\}$ serves as a label indicating whether the input $A_i$ has been manipulated or not. The subscript $i$ indexes data points from the face and frequency spectrum dataset. Our main objective is to train a model to determine whether a test image has been manipulated and, if so, to what extent. Localizing the modifications in facial images requires focusing on the specific regions affected by each type of manipulation. Thus, for Facial Manipulation Localization (FML), we use a dataset $\{(A_i, B_i)\}_{i=1}^{N}$, while for Facial Manipulation Detection (FMD), we utilize a dataset of tuples $\{(D_i, y_i)\}_{i=1}^{N}$. To classify manipulated images by combining attention-based spatial and spectral information from two network streams, we replace Global Average Pooling with Bilinear Pooling (BP).
Fig. 3 Illustration of the pipeline used in the proposed method for detecting and localizing deepfake facial manipulation. Face image and its spectral (FFT) representation are used as input

Facial manipulation localization (FML)
We propose a Residual U-Net with a spatial channel attention block (scAB) to focus on the face regions for deepfake localization during the learning process. In our approach, the encoder directly receives the input image, and during the decoding phase, both the encoder features and the features from the preceding decoder layer are processed through the scAB to generate decoder features, as illustrated in Fig. 3. At each skip connection of the Residual U-Net, we employ the scAB to dynamically learn the location and semantic information.
An image $A_i$ is passed through the network to produce encoder features $f_e \in \mathbb{R}^{C \times H \times W}$ and decoder features $f_d \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ correspond to the feature map's channel count, height, and width, respectively. The scAB generates a spatial attention map $\mathrm{SAM} \in \mathbb{R}^{1 \times H \times W}$ and a channel attention map $\mathrm{CAM} \in \mathbb{R}^{C \times 1 \times 1}$ using the encoder features $f_e$ and decoder features $f_d$, as depicted in Fig. 4. The spatial attention block (sAB) directs the model's attention toward relevant deep spatial structures. On the other hand, the channel attention block (cAB) acts as a bridge, closing the semantic gap between the encoder and decoder features by incorporating extra contextual information into the lower-level encoding features, thereby enhancing the overall understanding of the data.
$\mathrm{SAM}(f_e, f_d)$ is computed by applying Average Pooling and Max Pooling along the channel dimension of $f_e$ and $f_d$. The resulting maps are summed and passed through a sigmoid function. The convolution operation is denoted $f_1^{v \times v}$, with a filter of size 7 × 7, and the sigmoid function is denoted $\sigma$. The low-level encoder features possess valuable spatial details but lack semantic information. Combining low-level encoder features with high-level decoder features without considering semantic differences can adversely affect the localization results. To improve fusion effectiveness, we integrate semantic concepts into low-level features using convolutional feature inter-dependencies. This is accomplished through the CAM technique, which facilitates meaningful feature discrimination [45].
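The spatial attention step described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the text specifies channel-wise average and max pooling, a 7 × 7 convolution, a summation, and a sigmoid, but not their exact order, so the sketch assumes each stream's two pooled maps are stacked, convolved to a single channel, summed across streams, and then squashed. The helper names (`conv2d_same`, `channel_pool`, `spatial_attention`) and the random kernels are hypothetical.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Cross-correlate a (C, H, W) array with a (C, k, k) kernel.

    Zero padding keeps the spatial size; the result is a single (H, W) map.
    """
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    # windows has shape (C, H, W, k, k)
    windows = np.lib.stride_tricks.sliding_window_view(xp, (k, k), axis=(1, 2))
    return np.einsum("chwij,cij->hw", windows, kernel)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_pool(f):
    """Stack channel-wise average and max maps: (C, H, W) -> (2, H, W)."""
    return np.stack([f.mean(axis=0), f.max(axis=0)])

def spatial_attention(f_e, f_d, k_e, k_d):
    """Assumed form: sigmoid(conv(pool(f_e)) + conv(pool(f_d))) -> (1, H, W)."""
    s_e = conv2d_same(channel_pool(f_e), k_e)
    s_d = conv2d_same(channel_pool(f_d), k_d)
    return sigmoid(s_e + s_d)[None]

rng = np.random.default_rng(0)
f_e = rng.standard_normal((8, 16, 16))      # toy encoder features
f_d = rng.standard_normal((8, 16, 16))      # toy decoder features
k_e = rng.standard_normal((2, 7, 7)) * 0.1  # hypothetical 7x7 kernels
k_d = rng.standard_normal((2, 7, 7)) * 0.1
sam = spatial_attention(f_e, f_d, k_e, k_d)
print(sam.shape)  # (1, 16, 16)
```

The output has one channel and the same spatial size as the inputs, so it can be broadcast against any C-channel feature map.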
To calculate $\mathrm{CAM}(f_e, f_d)$, we employ Average Pooling and Max Pooling to reduce spatial information in both encoder and decoder features, as inspired by [46]. Subsequently, these compressed features are passed through $N$ Dense layers with $u$ units, where $u$ varies for each Dense layer. The Dense layers are responsible for detecting channel dependencies and producing squeezed channel attention maps. The individual output attention maps $\mathrm{CAM}_e(f_e)$ and $\mathrm{CAM}_d(f_d)$ are then combined using element-wise summation. The resulting sum undergoes $C_1$ convolutions, followed by a sigmoid function, to obtain the final CAM representation. In summary, the computation of $\mathrm{CAM}(f_e, f_d)$ is as follows: $\mathrm{CAM}(f_e, f_d) = \sigma\big(C_1(\mathrm{CAM}_e(f_e) + \mathrm{CAM}_d(f_d))\big)$. The input encoder layer features are enhanced by multiplying them with the scAB output, effectively incorporating the benefits of both SAM and CAM. This process is depicted in Fig. 4.
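The channel attention computation can be sketched as follows. This is an illustrative reading of the description rather than the published code: the ReLU between Dense layers, the sharing of one Dense stack between the average-pooled and max-pooled vectors of a stream, and the modelling of the final convolution as a C × C mixing matrix (which is what a 1 × 1 convolution reduces to on a C × 1 × 1 tensor) are all assumptions, as are the function names and layer sizes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze(f):
    """Global average and max pooling over space: (C, H, W) -> two (C,) vectors."""
    return f.mean(axis=(1, 2)), f.max(axis=(1, 2))

def dense_stack(v, weights):
    """Apply Dense layers in sequence, with ReLU between layers (assumed)."""
    for i, w in enumerate(weights):
        v = w @ v
        if i < len(weights) - 1:
            v = np.maximum(v, 0.0)
    return v

def stream_attention(f, weights):
    """Per-stream map, i.e. CAM_e(f_e) or CAM_d(f_d), as a (C,) vector."""
    avg, mx = squeeze(f)
    return dense_stack(avg, weights) + dense_stack(mx, weights)

def channel_attention(f_e, f_d, w_e, w_d, w_mix):
    """sigmoid(mix(CAM_e(f_e) + CAM_d(f_d))) reshaped to (C, 1, 1)."""
    summed = stream_attention(f_e, w_e) + stream_attention(f_d, w_d)
    return sigmoid(w_mix @ summed)[:, None, None]

rng = np.random.default_rng(0)
C, u = 8, 4  # hypothetical channel count and hidden width
f_e = rng.standard_normal((C, 16, 16))
f_d = rng.standard_normal((C, 16, 16))
w_e = [0.1 * rng.standard_normal((u, C)), 0.1 * rng.standard_normal((C, u))]
w_d = [0.1 * rng.standard_normal((u, C)), 0.1 * rng.standard_normal((C, u))]
w_mix = 0.1 * rng.standard_normal((C, C))
cam = channel_attention(f_e, f_d, w_e, w_d, w_mix)
print(cam.shape)  # (8, 1, 1)
```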
The element-wise multiplication operation ⊗ is applied to preserve the spatial and channel dimensions of the input feature map in both SAM and CAM. The refined features $F_r$ and the decoder features $f_d$ are concatenated and passed to the convolutional layer to build the decoder features for the next layer.
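The refinement and concatenation step can be illustrated with NumPy broadcasting. The sketch assumes the refined features are the encoder features multiplied by both attention maps; the exact composition of SAM and CAM inside the scAB is not recoverable from the text, and the function name is hypothetical.

```python
import numpy as np

def refine_and_concat(f_e, f_d, sam, cam):
    """Scale encoder features by both attention maps, then stack with f_d.

    f_e, f_d: (C, H, W); sam: (1, H, W); cam: (C, 1, 1).
    Broadcasting keeps the (C, H, W) shape of f_e.
    """
    f_r = f_e * cam * sam
    fused = np.concatenate([f_r, f_d], axis=0)  # input to the next conv layer
    return f_r, fused

f_e = np.ones((4, 5, 5))
f_d = np.zeros((4, 5, 5))
sam = np.full((1, 5, 5), 0.5)
cam = np.full((4, 1, 1), 0.5)
f_r, fused = refine_and_concat(f_e, f_d, sam, cam)
print(f_r.shape, fused.shape)  # (4, 5, 5) (8, 5, 5)
```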

Facial manipulation detection (FMD)
To improve image quality, common upsampling techniques are employed in autoencoders [47] or GANs [26]. These techniques increase the pixel dimensions vertically and horizontally by a factor of $m$, utilizing the low-resolution encoded image as input. By leveraging the properties of the Discrete Fourier Transform (DFT), Odena et al. [48] discovered that inserting zeros into a low-resolution image is equivalent to overlaying multiple spectra of the low-resolution image onto the high-frequency region of the resulting high-resolution image. This discrepancy causes the frequency spectrum of deepfake images to deviate from that of real images, making them distinguishable [13]. To extract forgery features in the frequency domain, we apply a two-dimensional Fast Fourier Transform (2D FFT) to the input image $A_i$, resulting in the spectrum representation $D_i$. The backbone network generates a convolutional feature map $f_f \in \mathbb{R}^{H \times W \times C}$ using $D_i$ as input. To direct the network's attention towards discriminative regions for classification, $f_f$ is processed through an attention block, as illustrated in Fig. 4.
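The spectral replication caused by zero-insertion upsampling, and the computation of a spectrum input, can be checked numerically. The sketch below is illustrative only: the grayscale input, the `fftshift` centering, and the log scaling are common conventions assumed here, not details given in the text, and the function names are hypothetical.

```python
import numpy as np

def log_spectrum(gray):
    """Centered log-magnitude 2D FFT spectrum of an (H, W) image."""
    spec = np.fft.fftshift(np.fft.fft2(gray))
    return np.log1p(np.abs(spec))

def zero_insert_upsample(img, m=2):
    """Upsample by a factor m by placing zeros between the original samples."""
    h, w = img.shape
    up = np.zeros((h * m, w * m), dtype=img.dtype)
    up[::m, ::m] = img
    return up

rng = np.random.default_rng(0)
low = rng.standard_normal((8, 8))        # stand-in for a low-resolution image
high = zero_insert_upsample(low, m=2)

# The DFT of the zero-inserted image is the low-resolution spectrum tiled m x m,
# so copies of the spectrum appear in the high-frequency region.
replicated = np.allclose(np.fft.fft2(high), np.tile(np.fft.fft2(low), (2, 2)))
print(replicated)  # True

d_i = log_spectrum(high)                 # spectrum representation fed to a backbone
print(d_i.shape)   # (16, 16)
```

Real upsampling layers follow the zero insertion with interpolation or a learned convolution, which attenuates but does not fully remove these replicas.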
The output of the attention block is element-wise multiplied with the $f_f$ features, resulting in the refined feature map $F_{\text{refined}_f}$.
Once the frequency feature map $F_{\text{refined}_f}$ is obtained, these features are combined with the spatial features $F_{d1}$ from the first decoder layer of the U-Net. $F_{d1}$ contains manipulation-aware features, in contrast to the spatial features from the encoder's last layer. We utilize Bilinear Pooling (BP) to capture a comprehensive representation of these features. Bilinear Pooling merges features of different dimensions and offers improved expressiveness compared to concatenation or element-wise product-based methods. Bilinear Pooling is computationally efficient and competitive with the best feature fusion strategies [49]. In the BP block, features from $F_{d1}$ and $F_{\text{refined}_f}$ are fused to compute the class probability.
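A minimal sketch of bilinear pooling over two feature maps is given below. It follows the classic formulation (spatially averaged outer product, signed square root, L2 normalization); whether the BP block here uses this exact variant or a compact or factorized one is not stated, so the normalization steps, the requirement that both maps share the same spatial size, and the function name are assumptions.

```python
import numpy as np

def bilinear_pool(f_a, f_b, eps=1e-12):
    """Fuse (C_a, H, W) and (C_b, H, W) features into a (C_a * C_b,) vector."""
    c_a, h, w = f_a.shape
    c_b = f_b.shape[0]
    a = f_a.reshape(c_a, h * w)
    b = f_b.reshape(c_b, h * w)
    outer = (a @ b.T) / (h * w)           # average of per-location outer products
    z = outer.reshape(-1)
    z = np.sign(z) * np.sqrt(np.abs(z))   # signed square root
    return z / (np.linalg.norm(z) + eps)  # L2 normalization

rng = np.random.default_rng(0)
f_d1 = rng.standard_normal((6, 7, 7))    # toy localized spatial features
f_freq = rng.standard_normal((4, 7, 7))  # toy refined frequency features
z = bilinear_pool(f_d1, f_freq)
print(z.shape)  # (24,)
```

The fused vector would then feed a classifier head; the two inputs may have different channel counts, which is why the pooled vector has length C_a × C_b.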

Loss function
Four different loss functions, L1, L2, dice loss, and focal loss, have been evaluated for the manipulation localization network. L1 and L2 losses are commonly used in regression tasks, while dice loss and focal loss are typically utilized in classical segmentation tasks. Our findings show that L2 and L1 losses outperformed the segmentation losses, suggesting that the regression losses are more suitable for localized maps. Table 4 compares results using different loss functions. In addition, we combined the U-Net with the classification network for training and employed binary cross-entropy loss. The overall loss is the weighted sum of the two losses, i.e., the localization and classification losses: $\mathcal{L} = \rho_{\text{class}} \mathcal{L}_{\text{class}} + \rho_{\text{localize}} \mathcal{L}_{\text{localize}}$. The two weights ($\rho_{\text{class}}$, $\rho_{\text{localize}}$) are set to 1 because the classification and localization tasks are equally important.
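The weighted multi-task objective can be sketched as follows. The choice of L2 for the localization term is an assumption made for illustration (the text reports that both L1 and L2 worked well), and the function names are hypothetical.

```python
import numpy as np

def l2_loss(pred_mask, true_mask):
    """Mean squared error between predicted and reference masks."""
    return float(np.mean((pred_mask - true_mask) ** 2))

def bce_loss(p, y, eps=1e-7):
    """Binary cross-entropy between predicted probabilities and 0/1 labels."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

def total_loss(pred_mask, true_mask, p, y, rho_class=1.0, rho_localize=1.0):
    """Weighted sum of classification and localization losses."""
    return rho_class * bce_loss(p, y) + rho_localize * l2_loss(pred_mask, true_mask)

mask = np.zeros((4, 4))
loss = total_loss(mask, mask, p=np.array([0.5]), y=np.array([1.0]))
print(round(loss, 4))  # 0.6931, i.e. ln 2 from the classification term alone
```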

Implementation details and evaluation settings
For all real/fake video frames, we employ MTCNN [50] to detect and crop the face region, saving the aligned facial images as inputs with a size of 224 × 224. ResNet [51] is used as a backbone network to extract spatial and frequency features. The model is trained using the Adam optimizer [52] with an initial learning rate (LR) of 1e-4, β1 = 0.9, β2 = 0.999, and ε = 1e-08. After 30 epochs, if the network does not improve, the learning rate drops to LR × 0.1. We train our models on NVIDIA GeForce RTX-3060 Ti GPUs with batch size 16. We used various data augmentation techniques to prevent overfitting and encourage the model to learn identity-independent features rather than solely focusing on face recognition. These techniques include flipping, rotating, contrast change, adding Gaussian noise, and compression to simulate diverse scenarios. To evaluate the efficacy of the proposed method, we use well-established metrics for deepfake detection. These metrics include Accuracy (Acc), which is used for assessing the performance of the model within the FaceForensics++ [28] dataset, as in [12,51,53,54,55]. Additionally, we employ the area under the ROC curve (AUC) for evaluating the model's performance across the CelebDF [8], DFD [56], and DFDC-P [29] datasets, as in previous research [10,12,16,17,28,33,37,41,57]. Finally, we also use Mean Intersection over Union (mIoU) to evaluate localization. To ensure fair comparisons with other techniques, we calculate average metric scores over all frames within a video.
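The frame-to-video averaging and the IoU computation can be sketched as below. This is a hedged illustration: the 0.5 threshold, the treatment of two empty masks as a perfect match, and the definition of mIoU as the mean of per-image IoU values are assumptions, since the text does not spell them out.

```python
import numpy as np

def video_score(frame_probs):
    """Video-level score as the mean of per-frame fake probabilities."""
    return float(np.mean(frame_probs))

def iou(pred, target, threshold=0.5):
    """Intersection over union of two masks after thresholding."""
    p = np.asarray(pred) >= threshold
    t = np.asarray(target) >= threshold
    union = np.logical_or(p, t).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match (assumed)
    return float(np.logical_and(p, t).sum() / union)

def mean_iou(preds, targets):
    """Mean of per-image IoU values."""
    return float(np.mean([iou(p, t) for p, t in zip(preds, targets)]))

pred = np.array([[1.0, 1.0], [0.0, 0.0]])
target = np.array([[1.0, 0.0], [0.0, 0.0]])
print(iou(pred, target))                         # 0.5
print(round(video_score([0.2, 0.4, 0.9]), 6))    # 0.5
```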
1. FaceForensics++ (FF++) [28]: The dataset consists of 1000 original YouTube videos and 4000 fake videos generated using four manipulation algorithms: Deepfake (DF), FaceSwap (FS), NeuralTextures (NT), and Face2Face (F2F). To ensure a balanced representation of real and fake data, 30 frames were selected from each fake video and 120 frames from each original video. Two different qualities of the dataset were used for training and testing: high quality (HQ) with a moderate compression ratio of 23 (C-23) and low quality (LQ) with a higher compression ratio of 40 (C-40). Higher compression results in lower video quality. The FF++ dataset now includes FaceShifter (FSH) face swapping videos, consisting of 10,000 fake videos created by manipulating the FF++ real videos.
2. Deepfake Detection Challenge-Preview (DFDC-P) [29]: The dataset includes 4113 face swap deepfake videos alongside 1131 original videos.
3. Celeb-DF [8]: This dataset contains 590 original videos and 5,639 fake face swap videos.
4. Deepfake Detection Dataset (DFD) [56]: Google and Jigsaw contributed to the dataset, which includes 363 real videos and over 3600 face swapped deepfake videos.
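The frame sampling for FF++ can be checked with simple arithmetic, using only the video and frame counts stated above (variable names are illustrative):

```python
fake_videos, frames_per_fake = 4000, 30
real_videos, frames_per_real = 1000, 120

fake_frames = fake_videos * frames_per_fake
real_frames = real_videos * frames_per_real
print(fake_frames, real_frames)  # 120000 120000
```

Sampling four times as many frames from each original video offsets the 4:1 ratio of fake to real videos, giving equal frame counts per class.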

Forgery detection results
We performed both inner-dataset and cross-dataset evaluations for the proposed approach. For the inner-dataset evaluation, the training and testing sets were sourced from the same dataset. In contrast, the cross-dataset evaluation involved training and testing on different datasets.

Inner-dataset evaluation
This section presents a comparison between our method and established state-of-the-art techniques using an inner-dataset evaluation. Extensive research has been conducted on the task of deepfake detection [58]. Only methods trained on FF++ HQ (C-23) and tested on FF++ LQ (C-40) were considered for this comparison. Frame-level results on the FF++ dataset are reported for fair comparisons. Table 2 summarizes the accuracy results of different state-of-the-art detectors. The reported results for [12,51,53,54,55] are directly cited from [12] and [28]. Our proposed method achieves comparable or superior performance compared to the current state-of-the-art approaches for low compression. Specifically, our approach demonstrates improved performance on Deepfake (DF) and NeuralTextures (NT) manipulations in both high-quality (C-23) and low-quality (C-40) videos. Our proposed solution shows slightly lower accuracy for F2F and FS manipulations. Our method also outperformed the ADD [43] and Multi-Task [41] approaches, which likewise localize manipulated regions. This shows that improved results are possible even for compressed video by combining frequency and spatial domain information, suggesting that frequency spectrum features are resilient to compression. Highly compressed videos often exhibit poor quality, leading to the weakening of several frequency components. The attention block at both the spatial and frequency backbones helped to prioritize features with higher classification importance, which improved performance. We evaluated the trained model's performance on the FaceShifter dataset and obtained an accuracy of 95.88% for C-23 and 89.88% for C-40 compressed videos. Figure 5 shows the ROC results for the inner-dataset evaluation on the FF++ dataset.

Cross-dataset evaluation
Generalization ability is a key indicator of algorithm superiority and is often assessed through cross-dataset evaluation. This setting is also the more practical one, because it is often difficult to determine which manipulation approach was used for the test data.
This section focuses on the framework's adaptability to datasets unseen during training, highlighting its transferability through cross-dataset evaluation. To evaluate the proposed method's transferability and enable fair comparisons, we trained it on FF++ with multiple manipulations and conducted tests on CelebDF [8] and DFDC-P [29]. Table 3 provides a comparison of AUC values with state-of-the-art face forgery detection methods.
Our method outperformed the most recent approaches in terms of AUC on the DFDC-P dataset and also performed well on Celeb-DF. Overall, CNN-based approaches [10,12,16,17,28,33,37,41,57] predominantly emphasize local features within facial images and lack global information. As a result, these methods exhibit limited transferability when cross-evaluated on the DFDC-P and Celeb-DF datasets. Moreover, the CViT approach [39], which employs a convolutional vision Transformer (ViT), shows lower performance in the inner-dataset evaluation on FF++ than other state-of-the-art techniques [11,16,17,24,33,37,38,43]. On the other hand, the recent state-of-the-art model MaDD [24] demonstrates relatively competitive performance in both within-dataset and cross-dataset evaluations compared to earlier approaches. In particular, our approach achieved a 4% higher AUC than the highest reported approach, MesoNet [12], on the DFDC-P dataset. This improvement can be attributed to our method's emphasis on both the frequency and spatial components of the input image. While the two-branch approach [11] showed superior transferability on the CelebDF dataset, our method outperformed it on the DFDC-P dataset.

Manipulation localization results
The dual attention block scAB combines encoder and previous-layer decoder features at each skip connection of the ResU-Net. This integration allows the model to learn discriminative image features for manipulation localization while disregarding irrelevant pixels, as shown in Fig. 6. In training the proposed model, inverted FF++ ground truth masks are employed as input for fake and real faces. The inversion represents the manipulated area with black pixels, while the real, unmanipulated area is depicted with white pixels, as shown in Fig. 6. This inversion enhances the visualization of the model's predictions, as it distinctly highlights the regions where the face is manipulated.
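The mask inversion can be illustrated as follows. The sketch assumes 8-bit masks in which the original ground truth marks manipulated pixels as white (255), and that a real face receives an all-white target after inversion; both are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def invert_mask(mask):
    """Invert a uint8 mask: manipulated pixels (255) become black (0),
    untouched pixels (0) become white (255)."""
    return 255 - mask

def real_face_target(h, w):
    """Target for an unmanipulated face: an all-white mask (assumed)."""
    return np.full((h, w), 255, dtype=np.uint8)

gt = np.zeros((4, 4), dtype=np.uint8)
gt[1:3, 1:3] = 255                  # toy manipulated region
inv = invert_mask(gt)
print(inv[1, 1], inv[0, 0])         # 0 255
```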
As depicted in Fig. 1, facial expression modification usually occurs in regions such as the eyes, lips, and eyebrows. In deepfake face swap, all facial attributes except hair and ears are replaced. Figure 6 displays localized maps for original and fake images, demonstrating the model's effective learning of fake facial regions and accurate localization of potential manipulation pixels in each sample. The network maintains exceptional localization capabilities across all layers, particularly in accurately localizing the mouth, eyebrows, and eyes for expression changes. In face swap scenarios, the network effectively localizes the facial region that has been transferred to the target image. In evaluating the face manipulation localization network, we examined four different loss functions using both original and fake images from the FF++ dataset for training and testing. The accuracy results, presented in Table 4, demonstrate that regression losses (L1 and L2) outperformed traditional segmentation losses in accurately localizing real and fake faces.
Next, we conduct a comparative analysis between our model and other approaches that utilize multi-task learning to enhance generalization capabilities. These approaches include LAE [62], Multi-Task [41], and ADD [43], all of which simultaneously perform forgery localization and classification. Following the same experimental setup as these methods, we train our model on the F2F (HQ) dataset and evaluate its effectiveness on both the F2F (HQ) and FS (HQ) datasets to measure its cross-dataset performance. The reported statistics for the competing methods can be found in the respective papers. As depicted in Table 5, our proposed method demonstrates better performance than the approaches [41,43,62] in the cross-dataset evaluation.
To evaluate the impact of augmentation on unseen test sets, we trained our model on the FF++ (HQ) dataset (comprising DF, FSH, and real videos) with and without data augmentation. Subsequently, we assessed the model's performance on the DFDC-P dataset. To showcase the effectiveness comprehensively, we conducted tests on both the DFDC-P (real) and DFDC-P (fake) datasets, as presented in Table 6. The table showcases the effectiveness of augmentation in improving cross-data performance. The first row, labeled "w-aug", corresponds to the training approach with augmentation, while the second row, labeled "w/o-aug", refers to training without data augmentation. Comparing the within-dataset results in Table 2 with the cross-dataset results in Table 3 reveals noticeable performance variations on unseen datasets. This substantiates the challenges that the distribution gap between seen and unseen datasets poses for generalization accuracy. Our future study will focus on exploring additional features, such as background context or voice, to examine if they can contribute to further reducing the generalization gap. In particular, we will investigate the potential of incorporating a limited amount of data from unseen datasets for fine-tuning the model.
Table 2 Quantitative results in terms of ACC (%) on the FF++ [28] dataset for four different manipulation methods: Deepfake (DF), Face2Face (F2F), FaceSwap (FS), and NeuralTextures (NT). "LQ" indicates low image quality, "HQ" indicates high image quality, "RGB" represents color images, and "FREQ" indicates frequency input. The best results are highlighted in bold font, while "-" indicates unavailable results
Table 3 Comparison of AUC (%) for cross-dataset evaluation on CelebDF [8] and DFDC-P [29], including results of other methods cited from [11,24,37,38,60,61]. Bold values indicate the best performance on the specific dataset in each column
Fig. 6 First and second row from each deepfake manipulation type (DF, FS, FSH, NT, F2F) show the original images and manipulated ones, respectively. Third row shows the ground truth masks, while the bottom row represents the predicted mask from our proposed approach

Ablation study
We independently performed ablation experiments for each localization and detection branch with different network configurations to assess the effectiveness of each component in the proposed approach.

Deepfake manipulation detection network
We quantitatively assessed the contribution of each component of the detection model to the detection performance of the proposed network. We compared the output from (a) the detection branch trained with only frequency domain features, (b) the detection branch using features from the frequency and spatial domains with no ScAB block, and (c) a combination of both frequency and spatial domains with the ScAB block. For this comparison, all models are trained on FF++ (HQ) and evaluated on FF++ (LQ) and DFD datasets.
We find that features from the frequency domain alone do not yield satisfactory results, in contrast to combining information from the spatial and frequency domains. One should not discard all spatial information and depend solely on frequency domain parameters for classification. Instead, combining both domains boosts performance considerably. A simple hard combination of features from both domains using a Bilinear Pooling layer enhances the performance. However, in this case, there is no information about the manipulation location on the face, giving limited room for information flow between both domains. We used input from the first decoder layer with the ScAB block to show the impact of integrating pixel-wise forgery-localized spatial information with frequency domain features. This permits only the altered spatial pixels to be shared with the detection branch rather than features from the entire face, thereby learning a better combination of shared representations from both feature domains. Table 7 and Fig. 7 illustrate that the best results for both within-dataset and cross-dataset evaluation are achieved through the combination of frequency and localized spatial domain features.

Table 7
Component analysis on the proposed detection branch for the high-quality (HQ) and low-quality (LQ) FF++ and DFD datasets. Each component is gradually incorporated and evaluated to compare the ACC (%) results
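The fusion described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names `bilinear_fuse` and `mask_spatial`, the tensor shapes, and the signed-square-root and L2 normalization steps are hypothetical choices made for the example.

```python
import numpy as np


def mask_spatial(spat_feat: np.ndarray, loc_map: np.ndarray) -> np.ndarray:
    """Weight spatial features (C, H, W) by a localization map (H, W) in [0, 1].

    Assumption: the localization map acts as a soft mask, so that mostly
    the manipulated pixels contribute to the fused representation.
    """
    return spat_feat * loc_map[None, :, :]


def bilinear_fuse(freq_feat: np.ndarray, spat_feat: np.ndarray) -> np.ndarray:
    """Fuse feature maps of shape (C1, H, W) and (C2, H, W) by bilinear pooling.

    Returns a vector of length C1 * C2 (L2-normalized unless it is all zeros).
    """
    c1, h, w = freq_feat.shape
    c2 = spat_feat.shape[0]
    if spat_feat.shape[1:] != (h, w):
        raise ValueError("feature maps must share the same spatial size")
    f = freq_feat.reshape(c1, h * w)
    s = spat_feat.reshape(c2, h * w)
    # Average of the per-location outer products between the two feature sets.
    pooled = (f @ s.T) / (h * w)  # shape (C1, C2)
    vec = pooled.reshape(-1)
    # Signed square root followed by L2 normalization (illustrative choice).
    vec = np.sign(vec) * np.sqrt(np.abs(vec))
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    freq = rng.standard_normal((4, 8, 8))   # hypothetical frequency features
    spat = rng.standard_normal((3, 8, 8))   # hypothetical decoder features
    loc = rng.uniform(size=(8, 8))          # hypothetical localization map
    fused = bilinear_fuse(freq, mask_spatial(spat, loc))
    print(fused.shape)
```

Calling `bilinear_fuse(freq, spat)` directly corresponds to the hard combination without location information, while passing `mask_spatial(spat, loc)` mimics sharing only localized spatial features with the detection branch.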

Deepfake manipulation localization network
In this ablation study, we aimed to investigate the impact of spatial and channel attention on the performance of the localization branch. We conducted quantitative experiments and provided visualizations to demonstrate the importance of the ScAB block. We compared the performance of three models trained with different components:
• Model-A: Localization branch with spatial channel attention block (scAB).
• Model-B: Localization branch with spatial attention block only (sAB).
• Model-C: Localization branch without any attention block (W/O AB).
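The three configurations can be illustrated with a small NumPy sketch. It is a simplified stand-in, not the scAB architecture of the paper: the squeeze-and-excitation style channel gate, the mean/max spatial gate without a convolution, and all names (`channel_attention`, `spatial_attention`, `attention_block`, `w1`, `w2`) are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Reweight the channels of x (C, H, W) with a gate computed from global context.

    w1 has shape (C_r, C) and w2 has shape (C, C_r); both are hypothetical
    learned weights of a two-layer bottleneck.
    """
    squeezed = x.mean(axis=(1, 2))            # global average pool, shape (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)   # ReLU bottleneck
    gate = sigmoid(w2 @ hidden)               # per-channel weights in (0, 1)
    return x * gate[:, None, None]


def spatial_attention(x: np.ndarray) -> np.ndarray:
    """Reweight each location of x (C, H, W) with a gate built from channel statistics.

    Simplification: the channel-wise mean and max maps are summed directly,
    whereas practical blocks usually pass them through a convolution first.
    """
    gate = sigmoid(x.mean(axis=0) + x.max(axis=0))  # shape (H, W)
    return x * gate[None, :, :]


def attention_block(x, mode, w1=None, w2=None):
    """mode: 'spatial_channel' (Model-A), 'spatial' (Model-B), 'none' (Model-C)."""
    if mode == "none":
        return x
    if mode == "spatial":
        return spatial_attention(x)
    if mode == "spatial_channel":
        return spatial_attention(channel_attention(x, w1, w2))
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.standard_normal((8, 16, 16))
    w1 = rng.standard_normal((2, 8))
    w2 = rng.standard_normal((8, 2))
    for mode in ("spatial_channel", "spatial", "none"):
        print(mode, attention_block(feat, mode, w1, w2).shape)
```

The only difference between the three modes is which gates are applied to the same decoder features, which mirrors how the ablation isolates the contribution of each attention type.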

Localization results from Model-A, Model-B, and Model-C on FF++ dataset
We conducted an ablation study on the localization network using FF++ (C23) and FF++ (C40) for training and evaluation. The results are summarized in Table 8. The study revealed that including attention blocks significantly improved the performance of the localization branch. Model-C, with no attention block, exhibited the lowest performance. On the FF++ (C23) dataset, Model-B, relying on a spatial attention block (sAB) only, outperformed Model-A. However, on highly compressed images from the FF++ (C40) dataset, where a substantial amount of information is lost due to high compression, the sAB was outperformed by the ScAB. This outcome can be attributed to the compression affecting both local image features and their surroundings. The reliance of Model-C on local image features, and of Model-B solely on spatial attention to expand the areas of local image features, can lead to inaccurate attention weights and mislocalizations, particularly due to the loss of crucial details in highly compressed images. In contrast, the scAB, capturing not only global but also contextual information through channel attention, proved to be especially effective for the challenging FF++ (C40) dataset. Consequently, Model-A achieved the highest performance on FF++

Conclusion
This paper addresses the problem of detecting and localizing facial manipulations in deepfake images using a multi-task learning approach. Our proposed method incorporates an attention mechanism to process the feature maps for both detection and localization tasks. By enabling information exchange between these tasks, we observed an overall improvement in the network's performance, particularly for unseen datasets.
To enhance the performance and provide localization of face forgery, we introduce a strategy that combines the encoder and the preceding decoder layer with the dual attention block scAB. This approach localizes the manipulated facial regions at the pixel level. Through extensive experiments on three deepfake benchmarks, we demonstrate that our model tends to focus on the forgery regions instead of unwanted biases and artifacts, leading to more accurate predictions. Furthermore, we empirically show that the utilization of multiple attention blocks enhances the model's ability to localize manipulated regions. This improvement contributes to achieving state-of-the-art performance in forgery detection. With the improved visual quality of deepfake-generated faces, the detection problem remains highly challenging, resulting in a generalization gap between training datasets and unseen datasets created by other approaches. In our future research, we aim to address this generalization gap by combining transfer learning with additional strategies, creating a comprehensive framework to further narrow this disparity.

Waseem et al. EURASIP Journal on Image and Video Processing (2023) 2023:14

Fig. 1
Illustration of deepfake face swap and expression swap manipulations along with the corresponding localization maps generated by the proposed approach. The localization maps emphasize the specific regions on the face that have undergone manipulation

Fig. 4
Illustration of Spatial Channel Attention Block (scAB) and attention block for frequency spectrum features

Fig. 7
The detection performance of different features on FF++ and DFD datasets. ID1 only uses frequency domain features, ID2 uses spatial and frequency domain features without a ScAB block, and ID3 includes a ScAB block

Figure 8 showcases the localization results of each model. Model-A, equipped with both spatial and channel attention blocks, exhibited more focused attention on the manipulated pixel regions, incorporating local, global, and contextual information through spatial and channel attention from the image. Furthermore, the localization results obtained with ScAB successfully highlight the forged pixels of the manipulated regions, even in low-quality FF++ faces. This highlights the effectiveness of ScAB in addressing the challenges posed by low-quality images.

Fig. 8
Localization results from Model-A, Model-B, and Model-C on FF++ dataset

Table 1
Summary of face and expression swap datasets

Table 4
Comprehensive evaluation of localization loss functions on FF++ dataset

Table 5
Facial manipulation localization performance in terms of accuracy on Face2Face and FaceSwap datasets with high video quality. Bold values indicate the best performance on the specific dataset in each column

Table 6
Localization network cross-data evaluation on DFDC-P with and without augmentation

Table 8
Comprehensive evaluation of localization performance using AUC and mIoU metrics for all models on the FF++ dataset with two levels of video quality
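
For readers unfamiliar with the mIoU metric named in the Table 8 caption, the following sketch shows one common way to compute it for a binary manipulation mask. The exact definition used in the paper (for example, the threshold or whether averaging is per image or per class) is not stated in this section, so the 0.5 threshold, the averaging over the two classes, and the function names `class_iou` and `mean_iou` are assumptions.

```python
import numpy as np


def class_iou(pred: np.ndarray, target: np.ndarray, cls: int) -> float:
    """Intersection over union for one class label; NaN if the class is absent."""
    p = pred == cls
    t = target == cls
    union = np.logical_or(p, t).sum()
    if union == 0:
        return float("nan")
    return float(np.logical_and(p, t).sum() / union)


def mean_iou(pred_prob: np.ndarray, target: np.ndarray, threshold: float = 0.5) -> float:
    """Mean IoU over the 'pristine' (0) and 'manipulated' (1) classes.

    pred_prob holds per-pixel manipulation probabilities, target holds the
    binary ground-truth mask. The threshold is an illustrative assumption.
    """
    pred = (pred_prob >= threshold).astype(np.uint8)
    target = target.astype(np.uint8)
    ious = [class_iou(pred, target, c) for c in (0, 1)]
    # Classes absent from both masks are ignored in the average.
    return float(np.nanmean(ious))


if __name__ == "__main__":
    target = np.array([[1, 1], [0, 0]])
    pred_prob = np.array([[0.9, 0.2], [0.1, 0.1]])
    # Class 1 IoU is 1/2 and class 0 IoU is 2/3, so the mean is 7/12.
    print(mean_iou(pred_prob, target))
```

A prediction that matches the ground-truth mask exactly yields an mIoU of 1.0, and missing manipulated pixels lowers the score for both classes, since those pixels are then counted as false pristine pixels.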