
Face image de-identification based on feature embedding

Abstract

With the growth of social networking services, a large number of images are available on the Internet, and many of them are face photos or contain faces. To prevent the malicious use of such images, their privacy must be protected by face image de-identification techniques, which make face recognition difficult and thereby prevent the collection of face images of a specific person using face recognition. In this paper, we propose a face image de-identification method that generates a de-identified image from an input face image by embedding facial features extracted from the face image of another person into the input face image. We develop a novel framework for embedding facial features into a face image, together with loss functions based on images and features, to de-identify a face image while preserving its appearance. Through a set of experiments using public face image datasets, we demonstrate that the proposed method exhibits higher de-identification performance against unknown face recognition models than conventional methods while preserving the appearance of the input face images.

1 Introduction

Face recognition is one of the practical biometric technologies and uses characteristics such as the positions of facial parts, their geometric relations, and textures [1]. Because it uses face images captured by the standard cameras built into information terminals for authentication, face recognition does not require a dedicated sensor and is hygienic and convenient to use. These features make face recognition a widely used method for smartphone login, airport immigration control, and other applications. In addition to the individuality that identifies a person, a face contains more information than other biometric characteristics, such as a fingerprint or an iris. For example, a face contains attribute information, such as age, gender, race, and hair color, that is used for marketing, image retrieval, and criminal investigation. When a face image is used for purposes not intended by the individual, such use is considered to be a privacy violation [2]. Since a variety of information can be extracted from face images, it is necessary to protect privacy before releasing face images to the public.

With the growth of Social Networking Services (SNS), a large number of images are available on the Internet, and many of them are face photos or contain faces. Face recognition technology can be used to easily collect face images of a specific person from the Internet. These collected images may be used for behavioral tracking, spoofing attacks against face recognition systems, and constructing face image datasets. Therefore, it is necessary to protect the privacy of face images to prevent their malicious use. By applying image processing that makes face recognition difficult to face images, we can prevent individuals from being identified and reduce the risk of such malicious use [3]. This paper focuses on face image de-identification techniques that make face recognition difficult and thereby prevent the collection of face images of a specific person using face recognition.

Ad hoc de-identification approaches make face recognition difficult by applying masking [4], blurring [5], and pixelization [6] to face images. Face images de-identified by these methods, i.e., de-identified images, cannot be used for personal authentication since they do not preserve the appearance of the original face image, resulting in a limited range of applications. In addition, when de-identified images are posted on SNS, it is important that the original appearance is preserved in the de-identified image. De-identification methods [7,8,9,10,11] that make face recognition difficult without significantly changing the face image utilize Adversarial Examples (AEs) [12], Generative Adversarial Networks (GAN) [13], and a diffusion model [14]. Among them, the methods using AEs that generate images with perturbations that induce errors in the classification models are major approaches for de-identification. In de-identification methods [7,8,9] using AEs, perturbations that make face recognition difficult are generated, and these perturbations are added to the face image to obtain a de-identified image. A larger perturbation improves the performance of de-identification, resulting in a significant change in the appearance of the face image. A smaller perturbation preserves the appearance of the face image, while it does not provide sufficient de-identification performance. Therefore, it is necessary to consider the balance between the image quality and the de-identification performance of the de-identified images. These methods may degrade the de-identification performance against unknown face recognition models since the generated perturbations depend on the face recognition model used in training. Since the face recognition model actually used by the attacker is unknown, it is necessary to have sufficient de-identification performance against unknown face recognition models. 
In addition, perturbations in AEs are generated at the pixel level to make face recognition difficult, resulting in a degradation of de-identification performance due to simple image processing, such as blurring. It is also necessary for the de-identified images to be robust against image processing.

To address the above problems, we propose a face image de-identification method based on feature embedding. The proposed method generates a de-identified image that is recognized as another person by a face recognition model while preserving the appearance of the target face image by embedding facial features extracted from the face image of another person into the target face image. We improve the image quality and the de-identification performance against unknown face recognition models by training a convolutional neural network (CNN) that generates de-identified images using loss functions based on images and features. The proposed method is robust to image processing since face images are de-identified by embedding facial features. On the other hand, the de-identified image generated by the proposed method may be recognized as the person whose features are embedded in the target face image. To solve this problem, the proposed method embeds the facial features extracted from a fake face image generated by a face image generation model. To train the CNN used in the proposed method, we use two public face image datasets: Large-scale CelebFaces Attributes (CelebA) [15] and Generated Faces by StyleGAN2 (GFSG2), whose images are generated by StyleGAN2 [16]. In the performance evaluation, we use Labeled Faces in the Wild (LFW) [17]. Through a set of experiments, we demonstrate that the proposed method exhibits higher de-identification performance against unknown face recognition models than conventional methods while preserving the appearance of the target face images, and that the proposed method is robust to image processing.

Note that this paper is an extended version of our conference paper [18]. The major differences are the addition of the loss function used for training the proposed method, the ablation study on the loss functions, the analysis of the de-identified images generated by each method, and the evaluation of the robustness of the de-identified images generated by each method to image processing.

2 Related work

This section describes conventional face image de-identification methods and attack methods that degrade de-identification performance.

2.1 Face image de-identification

The traditional methods of face image de-identification are masking [4], blurring [5], and pixelization [6] of face images. Figure 1 shows examples of de-identified face images obtained by masking, blurring, and pixelization. As shown in these examples, these methods make face images difficult to identify for both humans and face recognition models, resulting in the limited applications of such de-identified face images. When a person posts his or her own face images on the Internet, such as in SNS, it is important that the face images can be recognized by humans. Therefore, it is necessary to develop a de-identification method that makes face recognition difficult while preserving the appearance of the original face images.

Fig. 1 Examples of de-identified face images obtained by masking [4], blurring [5], and pixelization [6] of face images

Several methods have been proposed to de-identify face images without significantly changing their appearance using deep learning. Major methods [7,8,9] utilize AEs [12], which are images that are perturbed to induce errors in the classification models. Face images can be de-identified by adding perturbations that make face recognition difficult. Larger perturbations enhance the de-identification performance, although they significantly change the appearance of the face images. On the other hand, smaller perturbations preserve the appearance of the face images, although they do not provide sufficient de-identification performance. Therefore, a balance between the appearance of the de-identified image and the de-identification performance is important in face image de-identification. Yang et al. proposed two types of face de-identification methods: Landmark-Guided Cutout (LGC) [7] and the Targeted Identity-Protection Iterative Method (TIP-IM) [8]. LGC [7] adds constraints based on facial landmarks to the Momentum Iterative Method (MIM) [19], which can be applied to a variety of images, and thereby specializes it for the de-identification of face images. The balance between appearance and de-identification performance can be adjusted by the hyperparameter \(\epsilon\) that controls the magnitude of the perturbation. TIP-IM [8] generates de-identified images based on the Maximum Mean Discrepancy (MMD) [20], which is the difference between the data distribution of a set of original face images and a set of de-identified images. The balance between appearance and de-identification performance can be adjusted by the weight \(\gamma\) of the MMD-based loss function in addition to \(\epsilon\). Shan et al. proposed Fawkes [9], which constrains perturbations based on the Structural Similarity Index Measure (SSIM) [21], which measures the similarity between the appearance of the original face image and the appearance of the de-identified image. 
The balance between appearance and de-identification performance can be adjusted through three modes that control the magnitude of the perturbation. As shown in Fig. 2, the larger the perturbation, the better the de-identification performance, while the appearance of the face image changes significantly. A smaller perturbation preserves the appearance of the face image, while not providing sufficient de-identification performance. There is a trade-off between the image quality and the de-identification performance of the de-identified images. In addition, the perturbations added to AEs depend strongly on the face recognition model on which the perturbation is generated. Although AEs are effective for the face recognition model on which the perturbation is generated, they may not exhibit sufficient de-identification performance for unknown face recognition models. The face recognition model used by an attacker is generally unknown. Therefore, face de-identification methods that are independent of face recognition models are necessary to enhance the practicality of face image de-identification.

In addition to AE-based methods, there are de-identification methods [10, 11] using GAN [13] and a diffusion model [14]. The method [10] using GAN generates a face image using StyleGAN [22] and optimizes the latent variables input to StyleGAN so that a de-identified face image is generated for that image while preserving the individuality of the input image. The method [11] using the diffusion model iteratively performs the sampling process of the diffusion model so that a de-identified face image is generated for that image while preserving the individuality of the input image. Compared to the methods using AEs, both methods preserve the appearance of the input image since no significant noise appears. On the other hand, since the code of neither method is publicly available and the methods therefore cannot be compared, this paper focuses on comparing the de-identification performance of the proposed method with that of the AE-based methods whose code is publicly available.

Fig. 2 Examples of de-identified face images obtained by an AE-based method with changing the balance parameter

2.2 Attacks against face image de-identification

The effect of de-identification can be reduced by applying defense methods in adversarial attacks to de-identified images. Adversarial learning [12, 23] is a major defense method in adversarial attacks. We can suppress false positives caused by de-identified images by training a face recognition model using a dataset that includes de-identified images. On the other hand, face recognition models with adversarial learning are known to degrade the recognition accuracy against real face images [24]. Several methods have been proposed to classify whether an image is real or de-identified before it is input to a face recognition model to prevent degradation of recognition accuracy [25,26,27]. These methods are implemented in face recognition systems for the purpose of detecting de-identification as an attack. There are several methods to reduce the effect of de-identification, such as Median Blur [28], Bit-depth Reduction [28], Gaussian Blur [29], and JPEG Encoding [29]. By applying these methods to de-identified images obtained by AE-based methods, the de-identification performance can be reduced without depending on the face recognition models.

3 Face image de-identification based on feature embedding

This section describes a de-identification method for face images by embedding facial features of other persons into the face images. The proposed method is inspired by deep steganography [30, 31], which generates a stego image by embedding another image into the input image, i.e., cover image, while preserving the appearance of the input image. Face images de-identified by the proposed method have high image quality since the face images are not perturbed like AEs. Figure 3 illustrates the overview of the proposed method, which is used in the inference phase. We use the following notation: the face image to be de-identified is denoted as cover image C, the original face image of the facial features to be embedded in C is denoted as E, and the de-identified face image is denoted as D. We describe below the details of the network architecture of the proposed method and the loss functions used in training.

Fig. 3 Overview of the proposed method, which is used in the inference phase

3.1 Network architecture

The proposed method consists of an Extracting Network (EN) that extracts facial features \(f_{\textrm{en}}(E)\) from a face image E and a Hiding Network (HN) that embeds facial features \(f_{\textrm{en}}(E)\) into a face image C. EN is a trained face recognition model, which is used to extract facial features \(f_{\textrm{en}}(E)\), \(f_{\textrm{en}}(C)\), and \(f_{\textrm{en}}(D)\) from an embedding image E, a cover image C, and a de-identified image D, respectively. Note that the same face recognition model \(f_{\textrm{en}}\) must be used to extract \(f_{\textrm{en}}(E)\) and \(f_{\textrm{en}}(D)\) in training, while a different face recognition model can be used for feature extraction from de-identified face images D in test. The size of face images E, C, and D is \(256 \times 256\) pixels with RGB channels and the size of facial features extracted from face images is \(1 \times 512\). HN generates de-identified image D by embedding facial features \(f_{\textrm{en}}(E)\) into cover image C to be de-identified. Figure 4 shows the network architecture of HN used in the proposed method. HN is designed based on U-Net [32] that is widely used as an image generation model. U-Net consists of an encoder and a decoder, and these are connected by skip connections to suppress gradient vanishing. We employ residual blocks used in ResNet [33] in the encoder to further suppress gradient vanishing. The facial features \(f_{\textrm{en}}(E)\) are not directly embedded, but are replicated to the same size as cover image C before embedding. First, a \(1 \times 512\) face feature \(f_{\textrm{en}}(E)\) is transformed into a 2D matrix of \(2 \times 256\) and expanded to \(256 \times 256\) by duplicating and merging 128 of them in the height direction. Next, this \(256 \times 256\) matrix is replicated three times and combined in the channel direction to expand it to the same size as cover image C with \(3 \times 256 \times 256\). 
Then, the expanded facial features \(f_{\textrm{en}}(E)\) are concatenated with cover image C in the channel direction, and the \(6 \times 256 \times 256\) matrix data are input to HN.
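The feature expansion described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names `expand_feature` and `build_hn_input` are hypothetical, the 128 copies are assumed to be stacked as whole \(2 \times 256\) blocks (rather than repeating each row), and the cover image is assumed to come first in the channel direction, since the text does not specify either ordering.

```python
import numpy as np


def expand_feature(feature: np.ndarray, size: int = 256) -> np.ndarray:
    """Expand a 1 x 512 facial feature to a 3 x size x size array."""
    feature = np.asarray(feature, dtype=np.float32).reshape(-1)
    if feature.size != 2 * size:
        raise ValueError("feature length must be 2 * size")
    block = feature.reshape(2, size)        # 1 x 512 -> 2 x 256
    # Assumption: the 2 x 256 block is stacked 128 times along the height.
    plane = np.tile(block, (size // 2, 1))  # 2 x 256 -> 256 x 256
    # Replicate the plane three times in the channel direction.
    return np.repeat(plane[np.newaxis], 3, axis=0)


def build_hn_input(cover: np.ndarray, feature: np.ndarray) -> np.ndarray:
    """Concatenate a 3 x H x W cover image with the expanded feature."""
    expanded = expand_feature(feature, size=cover.shape[-1])
    # Assumption: cover channels first, feature channels second.
    return np.concatenate([cover, expanded], axis=0)


cover = np.zeros((3, 256, 256), dtype=np.float32)  # stand-in for C
feature = np.arange(512, dtype=np.float32)         # stand-in for f_en(E)
hn_input = build_hn_input(cover, feature)
print(hn_input.shape)  # (6, 256, 256)
```

With this layout, every even row of each feature channel holds the first 256 feature values and every odd row holds the remaining 256, so the whole feature vector is available at each pair of rows of the cover image.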

Fig. 4 Network architecture of HN used in the proposed method

3.2 Loss functions

We use three loss functions to control the appearance and two loss functions to control the de-identification performance in the training of HN. Figure 5 shows an overview of the loss functions that control appearance, and Fig. 6 shows an overview of the loss functions that control the de-identification performance. Note that both flows shown in Figs. 5 and 6 are used in the training phase.

Fig. 5 Loss functions controlling appearance in the training of HN

Fig. 6 Loss functions controlling de-identification performance in the training of HN

3.2.1 Loss functions controlling appearance

We describe the details of three loss functions that control the appearance of the de-identified face image as shown in Fig. 5.

(i) Reconstruction loss \(\mathcal {L}_{\textrm{rec}}\): \(\mathcal {L}_{\textrm{rec}}\) is a loss function that reduces the pixel-wise difference between cover image C and de-identified image D, and is defined by:

$$\begin{aligned} \mathcal {L}_{\textrm{rec}} = ||C - D||^2_2. \end{aligned}$$
(1)

(ii) Perception loss \(\mathcal {L}_{\textrm{perc}}\) [34]: Cover image C and de-identified image D are input to VGG-19 [35] trained on ImageNet [36] to obtain the features \(f_{\textrm{vgg}}(C)\) and \(f_{\textrm{vgg}}(D)\) output from the final layer. Note that the trained VGG-19 is used according to [34]. \(\mathcal {L}_{\textrm{perc}}\) is a loss function that reduces the difference between \(f_{\textrm{vgg}}(C)\) and \(f_{\textrm{vgg}}(D)\), and is defined by:

$$\begin{aligned} \mathcal {L}_{\textrm{perc}} = ||f_{\textrm{vgg}}(C) - f_{\textrm{vgg}}(D)||_1. \end{aligned}$$
(2)

The image features output from the final layer of VGG-19 include global features related to the color and object shape of the whole image. By reducing the differences between the image features output from the final layer, the appearance of the images can be made closer to human perception.

(iii) Learned Perceptual Image Patch Similarity (LPIPS) loss \(\mathcal {L}_{\textrm{lpips}}\) [37]: \(\mathcal {L}_{\textrm{lpips}}\) is designed based on Learned Perceptual Image Patch Similarity (LPIPS) [37], which is an image quality metric. Cover image C and de-identified image D are input to VGG-16 [35] trained on ImageNet [36] to obtain the features \(c^l\) and \(d^l\) \((l=1, 2, \ldots , L)\) output from each layer, where \(c^l=f^l_{\textrm{vgg}}(C)\), \(d^l=f^l_{\textrm{vgg}}(D)\), and L is the total number of layers in VGG-16. Note that the trained VGG-16 is used according to [37]. \(\mathcal {L}_{\textrm{lpips}}\) is a loss function that reduces the difference between the weighted sum of \(c^l\) and \(d^l\), and is defined by:

$$\begin{aligned} \mathcal {L}_{\textrm{lpips}} = \sum ^L_{l=1}\frac{1}{H^lW^l}\sum ^{H^l}_{i=1}\sum ^{W^l}_{j=1}||\alpha ^l \odot (c^l_{ij}-d^l_{ij})||^2_2, \end{aligned}$$
(3)

where \(H^l\) and \(W^l\) are the height and width of the feature map output from layer l, respectively. \(\alpha ^l\) indicates the weights of each channel for the features output from layer l, and \(\odot\) indicates an operator for element-wise product. By reducing the difference between the image features output in the intermediate layers, the appearance between images is made closer by taking into account the local features extracted in the shallow layers and the global features extracted in the deep layers. \(\mathcal {L}_{\textrm{lpips}}\) can make the appearance of images closer to human perception than \(\mathcal {L}_{\textrm{perc}}\). In the calculation of LPIPS, AlexNet [38] or VGG-16 trained on the ImageNet dataset is used as the feature extractor [37]. In this paper, VGG-16 is used to calculate \(\mathcal {L}_{\textrm{lpips}}\) for fair evaluation since AlexNet is used to calculate LPIPS in the performance evaluation metric.
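As a reading aid for Eq. (3), the following NumPy sketch evaluates the layer-wise sum for feature maps that have already been extracted. It is a minimal illustration, not the authors' code: the function name `lpips_style_distance` is hypothetical, the feature maps are assumed to be given as channel-first arrays, and the channel weights \(\alpha^l\) are passed in rather than loaded from a trained LPIPS model. The published LPIPS metric [37] also unit-normalizes the features along the channel dimension before taking the difference; that step does not appear in Eq. (3) and is therefore omitted here.

```python
import numpy as np


def lpips_style_distance(feats_c, feats_d, alphas):
    """Evaluate Eq. (3) for per-layer feature maps of C and D.

    feats_c, feats_d: lists of arrays with shape (channels, H, W).
    alphas: list of per-channel weight vectors with shape (channels,).
    """
    total = 0.0
    for c, d, alpha in zip(feats_c, feats_d, alphas):
        c = np.asarray(c, dtype=np.float64)
        d = np.asarray(d, dtype=np.float64)
        alpha = np.asarray(alpha, dtype=np.float64)
        # Element-wise product of the channel weights with the difference.
        weighted = alpha[:, None, None] * (c - d)
        # Squared L2 norm over channels at every spatial position (i, j).
        sq_norm = np.sum(weighted ** 2, axis=0)
        # Average over the H x W positions, then accumulate over layers.
        total += float(sq_norm.mean())
    return total


# Toy features for two layers; the second layer is identical for C and D.
feats_c = [np.ones((2, 2, 2)), np.ones((3, 1, 1))]
feats_d = [np.zeros((2, 2, 2)), np.ones((3, 1, 1))]
alphas = [np.array([1.0, 2.0]), np.array([1.0, 1.0, 1.0])]
distance = lpips_style_distance(feats_c, feats_d, alphas)
print(distance)  # 5.0
```

In the toy example only the first layer contributes: each position has a weighted difference of (1, 2), whose squared norm is 5.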

3.2.2 Loss functions controlling de-identification performance

We describe the details of two loss functions that control the de-identification performance of the de-identified face image as shown in Fig. 6.

(iv) De-identification near loss \(\mathcal {L}_{\textrm{near}}\): The face image E and de-identified image D are input to EN to obtain facial features \(f_{\textrm{en}}(E)\) and \(f_{\textrm{en}}(D)\), respectively. \(\mathcal {L}_{\textrm{near}}\) is a loss function that makes \(f_{\textrm{en}}(D)\) extracted from D similar to \(f_{\textrm{en}}(E)\) extracted from E by increasing the cosine similarity between \(f_{\textrm{en}}(E)\) and \(f_{\textrm{en}}(D)\), and is defined by:

$$\begin{aligned} \mathcal {L}_{\textrm{near}} = 1-\cos (f_{\textrm{en}}(E), f_{\textrm{en}}(D)), \end{aligned}$$
(4)

where \(\cos (f_{\textrm{en}}(E), f_{\textrm{en}}(D))\) indicates the cosine similarity between \(f_{\textrm{en}}(E)\) and \(f_{\textrm{en}}(D)\). By making the facial feature \(f_{\textrm{en}}(D)\) similar to the facial feature \(f_{\textrm{en}}(E)\), HN generates the de-identified image D that the face recognition model will recognize as the person in the face image E.

(v) De-identification far loss \(\mathcal {L}_{\textrm{far}}\): \(\mathcal {L}_{\textrm{far}}\) is a loss function that makes \(f_{\textrm{en}}(C)\) extracted from C dissimilar to \(f_{\textrm{en}}(D)\) extracted from D by decreasing the cosine similarity between \(f_{\textrm{en}}(C)\) and \(f_{\textrm{en}}(D)\), and is defined by:

$$\begin{aligned} \mathcal {L}_{\textrm{far}} = \cos (f_{\textrm{en}}(C),f_{\textrm{en}}(D)), \end{aligned}$$
(5)

where \(\cos (f_{\textrm{en}}(C), f_{\textrm{en}}(D))\) indicates the cosine similarity between \(f_{\textrm{en}}(C)\) and \(f_{\textrm{en}}(D)\). By making the facial features \(f_{\textrm{en}}(D)\) dissimilar to the facial features \(f_{\textrm{en}}(C)\), HN generates the de-identified image D that the face recognition model cannot recognize as the person in the cover image C.
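Equations (4) and (5) reduce to a few lines once the facial features are available as vectors. The sketch below is illustrative only: the names `cosine_similarity`, `near_loss`, and `far_loss` are hypothetical, and the small constant `eps` guarding against zero-length vectors is an added assumption that is not part of the equations.

```python
import numpy as np


def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity between two feature vectors."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    denom = max(np.linalg.norm(u) * np.linalg.norm(v), eps)
    return float(np.dot(u, v) / denom)


def near_loss(f_e, f_d):
    """Eq. (4): small when f_en(D) points in the direction of f_en(E)."""
    return 1.0 - cosine_similarity(f_e, f_d)


def far_loss(f_c, f_d):
    """Eq. (5): small when f_en(D) is dissimilar to f_en(C)."""
    return cosine_similarity(f_c, f_d)


# Toy 2-D features: D is aligned with E and orthogonal to C.
f_c = np.array([1.0, 0.0])
f_e = np.array([0.0, 1.0])
f_d = np.array([0.0, 2.0])
print(near_loss(f_e, f_d))  # 0.0
print(far_loss(f_c, f_d))   # 0.0
```

Note that \(\mathcal {L}_{\textrm{near}}\) ranges over [0, 2] while \(\mathcal {L}_{\textrm{far}}\) ranges over [-1, 1], so the two terms are not on the same scale before weighting.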

3.2.3 Total loss function

The total loss function \(\mathcal {L}\) for training HN is defined by:

$$\begin{aligned} \mathcal {L} = \lambda _{\textrm{rec}}\mathcal {L}_{\textrm{rec}} + \lambda _{\textrm{perc}}\mathcal {L}_{\textrm{perc}} + \lambda _{\textrm{lpips}}\mathcal {L}_{\textrm{lpips}} + \lambda _{\textrm{near}}\mathcal {L}_{\textrm{near}} + \lambda _{\textrm{far}}\mathcal {L}_{\textrm{far}}, \end{aligned}$$
(6)

where \(\lambda _{\textrm{rec}}\), \(\lambda _{\textrm{perc}}\), \(\lambda _{\textrm{lpips}}\), \(\lambda _{\textrm{near}}\), and \(\lambda _{\textrm{far}}\) are weights for \(\mathcal {L}_{\textrm{rec}}\), \(\mathcal {L}_{\textrm{perc}}\), \(\mathcal {L}_{\textrm{lpips}}\), \(\mathcal {L}_{\textrm{near}}\), and \(\mathcal {L}_{\textrm{far}}\), respectively. De-identified image D that preserves the appearance of C is generated by using \(\mathcal {L}_{\textrm{rec}}\), \(\mathcal {L}_{\textrm{perc}}\), and \(\mathcal {L}_{\textrm{lpips}}\). De-identified image D that makes the face recognition model misidentify D as the person in E is generated by using \(\mathcal {L}_{\textrm{near}}\) and \(\mathcal {L}_{\textrm{far}}\). Compared with our initial investigation [18], this paper newly introduces \(\mathcal {L}_{\textrm{far}}\). Since we have empirically confirmed that the use of all the loss functions is effective, we evaluate the effectiveness of \(\mathcal {L}_{\textrm{near}}\) and \(\mathcal {L}_{\textrm{far}}\) in this paper.
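The weighted sum in Eq. (6) can be written as a small configuration object plus one function, which also makes it easy to switch individual terms off. The sketch below is an illustration only: `LossWeights` and `total_loss` are hypothetical names, and the default weights are taken from setting (c) of the ablation study in Sect. 4.4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Weights of Eq. (6); defaults follow setting (c) in Sect. 4.4."""
    rec: float = 1.0
    perc: float = 1.0
    lpips: float = 1.0
    near: float = 0.14
    far: float = 0.14


def total_loss(l_rec, l_perc, l_lpips, l_near, l_far, weights=LossWeights()):
    """Weighted sum of the five loss terms, as in Eq. (6)."""
    return (weights.rec * l_rec
            + weights.perc * l_perc
            + weights.lpips * l_lpips
            + weights.near * l_near
            + weights.far * l_far)


# Setting (a) of Sect. 4.4 uses only the near term for de-identification.
near_only = LossWeights(near=0.30, far=0.0)
print(total_loss(0.5, 0.2, 0.1, 0.4, 0.6))
print(total_loss(0.5, 0.2, 0.1, 0.4, 0.6, weights=near_only))
```

Setting a weight to zero removes the corresponding term, which is how the three combinations of the ablation study can be expressed with the same function.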

4 Experiments and discussion

This section describes a variety of experiments using the public datasets to demonstrate the effectiveness of the proposed method. We describe the datasets used in the experiments, experimental conditions, evaluation metrics, the ablation study of the loss functions, the types of images to be embedded, comparisons with conventional methods, and the tolerance of de-identification to image processing.

4.1 Datasets

In the training of the proposed method, we use CelebA [15], which is a large-scale public face image dataset, and GFSG2 [16], which is a synthetic face image dataset generated by StyleGAN2. CelebA [15] consists of 202,599 face images of 10,177 persons. The randomly selected 199,599 images are used for training, and the remaining 3000 images are used for validation. GFSG2 [16] consists of 5000 synthetic face images, and all of them are used for training. We use LFW [17] to evaluate the performance of de-identified images. LFW [17] consists of 13,233 face images of 5749 persons. According to the evaluation protocol recommended by LFW, we extract 3000 pairs of face images of the same person, i.e., the genuine pair, for evaluating the performance of de-identification in 1-to-1 matching. In addition to these pairs, 200 persons are randomly selected, three images are extracted per person, and a total of 600 face images are used for evaluating the performance of de-identification in 1-to-N matching. All the face images are resized to \(256 \times 256\) pixels. Figure 7 shows examples of face images in CelebA, GFSG2, and LFW.

Fig. 7 Examples of face images in each dataset used in the experiments

4.2 Experimental condition

We use the face images from the CelebA dataset as the cover image C and the face image E to be embedded in training HN. If a fake face image is used for face image E, we use the face image from the CelebA dataset for cover image C and the generated face image from the GFSG2 dataset for face image E. During training, we employ data augmentation that randomly flips each of cover image C and face image E horizontally. We train for 150 epochs using Adam [39] as the optimizer. The initial learning rate is set to \(10^{-5}\), and is multiplied by 0.2 if the loss on the validation data does not improve for 5 consecutive epochs. After finishing the training, the performance of the proposed method is evaluated using the model with the weight parameters of the epoch with the lowest loss on the validation data. Table 1 shows EN, which extracts the features to be embedded in the cover image C in the proposed method, and the face recognition models used in the performance evaluation. We use “ArcFace”, which consists of iResNet50 as the backbone and ArcFace [40] as the loss function, as EN of the proposed method. In the performance evaluation, three face recognition models, “FaceNet”, “Softmax”, and “CosFace”, are used. “FaceNet” consists of InceptionResNet [41] as the backbone and the triplet loss [42] as the loss function, “Softmax” consists of iResNet50 [40] as the backbone and the softmax loss as the loss function, and “CosFace” consists of ResNet50 [33] as the backbone and the Large Margin Cosine Loss (LMCL) [43] as the loss function. By using face recognition models different from EN, which extracts facial features embedded in de-identified images D, for performance evaluation, we conduct a black box evaluation, i.e., evaluating the de-identification performance for unknown face recognition models.
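The learning-rate rule and the checkpoint selection described above can be sketched without any training framework. This is a minimal illustration under assumptions that the text leaves open: the function name `plateau_schedule` is hypothetical, "does not improve" is read as "is not strictly lower than the best validation loss so far", and the counter of non-improving epochs is reset after each reduction.

```python
def plateau_schedule(val_losses, initial_lr=1e-5, factor=0.2, patience=5):
    """Return the learning rate used in each epoch and the best epoch index."""
    lr = initial_lr
    best_loss = float("inf")
    best_epoch = None
    stale_epochs = 0
    lrs = []
    for epoch, loss in enumerate(val_losses):
        lrs.append(lr)  # learning rate in effect during this epoch
        if loss < best_loss:
            best_loss, best_epoch, stale_epochs = loss, epoch, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                lr *= factor      # multiply the learning rate by 0.2
                stale_epochs = 0  # assumption: restart counting afterwards
    return lrs, best_epoch


# Toy validation losses: two improvements, five stale epochs, one improvement.
val_losses = [1.0, 0.9, 0.95, 0.95, 0.95, 0.95, 0.95, 0.8]
lrs, best_epoch = plateau_schedule(val_losses)
print(lrs[6], lrs[7], best_epoch)
```

In the toy run the learning rate stays at \(10^{-5}\) through the fifth stale epoch and is reduced for the following epoch, and the weights of the last epoch would be kept because it has the lowest validation loss.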

Table 1 Face recognition models used for EN in the proposed method and for the performance evaluation

4.3 Evaluation metrics

In the experiments, we evaluate the image quality of de-identified images and the de-identification performance in 1-to-1 matching (verification) and 1-to-N matching (identification).

For image quality evaluation, the similarity between the cover image C and the de-identified image D is measured. We use peak signal-to-noise ratio (PSNR), SSIM [21], and LPIPS [37] as the evaluation metrics for image quality. PSNR is used to quantitatively evaluate the image quality of converted images in image transformations, such as image compression. PSNR evaluates image quality based on the mean squared error (MSE) of pixel values, and when evaluating 8-bit images, is defined by:

$$\begin{aligned} \textrm{PSNR}(A,B) = 10\log _{10} \frac{255^2}{\textrm{MSE}(A,B)} \ \ \mathrm{[dB]}, \end{aligned}$$
(7)

where A and B are the images before and after the transformation, respectively. Higher PSNR indicates higher image quality. However, PSNR cannot distinguish between small changes that occur in the whole image and large changes that occur in local regions. To address this problem, SSIM is used as an image quality evaluation metric that is closer to human perception. SSIM evaluates image quality based on differences in luminance (mean pixel value), contrast, and structure, and, for 8-bit images, is defined by:

$$\begin{aligned} \textrm{SSIM}(A,B) =\frac{(2\mu _A\mu _B+K_1)(2\sigma _{AB}+K_2)}{(\mu _A^2+\mu _B^2+K_1)(\sigma _A^2+\sigma _B^2+K_2)}, \end{aligned}$$
(8)

where \(\mu _A\) and \(\mu _B\) are the means of the pixel values of images A and B, respectively, \(\sigma _A\) and \(\sigma _B\) are the standard deviations of the pixel values of images A and B, respectively, \(\sigma _{AB}\) is the covariance of the pixel values of images A and B, and \(K_1\) and \(K_2\) are small constants to avoid division by zero. Higher SSIM indicates higher image quality. LPIPS is an image quality evaluation metric that is closer to human perception than PSNR and SSIM. LPIPS inputs images A and B to AlexNet [38] trained on the ImageNet dataset and evaluates image quality based on the weighted differences between the features output from each layer, as mentioned in Sect. 3.2.1, and is defined by Eq. (3). Lower LPIPS indicates higher image quality. In this paper, PSNR and SSIM are reported for reference, and we focus on the evaluation by LPIPS.
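Equations (7) and (8) can be checked numerically with a short NumPy sketch. It is an illustration under assumptions, not the evaluation code used in the experiments: the function names `psnr` and `global_ssim` are hypothetical, the default constants \((0.01 \times 255)^2\) and \((0.03 \times 255)^2\) are the values commonly used for 8-bit images and are not stated in the text, and Eq. (8) is evaluated once over the whole image, whereas common SSIM implementations average it over local windows.

```python
import numpy as np


def psnr(a, b, peak=255.0):
    """Eq. (7): PSNR in dB between two 8-bit images."""
    # Convert to float first so that uint8 subtraction cannot wrap around.
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))


def global_ssim(a, b, k1=(0.01 * 255) ** 2, k2=(0.03 * 255) ** 2):
    """Eq. (8) evaluated over the whole image (single window)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov_ab = np.mean((a - mu_a) * (b - mu_b))
    numerator = (2 * mu_a * mu_b + k1) * (2 * cov_ab + k2)
    denominator = (mu_a ** 2 + mu_b ** 2 + k1) * (var_a + var_b + k2)
    return float(numerator / denominator)


rng = np.random.default_rng(0)
image_a = rng.integers(0, 200, size=(256, 256, 3), dtype=np.uint8)
image_b = image_a + np.uint8(5)  # uniform brightness shift, no overflow
print(psnr(image_a, image_b), global_ssim(image_a, image_b))
```

A uniform shift of 5 gray levels gives an MSE of exactly 25, so the PSNR equals \(10\log _{10}(255^2/25)\), while the SSIM stays close to 1 because contrast and structure are unchanged.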

The performance of de-identification in 1-to-1 matching is evaluated using the genuine pairs, each consisting of two face images of the same person. First, for each of the 3000 genuine pairs \((I_1,I_2)\) in the evaluation dataset, we de-identify one face image \(I_1\) to obtain a de-identified image \(D_1\). Next, \(D_1\) and \(I_2\) are input to the face recognition model, and the facial features \(f_1\) and \(f_2\) are extracted. Finally, we calculate the cosine similarity between the facial features \(f_1\) and \(f_2\), and determine whether the pair is genuine or impostor based on a threshold. In this paper, we perform matching of 6000 pairs of the LFW dataset, i.e., 3000 genuine pairs and 3000 impostor pairs, in advance, and use the threshold with the highest accuracy. The performance of de-identification in 1-to-1 matching is evaluated by the Attack Success Ratio (ASR). ASR indicates the ratio of genuine pairs that are verified as impostor pairs after de-identification. Higher ASR indicates better de-identification performance.
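The threshold selection and the ASR computation described above can be sketched in plain Python. This is a minimal illustration with made-up similarity scores, not the evaluation code of the paper: the function names `best_threshold` and `attack_success_ratio` are hypothetical, a pair is assumed to be accepted as genuine when its score is greater than or equal to the threshold, and ties between equally accurate thresholds are resolved in favor of the lowest one.

```python
def best_threshold(genuine_scores, impostor_scores):
    """Pick the score threshold with the highest verification accuracy."""
    candidates = sorted(set(genuine_scores) | set(impostor_scores))
    total = len(genuine_scores) + len(impostor_scores)
    best_t, best_acc = None, -1.0
    for t in candidates:
        # A pair is accepted as genuine when its score is >= t.
        correct = (sum(s >= t for s in genuine_scores)
                   + sum(s < t for s in impostor_scores))
        acc = correct / total
        if acc > best_acc:  # keeps the lowest threshold among ties
            best_t, best_acc = t, acc
    return best_t, best_acc


def attack_success_ratio(deid_genuine_scores, threshold):
    """Share of de-identified genuine pairs rejected as impostor pairs."""
    rejected = sum(s < threshold for s in deid_genuine_scores)
    return rejected / len(deid_genuine_scores)


# Made-up cosine similarities for illustration only.
genuine = [0.8, 0.7, 0.6, 0.3]
impostor = [0.1, 0.2, 0.4, 0.05]
threshold, accuracy = best_threshold(genuine, impostor)
deid_scores = [0.5, 0.2, 0.1, 0.35]  # scores between D_1 and I_2
asr = attack_success_ratio(deid_scores, threshold)
print(threshold, accuracy, asr)  # 0.3 0.875 0.5
```

In the toy data, the threshold 0.3 classifies seven of the eight original pairs correctly, and two of the four de-identified genuine pairs fall below it, giving an ASR of 0.5.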

The performance of de-identification in 1-to-N matching is evaluated by a threefold cross-validation using 3 face images per person. First, we divide the 600 face images of 200 persons randomly selected from the LFW dataset into three sets, where each set has 200 face images including one image for each person. Next, we select two sets from the three sets, and assign one as the input image set and the other as the registered image set. The input image set and the registered image set contain different face images of the same 200 persons. All the face images in the input image set are de-identified, and they are matched with all the face images in the registered image set. Then, we determine the rank of the genuine registered image based on the matching scores for each face image in the input image set. The above process is performed for all six ordered combinations of input and registered image sets, and the average rank is obtained for each person. The performance of de-identification in 1-to-N matching is evaluated by the Cumulative Match Characteristic (CMC) curve, Rank-1, and Rank-5. The CMC curve is a cumulative relative frequency distribution over ranks, and indicates the percentage of input images whose genuine match is included within a given rank in the matching results. When the CMC curve is located in the upper left, the face recognition model is accurate, that is, the de-identification performance is low. Rank-1 and Rank-5 correspond to the values of the CMC curve at \(\textrm{Rank}=1\) and \(\textrm{Rank}=5\), respectively. Lower Rank-1 and Rank-5 indicate higher de-identification performance.
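The rank and CMC computation for a single combination of input and registered image sets can be sketched as follows. The sketch is illustrative only: the function names `genuine_ranks` and `cmc_curve` are hypothetical, the similarity scores are made up, the genuine registered image of input image i is assumed to sit at index i, and ties are ranked optimistically (only strictly higher scores push the genuine match down). The averaging over the six combinations described above is not included.

```python
def genuine_ranks(score_matrix):
    """Rank of the genuine registered image for every input image.

    score_matrix[i][j] is the similarity between de-identified input image i
    and registered image j; the genuine match is assumed to be at j == i.
    """
    ranks = []
    for i, row in enumerate(score_matrix):
        genuine_score = row[i]
        higher = sum(s > genuine_score for j, s in enumerate(row) if j != i)
        ranks.append(1 + higher)
    return ranks


def cmc_curve(ranks, max_rank):
    """Fraction of input images whose genuine match is within rank k."""
    n = len(ranks)
    return [sum(r <= k for r in ranks) / n for k in range(1, max_rank + 1)]


# Made-up similarity scores for 3 input images and 3 registered images.
scores = [
    [0.9, 0.1, 0.2],
    [0.6, 0.5, 0.7],
    [0.3, 0.8, 0.4],
]
ranks = genuine_ranks(scores)
cmc = cmc_curve(ranks, max_rank=3)
print(ranks)   # [1, 3, 2]
print(cmc[0])  # Rank-1 value of the CMC curve
```

With 200 registered persons, Rank-1 and Rank-5 would be read off as `cmc[0]` and `cmc[4]`; lower values mean that the genuine identity is retrieved less often after de-identification.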

4.4 Ablation study for loss functions

We describe the ablation study of the loss functions used for training HN in the proposed method. We evaluate the effectiveness of each loss function by training HN with different combinations of \(\mathcal {L}_{\textrm{near}}\) and \(\mathcal {L}_{\textrm{far}}\), which control the de-identification performance of the proposed method, as mentioned in Sect. 3.2.3. We consider three combinations of loss functions: (a) \(\mathcal {L}_{\textrm{near}}\), (b) \(\mathcal {L}_{\textrm{far}}\), and (c) \(\mathcal {L}_{\textrm{near}}+\mathcal {L}_{\textrm{far}}\). In all cases, \(\mathcal {L}_{\textrm{rec}}\), \(\mathcal {L}_{\textrm{perc}}\), and \(\mathcal {L}_{\textrm{lpips}}\) are used with \(\lambda _{\textrm{rec}}=\lambda _{\textrm{perc}}=\lambda _{\textrm{lpips}}=1.0\). We set \(\lambda _{\textrm{near}}=0.30\) in (a), \(\lambda _{\textrm{far}}=0.30\) in (b), and \(\lambda _{\textrm{near}}=\lambda _{\textrm{far}}=0.14\) in (c), where these values are empirically determined. The face image E from which the embedded features are extracted is taken from the CelebA dataset, and the de-identification performance is evaluated by ASR.
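The three settings can be written down as weight tables. The sketch below assumes that the training objective is a weighted sum of the individual loss terms with the \(\lambda\) coefficients, and that a loss term that is not used simply has weight 0; the exact objective is defined in Sect. 3.2.3 and is not reproduced here. The names `ABLATION_WEIGHTS` and `total_loss` are hypothetical.

```python
# Weights taken from the ablation settings (a), (b), and (c).
# Assumption: an unused loss term is represented by a weight of 0.0.
ABLATION_WEIGHTS = {
    "a": {"rec": 1.0, "perc": 1.0, "lpips": 1.0, "near": 0.30, "far": 0.0},
    "b": {"rec": 1.0, "perc": 1.0, "lpips": 1.0, "near": 0.0, "far": 0.30},
    "c": {"rec": 1.0, "perc": 1.0, "lpips": 1.0, "near": 0.14, "far": 0.14},
}


def total_loss(loss_values, weights):
    """Weighted sum of loss terms (assumed form of the training objective)."""
    return sum(weights[name] * loss_values[name] for name in weights)


# Toy usage: every loss term equal to 1.0, so the result is the sum of the weights.
unit_losses = {"rec": 1.0, "perc": 1.0, "lpips": 1.0, "near": 1.0, "far": 1.0}
loss_c = total_loss(unit_losses, ABLATION_WEIGHTS["c"])
```

Keeping the weights in one table makes it easy to see that (c) uses a smaller weight per feature loss than (a) or (b), while the image-quality terms stay fixed.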

Table 2 shows the evaluation results of image quality and de-identification performance, and Fig. 8 shows the generated de-identified images D for each combination. As shown in Table 2, (b) and (c) have a better balance between image quality and de-identification performance than (a). As shown in Fig. 8, the de-identified image D generated by (b) shows a large change in the region around the nose. The de-identified image D generated by (c) looks natural, with almost no change in the appearance of the cover image C. The purpose of de-identification is not to misidentify the de-identified image D as another person, but to make it difficult to recognize the person in the cover image C. Therefore, the loss function \(\mathcal {L}_{\textrm{far}}\) that separates the facial features extracted from the de-identified image D from those extracted from the cover image C is effective. However, \(\mathcal {L}_{\textrm{far}}\) does not consider facial structure, resulting in unnatural appearance of the de-identified image D. By combining \(\mathcal {L}_{\textrm{far}}\) and \(\mathcal {L}_{\textrm{near}}\), we can generate a natural de-identified image D since the facial structure of the cover image C is considered while the facial features are separated. In the proposed method in the following experiments, both \(\mathcal {L}_{\textrm{far}}\) and \(\mathcal {L}_{\textrm{near}}\) are used in loss functions.

Fig. 8 Examples of de-identified images D generated by the proposed method for the combinations of \(\mathcal {L}_{\textrm{near}}\) and \(\mathcal {L}_{\textrm{far}}\): (a) \(\mathcal {L}_{\textrm{near}}\), (b) \(\mathcal {L}_{\textrm{far}}\), and (c) \(\mathcal {L}_{\textrm{near}}+\mathcal {L}_{\textrm{far}}\)

Table 2 Evaluation results of image quality and de-identification performance for the combinations of \(\mathcal {L}_{\textrm{near}}\) and \(\mathcal {L}_{\textrm{far}}\): (a) \(\mathcal {L}_{\textrm{near}}\), (b) \(\mathcal {L}_{\textrm{far}}\), and (c) \(\mathcal {L}_{\textrm{near}}+\mathcal {L}_{\textrm{far}}\)

4.5 Types of images to be embedded

For face image de-identification using the proposed method, the features embedded in a face image can also be extracted from images other than face images. To demonstrate that features extracted from face images are suitable for the proposed method, we compare their de-identification performance with that of features extracted from noise images and object images. We train HN to embed features extracted from face images (Face), noise images (Noise), and object images (Object) as shown in Fig. 9, and evaluate the image quality and de-identification performance of the de-identified images. In (Face), a face image is randomly selected from CelebA and used as E in each epoch of training. In (Noise), a noise image generated from random numbers following the standard normal distribution is used as E in each epoch of training. In (Object), one image is randomly selected from 5000 general object images in the Places365 dataset [44] and used as E in each epoch of training. In all cases, \(\mathcal {L}_{\textrm{rec}}\), \(\mathcal {L}_{\textrm{perc}}\), and \(\mathcal {L}_{\textrm{lpips}}\) are used to control image quality with \(\lambda _{\textrm{rec}}=\lambda _{\textrm{perc}}=\lambda _{\textrm{lpips}}=1.0\), where these values are empirically determined. Only \(\mathcal {L}_{\textrm{near}}\) is used to control the de-identification performance, with \(\lambda _{\textrm{near}}=0.30\) for (Face), \(\lambda _{\textrm{near}}=0.26\) for (Noise), and \(\lambda _{\textrm{near}}=0.30\) for (Object), where these values are empirically determined. In this experiment, LFW is used as the dataset for evaluation, and ASR is used as the metric for evaluating de-identification performance. Table 3 shows the results of evaluating the image quality and de-identification performance of the proposed method for different types of E, and Fig. 10 shows the generated de-identified images. (Face) exhibits higher image quality and de-identification performance than (Noise) and (Object). When (Noise) or (Object) is used as E, the de-identified images are unnatural around the eyes. When (Face) is used as E, the de-identified image is natural and preserves the appearance of the cover image. From the above, the proposed method can generate images with high de-identification performance by using a face image as E and embedding its features.

Fig. 9 Examples of different types of E used to find the optimal E in the proposed method

Table 3 Evaluation results of image quality and de-identification performance of the proposed method for different types of E

Fig. 10 Examples of de-identified images generated by the proposed method for different types of E

4.6 Analysis of individuality in de-identified images

In the case that a face image is selected as E, we confirm the individuality of the de-identified image D generated by embedding its features \(f_{\textrm{en}}(E)\) into the cover image C using the genuine pairs of LFW. To analyze the de-identified image D, we use the trained models of HN and EN from Table 2 (c). Let \((G_1,G_2)\) be a genuine pair of LFW, where \(G_1\) is used as the cover image C. We de-identify C using a face image \(E_1\) to be embedded, and generate \(D_1\) using the proposed method. Let \(E_2\) be a different face image of the person in \(E_1\), and we perform face recognition on the pair \((D_1, E_2)\). The above process is applied to the 3000 genuine pairs of LFW, and the de-identification performance is evaluated by ASR. Note that the person in \(E_1\) and \(E_2\) used in this experiment is not included in the genuine pairs of LFW. Figure 11 shows the face images of \(E_1\) and \(E_2\) used in this experiment. Table 4 shows ASRs before and after de-identification. Before de-identification, i.e., when \((G_1, E_2)\) is used for face recognition, ASR is almost 100%, indicating that \(G_1\) and \(E_2\) are recognized as different persons. After de-identification, i.e., when \((D_1, E_2)\) is used for face recognition, ASR is significantly decreased. Although \(D_1\) looks like \(G_1\), the features extracted from \(D_1\) are closer to those from \(E_1\). Therefore, since \(E_1\) and \(E_2\) are the same person, the matching scores between \(D_1\) and \(E_2\) are high, resulting in a lower ASR. In Table 2 (c), the lowest ASR is 77.16%, while in Table 4, ASR is lower than that in all cases, implying that \(D_1\) and \(E_2\) are similar. As a result, the proposed method trains HN so that the facial features of the de-identified image D are close to those of the face image E, and therefore the de-identified image D may be identified as the person in the face image E.

Fig. 11 Face images \(E_1\) to be embedded and \(E_2\) of the same person as \(E_1\) used in the experiment of individuality analysis of de-identified images

Table 4 Evaluation results of ASR [%]\(\uparrow\) before and after de-identification

4.7 Types of face images to be embedded

As mentioned above, we have to take into account the privacy of real persons in face image de-identification. Hence, we consider using a face image of a fake person generated by a generative model such as GAN [13] as the face image E. To evaluate the effectiveness of using a face image of a fake person as E, we compare the image quality and de-identification performance when training HN with a real or a fake face image as the face image E. In this experiment, (Real) is the case where a face image of a real person is used as the face image E, and (Fake) is the case where a face image of a fake person is used. The parameters of the proposed method are \(\lambda _{\textrm{rec}}=\lambda _{\textrm{perc}}=\lambda _{\textrm{lpips}}=1.0\) in both cases, \(\lambda _{\textrm{near}}=\lambda _{\textrm{far}}=0.140\) in (Real), and \(\lambda _{\textrm{near}}=\lambda _{\textrm{far}}=0.125\) in (Fake), where these values are empirically determined. At each epoch of HN training, the face image E is randomly selected from the CelebA dataset in (Real) and from the GFSG2 dataset in (Fake). In the evaluation, the real and fake face images shown in Fig. 12 are used as the face image E, and the de-identification performance is evaluated by ASR.

Table 5 shows the results of evaluating the image quality and de-identification performance of the proposed method for each type of face image E, and Fig. 13 shows the generated de-identified images. From Table 5, (Real) and (Fake) exhibit a similar balance between image quality and de-identification performance. From Fig. 13, both (Real) and (Fake) produce a de-identified image D that preserves the appearance of the cover image C. Therefore, considering the privacy of real persons, it is effective to use a face image of a fake person as the face image E. In the proposed method in the following experiments, a face image of a fake person is used for the face image E.

Fig. 12 Face image E for (Real) and (Fake) used in the evaluation

Table 5 Evaluation results of image quality and de-identification performance of the proposed method for each type of face image E

Fig. 13 Examples of de-identified images generated by the proposed method for each type of face image E

4.8 Comparison of face image de-identification methods

To demonstrate the effectiveness of the proposed method, we compare its performance with that of the de-identification methods using AEs: LGC [7], TIP-IM [8], and Fawkes [9]. LGC uses \(\epsilon =4.0\) as the hyperparameter that controls the magnitude of the perturbation. TIP-IM uses \(\epsilon =12.0\) as the hyperparameter that controls the magnitude of the perturbation, and \(\gamma =0\) as the weight of MMD [20], which is the loss function that controls the appearance. For Fawkes, the mode controlling the magnitude of the perturbation is set to high. The parameters of the proposed method are \(\lambda _{\textrm{rec}}=\lambda _{\textrm{perc}}=\lambda _{\textrm{lpips}}=1.0\) and \(\lambda _{\textrm{near}}=\lambda _{\textrm{far}}=0.125\), where these values are empirically determined. In this experiment, the de-identification performance is evaluated by ASR, Rank-1, and Rank-5.

Table 6 shows the results of image quality and ASR, and Table 7 shows the results of Rank-1 and Rank-5 for each method. Figure 14 shows the relationship between image quality and de-identification performance, Fig. 15 shows the CMC curves, and Fig. 16 shows examples of de-identified images for each method. Note that Fig. 14 shows the results when changing the parameters for LGC and the proposed method: \(\epsilon =[1.0,2.0,3.0,4.0,5.0,6.0,7.0]\) in LGC and \(\mathcal {L}_{\textrm{near}} = \mathcal {L}_{\textrm{far}} = [0.75,1.00,1.25,1.50]\) in the proposed method. From Tables 6 and 7 and Fig. 15, the proposed method has higher ASR and lower Rank-1 and Rank-5 for all face recognition models than the conventional methods, resulting in better de-identification performance. The proposed method also produces de-identified images with high image quality, as indicated by its low LPIPS. Focusing on the LPIPS in Fig. 14, the proposed method exhibits a better balance between image quality and de-identification performance than the conventional methods. From Fig. 16, the conventional methods generate de-identified images that are noisy and have an unnatural appearance, while the proposed method generates de-identified images that are less noisy and look natural regardless of the face pose and skin color.

Table 6 Evaluation results of image quality and ASR of the conventional methods and the proposed method

Table 7 Evaluation results of Rank-1 and Rank-5 of the conventional methods and the proposed method

Fig. 14 Relationship between image quality and de-identification performance for the conventional methods and the proposed method, where the parameters for LGC and the proposed method are changed

Fig. 15 CMC curves for the conventional methods and the proposed method

Fig. 16 Examples of de-identified images generated by the conventional methods and the proposed method

We analyze how each method changes the cover image C for de-identification via the difference between the cover image C and the de-identified image D. Figure 17 shows the result of amplifying the difference between the cover image C and the de-identified image D by a factor of 10 for each method. The de-identified images D generated by the conventional methods are perturbed over the whole face, while the de-identified image generated by the proposed method changes facial parts such as the eyes, nose, and mouth. These are the parts on which face recognition models focus when identifying individuals. Hence, the proposed method can preserve the appearance of the cover image C and de-identify face images against unknown face recognition models.
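A difference map of this kind can be computed with a few lines of NumPy. The sketch below is an assumption about the details: it takes the absolute per-pixel difference of two 8-bit images, multiplies it by the amplification factor, and clips the result to the 8-bit range. Whether the original visualization uses the absolute or the signed difference is not stated, and the function name is hypothetical.

```python
import numpy as np


def amplified_difference(cover, deid, factor=10):
    """Absolute difference between two uint8 images, scaled by `factor` and clipped to [0, 255]."""
    # Cast to a wider signed type so that the subtraction cannot wrap around.
    diff = np.abs(cover.astype(np.int16) - deid.astype(np.int16))
    return np.clip(diff * factor, 0, 255).astype(np.uint8)


# Toy usage with a 1x2 "image": a small change stays visible, a large one saturates.
cover = np.array([[10, 200]], dtype=np.uint8)
deid = np.array([[12, 150]], dtype=np.uint8)
diff_map = amplified_difference(cover, deid)
```

Casting before subtracting matters here because subtracting two `uint8` arrays directly would wrap around for negative differences.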

Fig. 17 Amplified difference between the cover image C and the de-identified image D by a factor of 10 for each method

4.9 Robustness against image processing attacks

We verify whether the effect of de-identification is reduced when image processing is applied to the de-identified images D generated by each method. In this experiment, we use Median Blur [28], Gaussian Blur [29], JPEG Encoding [29], and Bit-depth Reduction [28] as image processing methods applied to the de-identified image D. The kernel size is set to 5 for Median Blur, the kernel size is set to 5 and the standard deviation to 2 for Gaussian Blur, the quality level is set to 10 for JPEG Encoding, and the bit-depth is set to 3 for Bit-depth Reduction. Table 8 and Fig. 18 show the results of ASR for the de-identified images after applying image processing, where “Original” indicates ASR when face images are not de-identified, and “None” indicates ASR when no image processing is applied. In the case of “Original”, i.e., when face images are not de-identified, ASR slightly increases after image processing. When image processing is applied to de-identified images generated by the conventional methods, ASR decreases. In particular, Gaussian Blur is the most effective in reducing the effect of de-identification, since it significantly reduces ASR for all conventional methods. When image processing is applied to the de-identified images generated by the proposed method, ASR is almost unchanged. Therefore, the proposed method is more robust against image processing than the conventional methods. The conventional methods perform de-identification by adding pixel-level perturbations, and thus the effect of de-identification is reduced by image processing. On the other hand, the proposed method performs de-identification by embedding facial features of other persons, and thus the effect of image processing is limited.
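Two of the four operations can be sketched with NumPy alone, using the parameter values given above. This is an illustrative sketch rather than the processing code used in the experiments: the function names are hypothetical, images are assumed to be 2-D grayscale float arrays in [0, 1], the bit-depth reduction follows the rounding form commonly used in feature squeezing, and the border handling of the blur (edge replication) is an assumption. Median Blur and JPEG Encoding are omitted because they are usually taken from an imaging library.

```python
import numpy as np


def reduce_bit_depth(image, bits=3):
    """Quantize a float image in [0, 1] to 2**bits gray levels."""
    levels = 2 ** bits - 1
    return np.rint(image * levels) / levels


def gaussian_kernel_1d(size=5, sigma=2.0):
    """Normalized 1-D Gaussian kernel with the given size and standard deviation."""
    offsets = np.arange(size) - (size - 1) / 2.0
    kernel = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_blur(image, size=5, sigma=2.0):
    """Separable Gaussian blur of a 2-D float image with edge-replicated borders."""
    kernel = gaussian_kernel_1d(size, sigma)
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    # Filter along rows first, then along columns; "valid" restores the input shape.
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


# Toy usage on a synthetic 8x8 gradient image.
image = np.linspace(0.0, 1.0, 64).reshape(8, 8)
quantized = reduce_bit_depth(image, bits=3)
blurred = gaussian_blur(image, size=5, sigma=2.0)
```

Both operations discard fine pixel-level detail, which is consistent with the observation above that perturbation-based de-identification is weakened by such processing.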

Table 8 Results of ASR for the de-identified images after applying image processing, where “Original” indicates ASR when face images are not de-identified, and “None” indicates ASR when no image processing is applied

Fig. 18 Plots of ASR for the de-identified images after applying image processing, where “Original” indicates ASR when face images are not de-identified, and “None” indicates ASR when no image processing is applied

5 Conclusion

We have proposed a face image de-identification method based on feature embedding. The proposed method embeds facial features extracted from a generated face image into the input face image to generate a de-identified image that is not recognized by face recognition models as the person in the input face image, while preserving the appearance of the input face image. Through a set of experiments on the public datasets CelebA [15], GFSG2 [16], and LFW [17], we demonstrated the effectiveness of the proposed method in face image de-identification compared to conventional methods. We have also demonstrated that the de-identified images generated by the proposed method are robust to image processing, unlike those of conventional AE-based methods. We assume that still images are input in this paper. In the future, we will investigate feature embedding methods for video images for the purpose of developing a practical system.

Availability of data and materials

Publicly available datasets were used in the experiments. The CelebA dataset can be found here: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html (accessed on February 26, 2024). The GFSG2 dataset can be found here: https://drive.google.com/drive/folders/1-5oQoEdAecNTFr8zLk5sUUvrEUN4WHXa (accessed on February 26, 2024). The LFW dataset can be found here: https://vis-www.cs.umass.edu/lfw/ (accessed on February 26, 2024). Publicly available codes were used in the experiments. LGC can be found here: https://github.com/ShawnXYang/Face-Robustness-Benchmark (accessed on February 26, 2024). TIP-IM can be found here: https://github.com/ShawnXYang/TIP-IM (accessed on February 26, 2024). Fawkes can be found here: https://github.com/Shawn-Shan/fawkes (accessed on February 26, 2024).

Notes

  1. https://drive.google.com/drive/folders/1-5oQoEdAecNTFr8zLk5sUUvrEUN4WHXa.

  2. https://github.com/Shawn-Shan/fawkes.

  3. https://github.com/nizhib/pytorch-insightface/blob/main/insightface/iresnet.py.

Abbreviations

AEs: Adversarial examples

ASR: Attack success ratio

CelebA: Large-scale CelebFaces attributes

CMC: Cumulative match characteristic

CNN: Convolutional neural network

EN: Extracting network

GAN: Generative adversarial network

GFSG2: Generated faces by StyleGAN2

HN: Hiding network

LFW: Labeled faces in the wild

LGC: Landmark-guided cutout

LPIPS: Learned perceptual image patch similarity

MIM: Momentum iterative method

MMD: Maximum mean discrepancy

MSE: Mean squared error

PSNR: Peak signal-to-noise ratio

SNS: Social networking services

SSIM: Structural similarity index measure

TIP-IM: Targeted identity-protection iterative method

References

  1. S.Z. Li, A.K. Jain, Handbook of Face Recognition (Springer, Berlin, 2011)

  2. V. Mirjalili, A. Ross, What else does your biometric data reveal? A survey on soft biometrics. IEEE Trans. Inf. Forensics Secur. 11(3), 441–467 (2016)

  3. B. Meden, P. Rot, P. Terhörst, N. Damer, A. Kuijper, W.J. Scheirer, A. Ross, P. Peer, V. Štruc, Privacy-enhancing face biometrics: a comprehensive survey. IEEE Trans. Inf. Forensics Secur. 16, 4147–4183 (2021)

  4. S. Ribaric, A. Ariyaeeinia, N. Pavesic, Deidentification for privacy protection in multimedia content: a survey. Sig. Process. Image Commun. 47, 131–151 (2016)

  5. E.M. Newton, L. Sweeney, B. Malin, Preserving privacy by de-identifying face images. IEEE Trans. Knowl. Data Eng. 17, 232–243 (2005)

  6. M. Boyle, C. Edwards, S. Greenberg, The effects of filtered video on awareness and privacy. In: Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, pp. 1–10 (2000)

  7. X. Yang, D. Yang, Y. Dong, H. Su, W. Yu, J. Zhu, RobFR: benchmarking adversarial robustness on face recognition. CoRR arXiv:abs/2007.04118, pp. 1–28 (2020)

  8. X. Yang, Y. Dong, T. Pang, H. Su, J. Zhu, Y. Chen, H. Xue, Towards face encryption by generating adversarial identity masks. In: Proceedings of International Conference on Computer Vision, pp. 3897–3907 (2021)

  9. S. Shan, E. Wenger, J. Zhang, H. Li, H. Zheng, B.Y. Zhao, Fawkes: protecting privacy against unauthorized deep learning models. In: Proceedings of the 29th USENIX Security Symposium, pp. 1589–1604 (2020)

  10. M.H. Khojasteh, N.M. Farid, A. Nickabadi, GMFIM: a generative mask-guided facial image manipulation model for privacy preservation. Comput. Graph. 112, 81–91 (2023)

  11. H. Uchida, N. Abe, S. Yamada, DeDiM: de-identification using a diffusion model. In: Proceedings of International Conference on Biometrics Special Interest Group, pp. 1–8 (2022)

  12. I. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples. In: Proceedings of International Conference on Learning Representations, pp. 1–11 (2015)

  13. I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks. In: Proceedings of IEEE Conference on Neural Information Processing Systems, pp. 2672–2680 (2014)

  14. J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models. In: Proceedings of IEEE Conference on Neural Information Processing Systems, pp. 6840–6851 (2020)

  15. Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild. In: Proceedings of International Conference on Computer Vision, pp. 3730–3738 (2015)

  16. T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, T. Aila, Analyzing and improving the image quality of StyleGAN. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 8110–8119 (2020)

  17. G.B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report 07–49, University of Massachusetts, Amherst (2007)

  18. G. Hanawa, K. Ito, T. Aoki, Face image de-identification based on feature embedding for privacy protection. In: Proceedings of International Conference on Biometrics Special Interest Group, pp. 1–6 (2023)

  19. Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, J. Li, Boosting adversarial attacks with momentum. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 9185–9193 (2018)

  20. K.M. Borgwardt, A. Gretton, M.J. Rasch, H.P. Kriegel, B. Scholkopf, A.J. Smola, Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22, 49–57 (2006)

  21. Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)

  22. T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative adversarial networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410 (2019)

  23. A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, Towards deep learning models resistant to adversarial attacks. In: Proceedings of International Conference on Learning Representations, pp. 1–23 (2018)

  24. D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, A. Madry, Robustness may be at odds with accuracy. In: Proceedings of International Conference on Learning Representations, pp. 1–23 (2019)

  25. F.V. Massoli, F. Carrara, G. Amato, F. Falchi, Detection of face recognition adversarial attacks. Comput. Vis. Image Understand. 202, 1–11 (2021)

  26. A. Goel, A. Singh, A. Agarwal, M. Vatsa, R. Singh, SmartBox: benchmarking adversarial detection and mitigation algorithms for face recognition. In: Proceedings of International Conference on Biometrics Theory, Applications and Systems, pp. 1–7 (2018)

  27. A. Agarwal, R. Singh, M. Vatsa, N. Ratha, Are image-agnostic universal adversarial perturbations for face recognition difficult to detect? In: Proceedings of International Conference on Biometrics Theory, Applications and Systems, pp. 1–7 (2018)

  28. W. Xu, D. Evans, Y. Qi, Feature squeezing: detecting adversarial examples in deep neural networks. CoRR arXiv:abs/1704.01155, pp. 1–15 (2017)

  29. A. Kurakin, I.J. Goodfellow, S. Bengio, Adversarial examples in the physical world. In: Proceedings of International Conference on Learning Representations, pp. 1–14 (2017)

  30. S. Baluja, Hiding images in plain sight: deep steganography. Proc. Adv. Neural Inf. Process. Syst. 30, 2069–2079 (2017)

  31. K. Ito, T. Kozu, H. Kawai, G. Hanawa, T. Aoki, Cancelable face recognition using deep steganography. IEEE Trans. Biometr. Behav. Identity Sci. 6, 87–102 (2023)

  32. O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation. In: Proceedings of International Conference on Medical Image Computing and Computer Assisted Intervention, pp. 234–241 (2015)

  33. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  34. J. Johnson, A. Alahi, F. Li, Perceptual losses for real-time style transfer and super-resolution. In: Proceedings of European Conference on Computer Vision, pp. 694–711 (2016)

  35. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition. CoRR arXiv:abs/1409.1556, pp. 1–14 (2015)

  36. J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)

  37. R. Zhang, P. Isola, A.A. Efros, E. Shechtman, O. Wang, The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595 (2018)

  38. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks. Proc. Adv. Neural Inf. Process. Syst. 25, 1106–1114 (2012)

  39. D. Kingma, J. Ba, Adam: a method for stochastic optimization. In: Proceedings of International Conference on Learning Representations, pp. 1–15 (2015)

  40. J. Deng, J. Guo, S. Zafeiriou, ArcFace: additive angular margin loss for deep face recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 4685–4694 (2019)

  41. C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi, Inception-v4, inception-ResNet and the impact of residual connections on learning. Proc. AAAI Conf. Artif. Intell. 31, 4278–4284 (2017)

  42. F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823 (2015)

  43. H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou, W. Liu, CosFace: large margin cosine loss for deep face recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274 (2018)

  44. B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, A. Torralba, Places: a 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40(6), 1452–1464 (2017)

Acknowledgements

No additional acknowledgments.

Funding

This work was supported, in part, by JSPS KAKENHI Grant Numbers 21H03457 and 23H00463.

Author information

Contributions

Funding acquisition, K.I. and T.A; methodology, G.H.; supervision, K.I. and T.A.; writing—original, G.H.; writing—review and editing, K.I. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Koichi Ito.

Ethics declarations

Competing interests

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Hanawa, G., Ito, K. & Aoki, T. Face image de-identification based on feature embedding. J Image Video Proc. 2024, 25 (2024). https://doi.org/10.1186/s13640-024-00646-z

  • DOI: https://doi.org/10.1186/s13640-024-00646-z

Keywords