
Subjective performance assessment protocol for visual explanations-based face verification explainability

Abstract

The integration of Face Verification (FV) systems into multiple critical moments of daily life has become increasingly prevalent, raising concerns regarding the transparency and reliability of these systems. Consequently, there is a growing need for FV explainability tools to provide insights into the behavior of these systems. FV explainability tools that generate visual explanations, e.g., saliency maps, heatmaps, contour-based visualization maps, and face segmentation maps, show promise in enhancing FV transparency by highlighting the contributions of different face regions to the FV decision-making process. However, evaluating the performance of such explainability tools remains challenging due to the lack of standardized assessment metrics and protocols. In this context, this paper proposes a subjective performance assessment protocol for evaluating the explainability performance of visual explanation-based FV explainability tools through pairwise comparisons of their explanation outputs. The proposed protocol encompasses a set of key specifications designed to efficiently collect the subjects’ preferences and estimate explainability performance scores, facilitating the relative assessment of the explainability tools. This protocol aims to address the current gap in evaluating the effectiveness of visual explanation-based FV explainability tools, providing a structured approach for assessing their performance and comparing them with alternative tools. The proposed protocol is exercised and validated through an experiment conducted using two distinct heatmap-based FV explainability tools, notably FV-RISE and CorrRISE, taken as examples of visual explanation-based explainability tools, considering the various types of FV decisions, i.e., True Acceptance (TA), False Acceptance (FA), True Rejection (TR), and False Rejection (FR). A group of subjects varying in age, gender, and ethnicity was tasked with expressing their preferences regarding the heatmap-based explanations generated by the two selected explainability tools. The subjects’ preferences were collected and statistically processed to derive quantifiable scores, expressing the relative explainability performance of the assessed tools. The experimental results revealed that both assessed explainability tools exhibit comparable explainability performance for FA, TR, and FR decisions, with CorrRISE performing slightly better than FV-RISE for TA decisions.

1 Introduction

In the digital age, identity verification has evolved into a fundamental and indispensable operational capability, playing a pivotal role in ensuring security and fostering trust for various applications that have become central to our daily lives. As this technology advances, the need for more robust identity verification mechanisms has become more pressing, encompassing a wider range of daily activities, such as border control, law enforcement, and social interactions [1]. In response to this imperative, biometrics, and more specifically face recognition, is being harnessed to establish user-friendly, robust, and secure authentication mechanisms. In contrast to conventional personal authentication systems, e.g., PIN codes, ID cards, and passwords, face recognition focuses on identifying individuals based on their verifiable and unique facial characteristics, thus raising the security level [2].

The impressive technical advances and performance improvements reached using Deep Learning (DL) technologies for Face Verification (FV), a specific type of face recognition where an individual's identity is verified/validated using his/her face, the so-called probe image, against some pre-collected faces from the same individual, the so-called gallery images, have boosted the interest in the deployment of FV solutions [3]. However, FV performance gains are commonly associated with higher DL modeling complexity, a rather limited understanding of the decision-making process, and potential age, gender, and ethnic biases [4,5,6], which may be difficult to accept in some application domains. To address these limitations, explainability tools have been developed to elucidate and enhance the understanding of the decision-making mechanisms in DL-based FV systems [7,8,9]. These explainability tools are generally categorized into two key types, notably ante hoc and post hoc explainability. Ante hoc explainability involves integrating explainability directly into FV systems during their development, revealing the reasoning behind their decisions. On the other hand, post hoc explainability focuses on interpreting the decisions of already trained FV systems; this type of explainability does not alter the original FV system but instead uses various techniques to ensure more transparency on how decisions are made. Regardless of whether ante hoc or post hoc explainability is used, these tools play a crucial role, particularly in critical FV application domains, by highlighting the face features/regions that contribute most significantly to the FV system's decisions. In doing so, the explainability tools facilitate a more transparent and accountable integration of DL-based FV systems into multiple real-world scenarios and foster their societal acceptance [10, 11].

Most FV explainability research focuses on developing and improving explainability tools. For ante hoc explainability, optimizing the trade-off between the FV performance and the associated explainability power is critical. Often, these two capabilities need to be balanced since more interpretability may come at the cost of less FV performance [12]. With the increasing number of explainability tools available, a pressing requirement emerges for explainability performance assessment methodologies, to evaluate the power and usefulness of the explanations provided and determine whether and to what extent the offered explanations achieve the targeted objectives [13]. In addition, FV explainability performance assessment can be used to compare available FV explainability tools, DL based or not, notably to identify those providing the most appropriate and useful explanations. The explainability assessment goal is, therefore, to ascertain the extent to which the provided explanations align with the targeted objectives, offering decision transparency and insights into the complex workings of DL-based FV systems.

Although there are currently no commonly agreed solutions for FV explainability performance assessment, a few authors have attempted to formulate explainability performance assessment metrics for the specific FV explainability tools proposed by themselves [4, 9, 14,15,16]. The relatively unexplored landscape of FV explainability performance assessment can be largely attributed to the diversity of explainability tools proposed, such as face features relevance [17, 18] and saliency/heatmaps visualization [8, 9, 19], making the comparison of different strategies to generate FV explanations more difficult. Often, these comparisons are performed for explainability tools adopting the same approach, e.g., heatmap based [9, 19], and not for explainability tools in general.

There are two main categories of performance assessment techniques that can be applied to FV explainability [20], notably objective evaluation, which includes quantitative measures associated with objective metrics and automated approaches to perform the explainability performance assessment, and subjective evaluation, which includes protocols evaluating the explainability performance with human-in-the-loop approaches by involving end-users and collecting their feedback and judgment; this type of evaluation is also known as human-centered evaluation.

Both types of performance assessment techniques offer distinct advantages and disadvantages, depending on the context and what has to be assessed. Objective evaluation relies on quantifiable metrics, which enable consistent and reproducible assessment across various systems and contexts. This approach minimizes human biases by depending solely on measurable data and can often be automated, saving time and reducing the need for extensive human involvement in the evaluation process [21]. However, objective evaluation may oversimplify complex explainability aspects by reducing them to mere numerical values, potentially missing critical qualitative factors [22]. Additionally, interpreting objective evaluation results often requires expertise, complicating the process for non-experts and limiting its practical applicability in real-world scenarios [13]. On the other hand, subjective evaluation focuses on the human perspective, enabling user-centric assessments that directly capture how generated explanations align with human reasoning. This type of evaluation ensures contextual relevance, as it can be tailored to the specific needs of user groups and application domains. It also yields rich qualitative insights that objective evaluation often misses, providing a deeper understanding of user experiences and perceptions [21, 23, 24]. However, subjective evaluation is susceptible to human biases and variability, which might affect the reliability and consistency of the results. Additionally, conducting subjective evaluations can be time-consuming and resource-intensive, particularly for large-scale studies [25, 26].

The existing body of research on FV explainability performance assessment has only focused on objective evaluation techniques. There has been a conspicuous absence of subjective evaluation techniques. Yet, subjective evaluation has the advantage of including the end-users in the evaluation procedure, thus making it possible to check whether FV systems are making decisions in a manner similar to what human subjects would do, accepting that humans may play a reference role. As such, and even if it can be valuable to use explainability tools providing a visualization of the behavior and clues utilized by the FV system, independently of being aligned or not with human expectations, a subjective evaluation strategy allows assessing how much the generated FV explanations are aligned with human interpretation, providing a reasonable methodology for comparing the performance of multiple FV explainability tools. This methodology not only enhances the understanding of how FV systems may have arrived at their decisions but also ensures that these systems meet the practical needs of their end-users, bridging the gap between DL-based FV systems and human cognition in the decision-making processes. Consequently, incorporating subjective evaluation into FV explainability performance assessment can lead to the development of more interpretable and acceptable FV systems.

In this context, this paper proposes the first subjective, human-centered explainability performance assessment protocol for FV explainability tools generating visual explanations; this type of assessment is especially critical for DL-based FV solutions but may also be used for non-DL-based FV solutions and even to compare the performance of these two alternative approaches. A visual explanation is here defined as a two-dimensional graphical representation which serves to visualize the regions from an input face image contributing more or less to the FV decision. These visual explanations may encompass various forms, notably saliency [27, 28], heat [9, 29], contour-based visualization [8], and face segmentation maps [30, 31], among others. The proposed protocol, coined as the Face Verification eXplainability Performance Assessment Protocol (FVX-PAP), evaluates the relative explainability performance of two or more FV explainability tools by comparing their output visual explanations in pairs, thus adopting a pairwise comparison assessment protocol [32]. The subjects' preferences are collected under an appropriate set-up and subjected to statistical processing to derive quantifiable scores that express the relative explainability performance of the evaluated FV explainability tools. To the best of the authors’ knowledge, the proposed FVX-PAP protocol is the first of its kind for FV explainability performance assessment. This protocol can be used to rather broadly assess the explainability performance of explainability tools generating visual explanations, even if using rather different formats, ensuring comprehensive evaluation across different tool types. In this paper, the proposed FVX-PAP protocol is adopted for a proof-of-concept experiment targeting the assessment of two post hoc heatmap-based FV explainability tools, as an example of its potential usage, notably FV-based Randomized Input Sampling for Explanation (FV-RISE) [9] and Correlation-based Randomized Input Sampling for Explanation (CorrRISE) [16]. The experimental results revealed that both assessed explainability tools exhibit comparable explainability performance for False Acceptance (FA), True Rejection (TR), and False Rejection (FR) decisions, with CorrRISE performing slightly better than FV-RISE for True Acceptance (TA) decisions.

The remainder of this paper is organized as follows: Sect. 2 reviews the current state-of-the-art on FV explainability performance assessment. Section 3 presents a comprehensive description of the proposed FVX-PAP protocol. Section 4 addresses the detailed specification of a subjective assessment experiment performed using the proposed protocol to evaluate two heatmap-based FV explainability tools, including the FV explainability tools to be assessed. Section 5 reports the obtained subjective experiment results and the associated processing and analysis. Finally, Sect. 6 concludes the paper and discusses directions and perspectives for future work.

2 Background work

In recent years, there has been a growing interest in the use of explainability tools generating visual explanations to explain FV systems’ decision-making. Despite the notable advances achieved by such explainability tools in enhancing the transparency and interpretability of FV systems, a critical gap persists in the evaluation of their performance. Unfortunately, the assessment of these tools has often been overlooked, leading to inherent limitations in the validation of their efficacy. The current research on visual explanation-based FV explainability performance assessment adopts objective evaluation approaches, with only a handful of works dedicating efforts to assess the performance of their proposed FV explainability tools.

In [9], the recall metric, defined as the ratio between the number of true acceptances and the sum of the numbers of true acceptances and false rejections, is used to assess the efficacy of the FV-RISE tool through its variation when the probe face images are manipulated through deletion and insertion processes used with different percentages. The deletion and insertion processes entail sequentially manipulating the probe face images by either removing, in the deletion process, or adding, in the insertion process, a growing percentage of pixels in the probe face image, according to the heatmap pixels' relevance, and measuring the resulting impact on the FV system’s recall. An accurate heatmap is expected to highlight the most FV decision-making relevant face regions in a compact way. In this context, the faster the FV system’s recall drops/rises with the percentage used for the deletion and insertion processes, respectively, the more accurate the heatmap is in terms of explainability power. The evaluation experiment was performed on a subset of the Labeled Faces in the Wild (LFW) dataset [33] to assess the proposed explainability tool performance in comparison with four alternative relevant explainability tools. Analogous deletion and insertion processes were used in [16] to assess the explainability performance of the CorrRISE explainability tool by examining the FV system’s accuracy changes after modifying the input image according to the importance “saliency” maps, similar to heatmaps, generated by this tool. Since CorrRISE generates saliency maps for probe and gallery face images, the deletion and insertion processes are applied to both face images and the similarity score for the new probe–gallery pair is computed at various percentages of the pixels removed/added in both the probe and gallery face images. In [34], the variation of the FV system’s recall and true negative rate metrics is examined through the same deletion process, on the probe face images according to the importance of the generated heatmaps, to assess a vision transformers-based FV explainability tool in comparison with state-of-the-art tools.
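As a concrete illustration of the deletion process, the sketch below removes an increasing fraction of the most relevant probe pixels, as ranked by the heatmap, and records how a similarity score evolves; the function and the `similarity_fn` callable are illustrative assumptions and not part of the cited works, where the recall or accuracy measured over a whole dataset plays the role of this per-pair score.

```python
import numpy as np

def deletion_curve(probe, gallery, heatmap, similarity_fn, steps=10):
    """Similarity vs. fraction of the most relevant probe pixels deleted.

    probe, gallery : H x W x 3 float arrays (face images)
    heatmap        : H x W float array of pixel relevance
    similarity_fn  : callable(probe, gallery) -> similarity score (assumed given)
    """
    h, w = heatmap.shape
    order = np.argsort(heatmap.ravel())[::-1]          # most relevant pixels first
    curve = []
    for step in range(steps + 1):
        n_removed = int(round(step / steps * h * w))
        manipulated = probe.copy()
        rows, cols = np.unravel_index(order[:n_removed], (h, w))
        manipulated[rows, cols, :] = 0.0               # deletion: zero out pixels
        curve.append((n_removed / (h * w), similarity_fn(manipulated, gallery)))
    return curve  # a faster drop suggests a more accurate heatmap
```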

In [27], the explainability performance of six explainability tools is assessed and compared by using as metric the variation of the percentage of face images correctly classified through a hiding game process. More specifically, the process entails iteratively obscuring a larger percentage of the least important pixels, sorted according to the saliency map generated, in the face image, and then performing face recognition with the obscured face image. The performance evaluation was conducted using a subset of 1000 face images collected from the VGG-face dataset [35] and the percentage of face images correctly classified was measured for each percentage of least important pixels obscured. Ideally, the more accurate the saliency map, the slower the percentage of face images correctly classified drops, as only the most critical pixels are maintained. Rather than straightforwardly obscuring the least important pixels, the authors proposed to blur the pixels with a Gaussian kernel, a technique often considered intuitive for representing missing information. In [28], the variation of the FV system’s accuracy was measured through the same hiding game process to compare a proposed saliency map-based explainability tool with alternative, state-of-the-art tools, notably GradCAM++ [36], Gradient [37], xFace [7], and the tool proposed by Stylianou et al. [38]. The experimental results conducted on three different face datasets, i.e., LFW [33], AgeDB-30 [39], and Celebrities in Frontal-Profile (CFP) [40], show that the saliency maps produced by the proposed explainability tool are the most accurate for the three datasets when different percentages of least important pixels are blurred out.

While objective assessment of FV explainability tools is essential for understanding the decision-making of FV systems, particularly for measuring how accurately and consistently the explanations reflect the underlying system's processes and decisions, it should be complemented by subjective evaluation, which examines how well the explanations align with human reasoning and whether they are easily understood by human users. This aspect captures the intuitiveness and interpretability of the explanations from a human perspective, reflecting the users' perception and acceptance. This dual approach ensures a comprehensive evaluation of FV explainability tools, balancing human-centric interpretability with objective performance. In this context, this paper proposes the first subjective performance assessment protocol to evaluate FV explainability tools producing visual explanations.

3 Proposed face verification explainability performance assessment protocol

The proposed FVX-PAP protocol offers an innovative approach to assess the performance of FV explainability tools producing visual explanations. A subjective pairwise comparison methodology [32] is adopted, allowing feedback to be gathered on the comparative accuracy and effectiveness of the visual explanations generated by alternative FV explainability tools. The gathered pairwise comparison subjects’ preferences are subjected to statistical processing to quantitatively estimate the explainability performance scores and assess the relative performance of the assessed tools. Since FVX-PAP is a pairwise comparison-based assessment protocol, at least two FV explainability tools producing some type of visual explanations are required to perform the assessment.

The protocol may be used for any FV explainability tool that generates visual explanations to explain the decision-making process. Figure 1 presents illustrative instances of several potential visual explanations. More specifically, Fig. 1a illustrates a face heatmap, which is a combination of a heatmap and the corresponding face luminance. In this context, a heatmap is defined as a color image in the “Hue, Saturation, Value” (HSV) color space where the Hue wheel of colors is used to represent how much a given pixel contributes to explain the FV decision. The color variation in the heatmap gives an intuitive visual representation of the pixels’ importance for the performed FV task and associated decision. Figure 1b illustrates a contour-based visualization map wherein the face image itself is combined with the heatmap contours to avoid changing the face details. A contour is here defined as a continuous line corresponding to a heatmap boundary with the same heatmap value. Figure 1c illustrates a face segmentation map where the pixels contributing to a FV decision are identified. Finally, Fig. 1d illustrates a similarity and dissimilarity map used to highlight the similar and dissimilar face regions contributing to a FV decision using two distinct colors, e.g., pink-colored face regions indicate similarity regions, while blue-colored face regions indicate dissimilarity regions. While the proposed FVX-PAP protocol is primarily designed for pairwise comparisons of tools producing the same or similar type of visual explanations, it can be used to compare tools with different visual explanations, such as heatmaps and contour-based visualization maps. This extension involves ensuring that the comparative criteria are appropriately defined and that subjects are well informed about the nature of the visual explanations being compared. Thus, FVX-PAP is flexible and can accommodate a variety of visual explanation types for a comprehensive assessment.

Fig. 1

Examples of different types of FV visual explanations: a heatmap; b contour-based visualization map; c segmentation map; d similarity and dissimilarity map

While the proposed FVX-PAP protocol may be used for any type of visual explanations, this paper will report as proof-of-concept an experiment where FVX-PAP is exercised by assessing FV explainability tools generating heatmaps as visual explanations. The heatmaps generated by the selected FV explainability tools are converted into face heatmaps, by combining the heatmap explanations with the corresponding face luminance; see this conversion at the “Explainability” segment of Fig. 2. The conversion to face heatmaps is critical to help the test subjects more easily understand the correspondence between the heatmap and the face positions.
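A minimal sketch of this conversion is given below, assuming the heatmap is a relevance map normalized to [0, 1]; the Hue channel encodes relevance while the Value channel carries the face luminance, so the heatmap colors stay registered with the facial features. The exact color mapping used by the assessed tools may differ.

```python
import numpy as np
from matplotlib.colors import hsv_to_rgb

def face_heatmap(face_gray, relevance):
    """Combine a relevance map with the face luminance in HSV space.

    face_gray : H x W float array in [0, 1] (face luminance)
    relevance : H x W float array in [0, 1] (pixel contribution to the decision)
    Returns an H x W x 3 RGB image where color encodes relevance and brightness
    preserves the facial details.
    """
    hue = (1.0 - relevance) * 2.0 / 3.0   # map relevance onto the Hue wheel: blue (low) -> red (high)
    sat = np.full_like(relevance, 0.8)    # keep the colors clearly visible
    val = face_gray                       # face luminance preserves facial detail
    return hsv_to_rgb(np.dstack([hue, sat, val]))
```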

Fig. 2

FVX-PAP protocol workflow (using heatmaps as examples of visual explanations)

The proposed assessment protocol may be used only for acceptance or rejection FV decision types, or for both types, where an acceptance corresponds to a positive validation of the face, i.e., the claimed identity is accepted, and a rejection corresponds to a negative validation of the face, i.e., the claimed identity is not accepted.

To ensure the robustness of the FVX-PAP protocol and the reliability and reproducibility of the results, the proposed protocol is defined in detail for the multiple aspects that are critical for subjective assessment. For better understanding, Fig. 2 shows the workflow of the proposed FVX-PAP protocol, and the subsequent subsections will describe in detail the main steps involved. The workflow in the "Explainability" segment in Fig. 2 may vary depending on the type of visual explanation generated; in the figure, heatmaps are used just as an example and this will happen again later when illustrating components of the proposed protocol, e.g., the graphical user interfaces.

3.1 Pairwise comparison-based subjective assessment

The subjective assessment is conducted through a pairwise comparison, where visual explanations generated by two different FV explainability tools are presented side by side; even if more tools are part of an assessment experiment, they shall always be compared in pairs. Then, a group of subjects is presented with a relevant selection of visual explanation pairs and given the task of expressing their preference regarding the utility of the two visual explanations to explain the obtained FV decision. The expressed subjects’ preference is largely influenced by how well the visual explanations align with their understanding and expectations (see “Pairwise comparison-based subjective assessment” segment of Fig. 2).

The choice of a pairwise comparison subjective evaluation methodology [32] is largely motivated by the unavailability of a ground truth for the visual explanations, which prevents the adoption of a double stimulus method [32] where the visual explanations under assessment are compared with a reference, and by the difficulty for the subjects of making an absolute assessment of the visual explanations, which prevents the adoption of a single stimulus method [32] without any comparison. Using the pairwise comparison methodology enhances the reliability of the subjective test due to the simplicity of this ranking task, coupled with the efficacy of the side-by-side comparison strategy. Moreover, instead of adopting any discrete or continuous scaling, the pairwise comparison simplifies the subjective evaluation process by allowing subjects to express their preferences using a binary or ternary ranking scale, i.e., “better A,” “better B,” or “A equivalent to B.” The adoption of the ternary ranking scale removes the pressure on the subjects to absolutely select a better visual explanation if no clear distinction exists between the visual explanations, e.g., both visual explanations are equally good or bad. Moreover, there is no time limit for the subjects to compare the visual explanations considering the potential difficulty of the task. Overall, the adopted subjective assessment protocol is rather intuitive and simple for a task that is many times not easy, and thus, it is expected to lead to more accurate and reliable results and conclusions.

3.2 Subject selection

The panel of subjects to be recruited for performing the subjective assessment in the context of the FVX-PAP protocol needs to satisfy a set of key requirements for the collected ranks to be meaningful and statistically representative [41, 42]:

  • Subjects shall ideally be selected from the general population, thus non-experts.

  • Subjects shall ideally include a variety in age (with preference between 18 and 65 years old), gender, and ethnicity.

  • Subjects shall have normal visual acuity; corrective lenses may be used, if needed, either glasses or contact lenses, but should not have multiple focal lengths [41, 42].

  • Subjects shall demonstrate normal color vision, e.g., by passing an Ishihara plates test [43] or equivalent.

  • Subjects shall not include evaluators who prepared and categorized the face heatmaps or instructed the subjects for the experiments being conducted.

  • A minimum of 15 subjects is commonly required for the statistical significance of the final conclusions. Since the results from some subjects may be disregarded if these subjects are classified as outliers, it is advisable to recruit more than 15 subjects [42]. More generally, the number of subjects to recruit can be conditioned by the panel's demographic characteristics and the need to generalize to a larger population.

3.3 Environment conditions

To ensure the reliability and reproducibility of results and conclusions, a set of environmental conditions are recommended [41, 42]:

  • Different environments can be adopted to perform the FVX-PAP protocol experiments, such as laboratory or home.

  • The environment's walls and ceilings do not require a specific color but shall not cause distractions, e.g., through textures or reflections, that may affect the subject’s vision.

  • No critical lighting conditions are required during the experiments; however, appropriate lighting conditions, approximating daytime lighting, are desirable to help the subject distinguish the colors in the visual explanations.

  • Subjects shall be seated in a comfortable chair to ensure they are as comfortable as possible during the experiment.

  • Subjects shall be seated in a central position with respect to the display area where the visual explanations are presented.

  • The viewing distance, which may vary according to the subject’s visual acuity, is not controlled during the experiment. Thus, the subject can choose a suitable viewing distance to better visualize the visual explanations and may change it during the experiment.

  • The display screen must be large enough to provide high discriminative power when comparing the visual explanations.

3.4 Graphical user interface

A custom-built software application shall be used to perform the assessment using the FVX-PAP protocol. The display characteristics to run the assessment experiment shall be specified and communicated to the subject. Specifically, to enable a correct visualization of the several graphical user interface components, the subject shall be instructed to both calibrate the display resolution and adjust the scaling of the monitor.

The graphical user interface must display all relevant components to efficiently perform the subjective assessment. Two matching but slightly different graphical user interface layouts are used for the training and test sessions. The graphical user interface for the test session shall include the following components, see Fig. 3 using heatmap explanations as example:

  1. 1)

    Probe–gallery face pair—The original probe and gallery face images used for the FV process are displayed side by side at the top left of the graphical user interface.

  2. 2)

    Visual explanation pair—The pair of visual explanations obtained with two different explainability tools is displayed at the right side of the graphical user interface and near each other to allow an easier comparative assessment, clearly identified as “Explainability Map A” and “Explainability Map B.”

  3. 3)

    Task request text—A question positioned at the bottom left of the graphical user interface, asking the subject to express the relative preference/rank for the displayed visual explanations according to the probe–gallery face pair and the type of FV decision made, e.g., acceptance or rejection.

  4. 4)

    Ranking buttons—Three buttons are positioned at the bottom of the right side of the graphical user interface, to allow the subject to express his/her relative preference for the visual explanation that best describes the type of FV decision considered. This preference is expressed using a ternary ranking scale: “A better than B,” “A and B equivalent,” or “B better than A,” to keep the subject’s task as simple as possible.

Fig. 3

Test session interface example according to the FVX-PAP protocol (using heatmaps as examples of visual explanations)

The graphical user interface for the training session exposes the relevant components, mimicking the graphical user interface layout of the test session, except for the “Task request text” component. More specifically, instead of requesting the subject to express the relative preference, a descriptive text is used to elucidate why one visual explanation may be favored over the other according to the type of FV decision being made and the highlighted face regions. The main aim of such descriptive text is to provide guidance on how the subjects are expected to analyze and rank the visual explanations. Figure 4 illustrates an example of the training session subject interface using heatmap explanations as example.

Fig. 4

Training session interface example according to the FVX-PAP protocol (using heatmaps as examples of visual explanations)

To better visualize the visual explanations to be compared, the graphical user interface shall use the entire display screen. In addition, since some visual explanations may have a low resolution, they may be spatially up-sampled or super-resolved to a higher resolution, always using the same spatial filter for the visual explanations of all explainability tools under assessment. The graphical user interface background shall be 50% gray (corresponding to Y = U = V = 128) to reduce eye distraction.

3.5 Training session

Before starting the FVX-PAP test session, a list of written instructions shall be given to the subject to ensure the full understanding of the task to be performed, and to make sure all subjects received the same information regarding the test. The written instructions shall describe the type of the assessment, i.e., the task, explain what the subject is going to see and evaluate, and how to express the relative preference/rank. The subjects should be given enough time to quietly read the instructions.

Once the instructions are read and understood, the subjects are presented with a training session to get familiar with the subjective assessment task, the application’s graphical user interface, the visual explanations, the ranking options, and the expected ranking with an explanation; the suggested rankings and explanations are important to harmonize the subjects’ ranking criteria since the task is new for them and not that easy. In this way, the subjects become acquainted with the ranking procedure and the kind of visual explanations to be pairwise compared.

The training session shall be representative of the FVX-PAP protocol experiment to follow, with at least three trials, where a trial is here defined as a singular ranking moment following the display on the screen of a probe–gallery pair and the corresponding visual explanations to be compared. A trial is considered complete only when the subject has expressed his/her preference/rank; these three trials must correspond to the three ranking possibilities, i.e., “A better than B,” “B better than A,” and “A and B equivalent.” Naturally, the training session probe–gallery pairs are not used for the test session and the corresponding ranks are not considered for the final statistical analysis.

3.6 Test session

After the training session is completed, a test session is run to collect the pairwise comparison ranks for the selected pairs of visual explanations. The test session includes four key phases:

  1. 1)

    Declaration of consent—At the start of the test session, a digital consent form is presented, asking the subject to agree with the collection and storage of the personal data provided.

  2. 2)

    Subject registration—The subject registers in the FVX-PAP application providing name, email address, age, and gender; this information will be only used for statistical purposes.

  3. 3)

    Sequence of trials—A test session includes a sequence of trials, in which the subject is presented with information following the layout recommended in Fig. 3 and asked to rank the two visual explanations according to the type of FV decision (acceptance or rejection), the displayed probe and gallery images, and the alternative visual explanations. After a rank is inserted, the graphical user interface shows the next trial. During the test session trials, a set of key requirements shall be considered:

    1. a)

      The presentation order of the selected trials shall be randomized.

    2. b)

      Within each trial, the presentation position (top or bottom) for each visual explanation in the pair shall be randomized.

    3. c)

      There is no minimum or maximum time for a subject to rank each trial.

    4. d)

      Re-ranking a trial after a rank has been inserted is not allowed.

    5. e)

      A fixed number of repeated trials are included in the test session for subject outlier detection.

  4. 4)

    Session closing—Once all trials in the session are completed, the subject is asked to exit the session and his/her rankings are stored in the application’s database.

If it includes many trials, the FVX-PAP experiment may have to be divided into multiple sub-experiments or test sessions, on the same or different days, to avoid subject fatigue. In such a case, it is preferred that the same subjects are used for all the test sessions to ensure that all pairs of visual explanations are assessed by the same subjects, although this is not strictly required.
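As an illustration of requirements a), b), and e) above, the sketch below assembles a randomized trial list including repeated and swapped control trials; the data layout and function names are illustrative assumptions and not part of the actual FVX-PAP application.

```python
import random

def build_session(trials, n_repeated=4, n_swapped=4, seed=None):
    """Assemble a randomized test session from a list of base trials.

    Each trial is a dict: {"pair_id": ..., "map_A": ..., "map_B": ...}.
    Covers requirements a) random trial order, b) random presentation position,
    and e) repeated/swapped control trials for subject outlier detection.
    """
    rng = random.Random(seed)
    base = []
    for t in trials:
        t = dict(t)
        if rng.random() < 0.5:                       # requirement b): random position
            t["map_A"], t["map_B"] = t["map_B"], t["map_A"]
        base.append(t)
    session = list(base)
    session += [dict(t) for t in rng.sample(base, n_repeated)]   # repeated pairs
    for t in rng.sample(base, n_swapped):                        # swapped pairs
        session.append({"pair_id": t["pair_id"],
                        "map_A": t["map_B"], "map_B": t["map_A"]})
    rng.shuffle(session)                             # requirement a): random order
    return session
```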

3.7 Statistical processing of ranks

Once the ranks are collected for all subjects, they are processed to compute an explainability performance score for each of the assessed FV explainability tools, finally ranking them according to their explainability performance. The proposed statistical processing of the collected ranks uses methods available in the literature and consists of two main steps, notably subject outlier detection and explainability performance score estimation.

3.7.1 Subject outlier detection

The subject outlier detection step is used to detect, and if necessary discard, ranking data from subjects that provided inconsistent ranks, by measuring the subjects’ reliability. Instead of measuring a subject’s reliability by comparing his/her rankings to those of other subjects, the subject reliability in pairwise comparison can be evaluated by examining only the individual ranking behavior of each subject along the test session trials, by including some repeated trials and checking the consistency of the corresponding ranks. Consequently, the subject outlier detection is performed by checking the subject ranks regarding:

  1. 1)

    Swapping pairs—A pair of visual explanations is repeated and displayed in a second trial after swapping the two visual explanations. For subject outlier detection, it is checked if the subject’s ranks appropriately match in both trials; otherwise, the ranking is considered inconsistent.

  2. 2)

    Repeated pairs—A pair of visual explanations is repeated and displayed twice in different trials without swapping the visual explanations position. For subject outlier detection, it is checked if the subject’s ranks appropriately match in both trials; otherwise, the ranking is considered inconsistent.

  3. 3)

    Transitivity rule violation—If visual explanations from more than two explainability tools are compared, it is checked whether the transitivity rules for the ranks are violated; if so, the ranking is considered inconsistent [44].

A transitivity rule is violated when the logical relationship between more than two visual explanations generated by distinct explainability tools is not maintained. As an example, suppose \(A,B,\) and \(C\) are three visual explanations generated by three different FV explainability tools and assume \(A > B\) indicates that visual explanation \(A\) was preferred over visual explanation \(B\) and \(A = B\) indicates that \(A\) and \(B\) are in a tie. A transitivity rule is violated when one of the cases expressed in Eqs. (1), (2), (3), and (4) occurs:

$$\left(A > B\right)\cap \left(B > C\right)\cap \left(C > A\right),$$
(1)
$$\left(A = B\right)\cap \left(B > C\right)\cap \left(C > A\right),$$
(2)
$$\left(A > B\right)\cap \left(B = C\right)\cap \left(C > A\right),$$
(3)
$$\left(A > B\right)\cap \left(B > C\right)\cap \left(C = A\right).$$
(4)

Considering the three rules above for subject outlier detection, a subject \(i\) is considered an outlier or not according to

$$\left\{\begin{array}{ll}\text{Subject } i \text{ is an outlier}, & \text{if } {IR}_{i}>IRThresh\\ \text{Subject } i \text{ is not an outlier}, & \text{if } {IR}_{i} \le IRThresh\end{array}\right.,$$
(5)

where \({IR}_{i}\) is the number of inconsistent ranks provided by subject \(i\) and \(IRThresh\) is the maximum number of inconsistent ranks allowed for a single subject.
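A minimal sketch of this rule is given below, assuming each subject's control trials (repeated and swapped pairs) have already been matched up; the data layout and the example IRThresh value are illustrative assumptions.

```python
def count_inconsistent_ranks(control_pairs):
    """control_pairs: list of (first_rank, second_rank, swapped) tuples for one
    subject, where a rank is 'A', 'B' or 'tie' and `swapped` indicates whether
    the two visual explanations exchanged positions in the second presentation."""
    flip = {"A": "B", "B": "A", "tie": "tie"}
    inconsistent = 0
    for first, second, swapped in control_pairs:
        expected = flip[first] if swapped else first
        if second != expected:
            inconsistent += 1
    return inconsistent

def is_outlier(control_pairs, ir_thresh=2):
    """Eq. (5): the subject is an outlier if IR_i > IRThresh
    (the threshold value here is an illustrative choice)."""
    return count_inconsistent_ranks(control_pairs) > ir_thresh
```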

3.7.2 Explainability performance score estimation

The explainability performance scores for the assessed explainability tools are estimated through two key steps:

1) Accumulated pairwise comparison matrix computation—The ranks for a subject obtained during the test session are represented using a preference matrix with each entry representing the number of times a subject preferred/ranked the visual explanations of a given explainability tool over the visual explanations of another explainability tool. The values of the preference matrices for all subjects who participated in the experiment are accumulated in an Accumulated Pairwise Comparison Matrix (APCM). The APCM is the key data structure used for explainability performance score estimation, with each entry having the same meaning as above, now considering all subjects who participated in the subjective assessment. The APCM values may be seen as the winning counts of each explainability tool. Equation (6) illustrates the structure of the APCM matrix, when \(M\) explainability tools are under assessment:

$$APCM=\begin{pmatrix} - & {Win\_Count}_{{E}_{1},{E}_{2}} & \cdots & {Win\_Count}_{{E}_{1},{E}_{M}}\\ {Win\_Count}_{{E}_{2},{E}_{1}} & - & \cdots & {Win\_Count}_{{E}_{2},{E}_{M}}\\ \vdots & \vdots & \ddots & \vdots \\ {Win\_Count}_{{E}_{M},{E}_{1}} & {Win\_Count}_{{E}_{M},{E}_{2}} & \cdots & - \end{pmatrix},$$
(6)

with \(M\): Number of explainability tools under assessment, labeled as \({E}_{m}, m\in \left\{1, 2, 3,\dots, M\right\}\).

\({Win\_Count}_{{E}_{m}, {E}_{n}}\): The winning counts of explainability tool \({E}_{m}\) over explainability tool \({E}_{n}\), \(m, n\in \left\{1, 2, 3,\dots, M\right\}\), representing the number of times the visual explanations from explainability tool \({E}_{m}\) are preferred/ranked over those from explainability tool \({E}_{n}\), considering all participating subjects across all test session trials.

Since ties are allowed in the FVX-PAP protocol, i.e., “A equivalent to B,” they need to be considered in the explainability performance score estimation process. A simple solution to deal with ties in the explainability performance score estimation is to consider a tie as contributing equally to both options [45]. The winning counts considering ties, \({Win\_Count\_Ties}_{{E}_{m}, {E}_{n}}\), should thus include half of the tie count for both explainability tools [45], as illustrated in Eqs. (7) and (8):

$${Win\_Count\_Ties}_{{E}_{m}, {E}_{n}}= {Win\_Count}_{{E}_{m}, {E}_{n}}+ \frac{{Ties}_{{E}_{m}, {E}_{n}}}{2},$$
(7)
$${Win\_Count\_Ties}_{{E}_{n}, {E}_{m}}= {Win\_Count}_{{E}_{n}, {E}_{m}}+ \frac{{Ties}_{{E}_{n}, {E}_{m}}}{2},$$
(8)

where \({Ties}_{{E}_{m}, {E}_{n}}\) is defined as the number of times the FV explainability tools \({E}_{m}\) and \({E}_{n}\) are in a tie.

Consequently, following Eq. (6), the final Accumulated Pairwise Comparison Matrix taking ties into account, \({APCM}_{Ties}\), is defined as

$${APCM}_{Ties}=\begin{pmatrix} - & {Win\_Count\_Ties}_{{E}_{1},{E}_{2}} & \cdots & {Win\_Count\_Ties}_{{E}_{1},{E}_{M}}\\ {Win\_Count\_Ties}_{{E}_{2},{E}_{1}} & - & \cdots & {Win\_Count\_Ties}_{{E}_{2},{E}_{M}}\\ \vdots & \vdots & \ddots & \vdots \\ {Win\_Count\_Ties}_{{E}_{M},{E}_{1}} & {Win\_Count\_Ties}_{{E}_{M},{E}_{2}} & \cdots & - \end{pmatrix}.$$
(9)

At this stage, the probability that the visual explanations from explainability tool \({E}_{m}\) are preferred/ranked over the visual explanations from explainability tool \({E}_{n}\) can be computed as

$${P}_{mn}= \frac{{Win\_Count\_Ties}_{{E}_{m}, {E}_{n}}}{{Win\_Count\_Ties}_{{E}_{m}, {E}_{n}}+{Win\_Count\_Ties}_{{E}_{n}, {E}_{m}}}.$$
(10)
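As an illustration of this accumulation step, the sketch below turns a flat list of per-trial ranks into the winning counts of Eqs. (6)–(9) and the preference probabilities of Eq. (10); the rank encoding and function name are illustrative assumptions.

```python
import numpy as np

def preference_probabilities(ranks, n_tools):
    """ranks: list of (m, n, outcome) tuples, one per trial, where m and n index
    the two compared tools and outcome is 'm', 'n' or 'tie'.
    Returns the matrix P with P[m, n] as in Eq. (10)."""
    win = np.zeros((n_tools, n_tools))      # Win_Count entries of the APCM (Eq. 6)
    ties = np.zeros((n_tools, n_tools))
    for m, n, outcome in ranks:
        if outcome == "m":
            win[m, n] += 1
        elif outcome == "n":
            win[n, m] += 1
        else:
            ties[m, n] += 1
            ties[n, m] += 1
    win_ties = win + ties / 2.0             # Win_Count_Ties (Eqs. 7-8), i.e., APCM_Ties (Eq. 9)
    with np.errstate(invalid="ignore"):     # the diagonal is 0/0 and stays undefined
        p = win_ties / (win_ties + win_ties.T)
    return p                                # P[m, n] as in Eq. (10)
```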

2) Explainability performance score computation—Once the \({APCM}_{Ties}\) matrix is obtained, an eXplainability Performance Score, \({XPS}_{m}\), \(m\in \left\{1, 2,\dots , M\right\}\), can be estimated for each of the assessed explainability tools, using a scaling method. Commonly, pairwise comparison data are scaled using one of two models, notably the Bradley–Terry (BT) model [46] or the Thurstone model [47].

The FVX-PAP protocol adopts the BT model, which is a probabilistic model used to estimate explainability performance scores, \({XPS}_{m}\), \(m\in \left\{1, 2,\dots , M\right\}\), which satisfy \({\sum }_{m=1}^{M}{XPS}_{m}=1\), \({XPS}_{m}\ge 0\), using the maximum likelihood estimation method. The BT model core assumption is that the probability \({P}_{mn}\) of selecting an explainability tool, \({E}_{m}\), over another explainability tool, \({E}_{n}\), is given by the ratio of its explainability performance score, \({XPS}_{m}\), to the sum of the scores \({XPS}_{m}\) and \({XPS}_{n}\), as follows [46]:

$$\frac{{XPS}_{m}}{{XPS}_{m}+ {XPS}_{n}} = {P}_{mn}.$$
(11)

This formulation allows the explainability performance scores to be computed.

For the case when only two FV explainability tools are being compared, \({E}_{m}\) and \({E}_{n}\), \(m,n\in \left\{\text{1,2}\right\}\) (\(M=2\)), in accordance with the BT model core principle, the sum of their explainability performance scores is \({XPS}_{m}+ {XPS}_{n}=1\) [46]. Thus, following Eq. (11), the explainability performance scores \({XPS}_{m}\) and \({XPS}_{n}\) are obtained as

$${XPS}_{m}= {P}_{mn}\times \left({XPS}_{m}+ {XPS}_{n}\right)={P}_{mn}\times 1= {P}_{mn},$$
(12)
$${XPS}_{n}= {P}_{nm}\times \left({XPS}_{n}+ {XPS}_{m}\right)={P}_{nm}\times 1={P}_{nm}.$$
(13)
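For the two-tool case, Eqs. (12) and (13) reduce to reading the preference probabilities directly, as in the minimal sketch below; for more than two tools, the XPS values would instead be obtained by fitting Eq. (11) through maximum likelihood estimation, which is not shown here.

```python
def xps_two_tools(p):
    """Explainability performance scores for M = 2 tools (Eqs. 12 and 13), given
    the preference probability matrix p of Eq. (10); XPS_1 + XPS_2 = 1."""
    return p[0, 1], p[1, 0]
```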

The statistical processing of ranks described above enables a robust estimation of explainability performance scores. This approach allows for the collection of preferences/ranks and the estimation of the explainability performance scores based on subjective assessments by participants across all test session trials. It not only accounts for direct preferences but also integrates ties, ensuring a comprehensive and fair estimation of scores. In the next section, this approach will be applied in a full assessment experiment using heatmaps as examples of visual explanations.

4 A FVX-PAP subjective assessment experiment

This section reports the full specification for a subjective test experiment performed according to the proposed FVX-PAP protocol.

4.1 Face image dataset

For the subjective assessment experiment, the LFW face recognition dataset [33], comprising 13,233 images of 5,749 individuals, 1,680 of whom have more than one image, was adopted. All face images in the LFW dataset have a spatial resolution of 250 × 250 pixels. Examples of LFW face images for a couple of subjects are shown in Fig. 5.

Fig. 5

Examples of selected probe–gallery pairs and their corresponding heatmap-based visual explanations

4.2 FV system

While the FV decision can be performed using any available FV system, the popular ArcFace [48] face recognition model, with a ResNet-100 backbone network [49], was used to perform the FV task. The ArcFace face recognition model pretrained with images from the MS1MV2 dataset [48] was utilized without additional training by the authors. More details on the pretrained ArcFace model can be found in [50].

To increase the FV accuracy, the face areas in the selected LFW test images are detected using the RetinaFace face detector [51] and cropped using the face landmarks provided by the same face detector. In addition, to satisfy the input resolution constraint imposed by the ArcFace face recognition model, the resolution of the detected face areas is adjusted before cropping to the spatial resolution of 112 × 112 pixels requested by ArcFace. Moreover, the cosine similarity is used as similarity metric, providing similarity scores between faces in the range [− 1, 1]. The FV decisions are taken using a cosine similarity threshold of 0.27 since this is the threshold leading to the best FV accuracy for the LFW dataset.
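A minimal sketch of this decision rule is given below, assuming the ArcFace embeddings of the cropped probe and gallery faces have already been extracted; face detection, cropping, and embedding extraction are not shown.

```python
import numpy as np

def fv_decision(probe_emb, gallery_emb, threshold=0.27):
    """Cosine-similarity FV decision between two face embeddings (1-D vectors)."""
    cos = float(np.dot(probe_emb, gallery_emb) /
                (np.linalg.norm(probe_emb) * np.linalg.norm(gallery_emb)))
    decision = "acceptance" if cos >= threshold else "rejection"
    return decision, cos
```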

4.3 Probe–gallery pairs selection

The FVX-PAP protocol can be used to assess the performance of explainability tools for different types of FV decisions, notably to handle genuine and impostor verification attempts, which can result in acceptance or rejection FV decisions. A genuine attempt refers to a user making a true claim, i.e., he/she really is who he/she claims to be; in this case, the probe and gallery images to be compared correspond to the same individual. On the contrary, an impostor attempt refers to a user making a false claim, i.e., he/she is not who he/she claims to be; in this case, the probe and gallery images to be compared correspond to two different individuals. To proceed with the subjective assessment experiment, the probe–gallery pairs of face images are selected through four key steps:

  1. 1)

    Genuine and impostor pairs selection—A set of genuine and impostor pairs of face images is randomly selected from the LFW dataset.

  2. 2)

    TA, FA, TR, and FR probe–gallery pairs definition—The FV decision is performed on the generated genuine and impostor pairs using the adopted similarity threshold, producing four groups of probe–gallery pairs, depending on the type of FV decision, notably True Acceptance (TA), False Acceptance (FA), True Rejection (TR), or False Rejection (FR) FV decisions (see the sketch following Table 1). The similarity scores for rejection (TR or FR) and acceptance (TA or FA) decisions lie in the ranges [− 1, 0.27) and [0.27, 1], respectively.

  3. 3)

    Probe–gallery pairs selection—A set of 26 probe–gallery face pairs is selected for each type of FV decision, i.e., TA, FA, TR, and FR, to limit the test sessions duration, which should not be too long and tiring for the subjects. For each type of decision, the probe–gallery pairs have been randomly selected while guaranteeing a rather uniform distribution along the possible similarity scores range. This selection process yields a total of 104 (26 × 4) probe–gallery pairs of face images, forming the basis for the FVX-PAP test sessions. In addition, 6 probe–gallery pairs were selected for the training session, naturally different from those used in the test sessions. The selected probe–gallery pairs encompass a diverse range of samples, including various genders and ethnic backgrounds.

  4. 4)

    Outlier detection probe–gallery pairs definition—To identify subject outliers, 8 trials are incorporated into each test session by duplicating or swapping the visual explanations for 8 already used trials, as discussed in Sect. 3.7.1, thus resulting in a total of 60 trials per test session. The distribution of these supplementary trials for subject outlier detection is illustrated in Table 1, considering there will be 2 test sub-experiments, one for acceptance decisions and another for rejection decisions.

Table 1 Distribution of trials for subject outlier detection
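The sketch below, referred to in step 2, categorizes a probe–gallery pair into one of the four FV decision types given the adopted decision threshold of 0.27; the function and argument names are illustrative.

```python
def decision_type(same_identity, similarity, threshold=0.27):
    """Categorize a probe-gallery pair into TA, FA, TR, or FR.

    same_identity : True for genuine pairs, False for impostor pairs
    similarity    : cosine similarity score in [-1, 1]
    """
    if similarity >= threshold:                  # acceptance decision
        return "TA" if same_identity else "FA"
    return "FR" if same_identity else "TR"       # rejection decision
```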

4.4 Selected heatmap-based FV explainability tools

This subjective assessment experiment will exercise the proposed FVX-PAP protocol using heatmaps as visual explanations. For this purpose, two post hoc FV explainability tools generating heatmaps as visual explanations have been selected, notably FV-RISE [9], and CorrRISE [16]. These two explainability tools are, to the best of the authors’ knowledge, the only heatmap-based tools able to explain any type of FV decision, notably for genuine and impostor face pairs and for acceptance and rejection decisions.

Both selected FV explainability tools adopt the Randomized Input Sampling for Explanation (RISE) [52] approach, originally designed to estimate the pixels’ importance in the context of object classification tasks, and apply that approach to the FV task. The key novelty of the FV-RISE and CorrRISE tools is to generate similarity and dissimilarity visual explanations, in this case heatmaps, to explain the FV decisions according to the type of decision performed, notably acceptance or rejection, for both genuine and impostor FV cases. Since these two tools adopt a similar technical approach, it is no surprise that they may exhibit rather equivalent explainability performance.

In FV-RISE, the similarity/dissimilarity heatmaps generation process is performed through three key steps:

  1. 1)

    Reference similarity score—Compute the similarity score between the original probe and gallery face images.

  2. 2)

    Similarity score for masked images—Apply random masks, following the RISE random masks generation technique, to the probe face image, to obtain a set of masked probe face images. Subsequently, the similarity score between each masked probe face image and the original gallery face image is computed.

  3. 3)

    Similarity and dissimilarity heatmaps generation—Divide the generated masks into masks for similarity and dissimilarity face regions using their respective similarity scores for masked images in relation to the reference similarity score. Following this categorization, a weighted sum is computed for each group of masks, leading to the generation of the similarity and dissimilarity heatmaps used to explain the acceptance and rejection cases, respectively, regardless of whether true or false FV decisions are performed.
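The sketch below illustrates the general flavor of step 3 with a simplified weighting scheme; it is an approximation for illustration only, not the authors' exact FV-RISE implementation, and the mask generation and similarity computation are assumed to be available.

```python
import numpy as np

def weighted_sum_heatmaps(masks, scores, reference_score):
    """masks  : N x H x W array of masks applied to the probe face image
       scores : length-N array of similarity scores for the masked probes
       reference_score : similarity score of the unmasked probe-gallery pair
    Masks are split according to whether they raise or lower the similarity with
    respect to the reference, and each group is aggregated by a weighted sum."""
    masks = np.asarray(masks, dtype=float)
    scores = np.asarray(scores, dtype=float)
    deltas = scores - reference_score
    sim_weights = np.clip(deltas, 0, None)     # masks preserving/raising similarity
    dis_weights = np.clip(-deltas, 0, None)    # masks reducing similarity
    sim_map = np.tensordot(sim_weights, masks, axes=1)
    dis_map = np.tensordot(dis_weights, masks, axes=1)
    norm = lambda m: (m - m.min()) / (m.max() - m.min() + 1e-8)
    return norm(sim_map), norm(dis_map)        # similarity / dissimilarity heatmaps
```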

In CorrRISE, the similarity/dissimilarity heatmaps generation process is performed through three key steps:

  1. 1)

    Masks generation—Generate a set of random masks to perturb the face image. Each mask is created by randomly generating multiple small square patches, with pixel values ranging from 0 to 1, positioned at various locations of a plain image.

  2. 2)

    Face image perturbation—Apply the generated masks to the face image and evaluate their impact on the similarity score.

  3. 3)

    Correlation-based similarity and dissimilarity heatmaps generation—Compute the Pearson correlation in a pixel-wise manner between the list of calculated similarity scores and the corresponding masks, resulting in similarity and dissimilarity heatmaps. More specifically, positive correlation coefficients contribute to the generation of the similarity heatmap, whereas negative correlation coefficients contribute to the generation of the dissimilarity heatmap.
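A sketch of the pixel-wise Pearson correlation of step 3 is given below, under the assumption that the masks and their similarity scores are already available; it illustrates the principle rather than the exact CorrRISE implementation.

```python
import numpy as np

def correlation_heatmaps(masks, scores):
    """masks: N x H x W perturbation masks; scores: length-N similarity scores.
    Computes, for every pixel, the Pearson correlation between its mask values
    and the similarity scores across the N perturbations."""
    masks = np.asarray(masks, dtype=float)
    scores = np.asarray(scores, dtype=float)
    n, h, w = masks.shape
    m = masks.reshape(n, -1)                          # N x (H*W)
    m_centered = m - m.mean(axis=0)                   # center each pixel's mask values
    s_centered = scores - scores.mean()               # center the similarity scores
    denom = np.sqrt((m_centered ** 2).sum(axis=0) * (s_centered ** 2).sum()) + 1e-8
    corr = (m_centered * s_centered[:, None]).sum(axis=0) / denom
    corr = corr.reshape(h, w)                         # Pearson r per pixel
    sim_map = np.clip(corr, 0, None)                  # positive correlations -> similarity
    dis_map = np.clip(-corr, 0, None)                 # negative correlations -> dissimilarity
    return sim_map, dis_map
```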

The CorrRISE tool is designed to explain the FV decisions by generating similarity and dissimilarity heatmaps for both probe and gallery images. However, for the performed experiment, only the similarity and dissimilarity heatmaps for probe face images are used. Figure 5 shows examples of selected probe–gallery pairs, for TA, FA, TR, and FR decisions, and their corresponding face heatmaps generated by FV-RISE and CorrRISE explainability tools; for each example probe–gallery pair, the corresponding cosine similarity score is shown.

4.5 Subject selection

Following the subject selection criteria outlined in Sect. 3.2, a total of 23 subjects from multiple countries, notably Austria, China, Iran, Italy, Morocco, Palestine, Portugal, Spain, and Syria, have been selected to participate in the subjective test using the proposed FVX-PAP protocol.

The age range of the subjects was 22 to 61 years, comprising 17 males and 6 females, see Fig. 6. No subject withdrew from the subjective test following the declaration of consent. However, one subject was screened out due to the outlier detection process detailed in Sect. 3.7.1.

Fig. 6

Age (a) and gender (b) distribution of the subjects

4.6 Application and environment conditions

While the FVX-PAP experiment may be performed in a controlled environment (e.g., a laboratory with invariant display and environment conditions), the experiments reported in this paper were conducted based on a crowdsourcing subjective evaluation, following the guidelines recommended by the European Network on Quality of Experience in Multimedia Systems and Services (Qualinet) as best practices and recommendations for Crowdsourced QoE [53]. In this context, the subjects remotely assess the explainability tools using their own computational environment and displays. The FVX-PAP protocol experiment is, therefore, performed in an uncontrolled environment but, to achieve more statistically reliable results, some guidelines had to be followed, notably in terms of display characteristics and environment conditions as detailed in Sects. 3.3 and 3.4.

A prevalent approach for facilitating crowdsourcing-based subjective evaluation involves the use of web-based, user-friendly software solutions. Thus, a custom-built software application following the FVX-PAP protocol has been built using the Node.js JavaScript runtime environment [54] and the MongoDB database management program [55]. The software application was hosted on an Apache2 web server, allowing subjects to seamlessly perform the experiment without the need for local software installation. During the application development, preliminary mock-up tests were conducted with a group of subjects to gather feedback on the graphical user interface layout and to identify problems that could potentially impact the effectiveness of the subjective test associated with different operational setups.

4.7 Training and test sessions design

Since this FVX-PAP experiment adopted a crowdsourcing-based approach, the subjects were provided with access to the developed web-based software application. According to the recommendations provided in [56], it is advised that the duration of a subjective test experiment conducted by a subject should not exceed 30 min. As a consequence, this FVX-PAP experiment was conducted over two distinct test sub-experiments with appropriate duration, each lasting around 20 min. The two sub-experiments focus on distinct types of FV decisions, notably Acceptance Decision Test Sub-Experiment and Rejection Decision Test Sub-Experiment, with the link for each sub-experiment provided to the subjects on different days to avoid fatigue. In summary, the two sub-experiments sessions were designed as follows:

  1) Acceptance decisions test sub-experiment—Designed to assess the performance of FV explainability tools for acceptance decisions, notably TA and FA decisions. Here the explainability of the FV decision is performed using the similarity face heatmap, which highlights the face regions contributing most to a positive FV decision (high verification similarity), independently of whether the probe–gallery pair corresponds to the same individual.

  2) Rejection decisions test sub-experiment—Designed to assess the performance of FV explainability tools for rejection decisions, notably TR and FR decisions. Here the explainability of the FV decision is performed using the dissimilarity face heatmap, which highlights the face regions contributing most to a negative FV decision (low verification similarity), independently of whether the probe–gallery pair corresponds to the same individual.

Each test sub-experiment comprises three key phases:

  1) Presentation of instructions—The subjects were presented with a set of instructions to ensure a full understanding of the task to be performed. Similar but slightly different instructions were provided for each sub-experiment, tailored to the type of FV decision to be explained, i.e., acceptance or rejection. Consequently, the instructions for each sub-experiment describe the type of FV decision to be explained, the nature of the face heatmaps provided, and what the subject is going to see.

  2) Training session—Subsequently, the subjects undergo a training session aimed at familiarizing them with the user interface and the task at hand. The training session for each sub-experiment included three trials, one for each of the three ranking possibilities, i.e., “A better than B,” “B better than A,” or “A and B equivalent,” with examples of face heatmaps for the associated FV decisions.

  3) Test session—Upon completion of the training session, the subjects register with the software application and start the test session. The test session begins with a concise reminder of the type of FV decision to be explained by the assessed explainability tools and of the way to perform the ranking. Afterward, as detailed in Sect. 3.6, the test session trials unfold one after the other, with the subject asked to express his/her relative preference/rank regarding the displayed face heatmaps.

The test session of each sub-experiment comprises 60 trials, with 30 trials allocated to each of its two FV decision types, i.e., TA and FA for the acceptance sub-experiment and TR and FR for the rejection sub-experiment. The presentation order of the 60 trials within each test session is randomized independently of the FV decision type.
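For illustration purposes only, the following minimal sketch (in Python, with hypothetical names such as build_test_session, trial_pool_a, and trial_pool_b) shows how a 60-trial test session with a randomized presentation order could be assembled; the actual Node.js-based web application is not bound to this structure.

```python
import random

def build_test_session(trial_pool_a, trial_pool_b, seed=None):
    """Assemble one test session of 60 trials (30 per FV decision type),
    presented in a random order independent of the decision type.

    trial_pool_a / trial_pool_b: lists of 30 trial descriptors each, e.g.,
    (probe-gallery pair, FV-RISE heatmap, CorrRISE heatmap) tuples for the
    two decision types of the sub-experiment (TA/FA or TR/FR).
    """
    assert len(trial_pool_a) == 30 and len(trial_pool_b) == 30
    trials = list(trial_pool_a) + list(trial_pool_b)
    rng = random.Random(seed)
    rng.shuffle(trials)   # randomize presentation order across decision types
    return trials         # 60 trials shown one after the other
```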

5 Experimental results and analysis

This section reports and discusses the experimental explainability performance scores obtained with the proposed FVX-PAP protocol using the subjective test specifications described in Sect. 4.

5.1 Subject outlier detection

The subject outlier detection process described in Sect. 3.7.1 was followed to detect subjects who did not focus adequately or answered rather randomly during the FVX-PAP experiment. Given that the present FVX-PAP experiment only considered two FV explainability tools, i.e., FV-RISE and CorrRISE, the violation of the transitivity rule was not considered when checking the reliability of subjects. Consequently, inconsistent subject ranks were determined based on swapping pairs and repeated pairs, as detailed in Table 1.

Given Eq. (5), the maximum number of inconsistent rankings allowed by a subject before exclusion was established as \(IRThresh=3\). This led to screening out one subject in the Acceptance Decisions Test Sub-Experiment and no subjects in the Rejection Decisions Test Sub-Experiment. Consequently, the rankings from 22 subjects (Acceptance Decisions Test Sub-Experiment) and 23 subjects (Rejection Decisions Test Sub-Experiment) are used in the following to estimate the eXplainability Performance Scores, \(XPS\), for the evaluated explainability tools.
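The sketch below illustrates the screening logic just described: inconsistent rankings on swapping and repeated pairs are counted and subjects exceeding the threshold are excluded. The data structures, the control-pair encoding, and the function name screen_outliers are hypothetical; the exact inconsistency definition follows Eq. (5) and Table 1 in Sect. 3.7.1 and may differ in detail.

```python
def screen_outliers(subject_rankings, control_pairs, ir_thresh=3):
    """Hypothetical sketch of the subject screening step.

    subject_rankings: dict mapping subject id -> dict of trial id -> rank,
    where a rank is 'A', 'B', or 'tie'.
    control_pairs: list of (trial_id_1, trial_id_2, kind) tuples, with kind
    'swapped' (same heatmap pair shown with A/B positions exchanged) or
    'repeated' (same pair shown twice in the same order).
    A ranking is counted as inconsistent when the two control trials of a
    pair receive contradictory answers; subjects with more than ir_thresh
    inconsistencies are excluded.
    """
    OPPOSITE = {'A': 'B', 'B': 'A', 'tie': 'tie'}
    kept, excluded = [], []
    for subject, ranks in subject_rankings.items():
        inconsistent = 0
        for t1, t2, kind in control_pairs:
            r1, r2 = ranks[t1], ranks[t2]
            expected = OPPOSITE[r1] if kind == 'swapped' else r1
            if r2 != expected:
                inconsistent += 1
        (excluded if inconsistent > ir_thresh else kept).append(subject)
    return kept, excluded
```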

5.2 Explainability performance score estimation

Once the subjects’ ranks were collected, the main goal was to estimate the explainability performance scores, \(XPS\), for the assessed explainability tools. The following subsections report and analyze the estimated explainability performance scores obtained for each type of FV decision, notably acceptance (TA and FA) and rejection (TR and FR) as well as for all FV decisions together.

5.2.1 Accumulative pairwise comparison matrices generation

The subject preferences obtained with the FVX-PAP protocol appear as winning and tie frequencies between each pair of face heatmaps. Consequently, to estimate the explainability performance scores of the assessed tools, the subject preferences are translated into numerical rankings. More specifically, after each trial ranking, a score of “+ 1” is attributed to the explainability tool with the preferred face heatmap, a score of “0” is attributed to the explainability tool with the non-preferred face heatmap, while a score of “+ 0.5” is attributed to both explainability tools when the face heatmaps of the pair are in a tie, i.e., when they are equivalent in explainability power. Afterward, the APCM matrix is constructed, as defined in Sect. 3.7.2, by accumulating the preference matrices of all participating subjects, thereby capturing the frequency with which the face heatmaps of each assessed explainability tool are preferred over those of the other explainability tool.
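A minimal sketch of this rank-to-score translation and APCM accumulation is given below, assuming two assessed tools and an 'A'/'B'/'tie' encoding of the collected ranks; the variable and function names are illustrative only and do not reflect the actual implementation.

```python
import numpy as np

TOOLS = ['FV-RISE', 'CorrRISE']   # row/column order: index 0 and 1

def preference_matrix(subject_ranks):
    """Build one subject's 2x2 preference matrix from his/her trial ranks.

    subject_ranks: list of 'A', 'B', or 'tie' entries for one FV decision
    type, where 'A' means the FV-RISE heatmap was preferred and 'B' the
    CorrRISE heatmap (hypothetical convention for this sketch).
    """
    pm = np.zeros((2, 2))
    for rank in subject_ranks:
        if rank == 'A':          # FV-RISE preferred: +1 for FV-RISE
            pm[0, 1] += 1.0
        elif rank == 'B':        # CorrRISE preferred: +1 for CorrRISE
            pm[1, 0] += 1.0
        else:                    # tie: +0.5 for each tool
            pm[0, 1] += 0.5
            pm[1, 0] += 0.5
    return pm

def accumulate_apcm(per_subject_ranks):
    """Accumulate the preference matrices of all subjects into the APCM."""
    return sum(preference_matrix(ranks) for ranks in per_subject_ranks)
```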

The subjects’ preference matrices and the APCM matrix generation processes are performed independently for each type of FV decision, i.e., TA, FA, TR, and FR, thereby providing insights into the efficacy of the assessed explainability tools across the four FV decision types. \(APCM_{TA}\), \(APCM_{FA}\), \(APCM_{TR}\), and \(APCM_{FR}\) refer to the APCM matrices associated with each type of FV decision, wherein the matrix entries correspond to the number of times the face heatmaps of one explainability tool are preferred over those of the other for the TA, FA, TR, and FR FV decisions, respectively. For the performed experiments, the counting of the ranks led to the following matrices:

$$APCM_{TA}=\begin{pmatrix}0 & 310.5\\ 349.5 & 0\end{pmatrix},\qquad APCM_{FA}=\begin{pmatrix}0 & 326\\ 334 & 0\end{pmatrix},$$
$$APCM_{TR}=\begin{pmatrix}0 & 348.5\\ 341.5 & 0\end{pmatrix},\qquad APCM_{FR}=\begin{pmatrix}0 & 347.5\\ 342.5 & 0\end{pmatrix},$$

where, in each matrix, rows and columns are ordered as (FV-RISE, CorrRISE), i.e., the first-row, second-column entry is the accumulated preference count of FV-RISE over CorrRISE and vice versa.

Table 2 provides a detailed distribution of the subjects’ ranks collected for the different types of FV decisions, i.e., acceptance (TA and FA) and rejection (TR and FR), according to the FVX-PAP protocol. Specifically, \(Win\_Count_{FV\_RISE,\,CorrRISE}\) and \(Win\_Count_{CorrRISE,\,FV\_RISE}\) indicate the number of times the FV-RISE face heatmaps are preferred over the CorrRISE face heatmaps, excluding tied ranks, and vice versa, respectively. Moreover, \(Ties_{FV\_RISE,\,CorrRISE}\) indicates the number of times the FV-RISE and CorrRISE face heatmaps are in a tie. Finally, \(Win\_Count\_Ties_{FV\_RISE,\,CorrRISE}\) and \(Win\_Count\_Ties_{CorrRISE,\,FV\_RISE}\) indicate the number of times the FV-RISE face heatmaps are preferred over the CorrRISE face heatmaps, incorporating the tied ranks, and vice versa, respectively.

Table 2 Subjects’ rankings across the FVX-PAP protocol experiments

5.2.2 Explainability performance scores

To assess the efficacy of the FV-RISE and CorrRISE explainability tools in some detail, a comprehensive approach is followed, which involves assessing their performance across the four FV decision types, notably TA, FA, TR, and FR. Consequently, the \(APCM_{TA}\), \(APCM_{FA}\), \(APCM_{TR}\), and \(APCM_{FR}\) matrices are used, as described in Sect. 3.7.2, to estimate the corresponding explainability performance scores.
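As an illustration, the sketch below assumes that the explainability performance score of each tool is its normalized accumulated preference count, which, for the two-tool case, coincides with the Bradley–Terry maximum-likelihood estimate; the exact estimation procedure of Sect. 3.7.2 may differ in its details.

```python
import numpy as np

def explainability_performance_scores(apcm):
    """Minimal sketch: normalize accumulated preference counts into XPS.

    apcm: 2x2 accumulated pairwise comparison matrix where apcm[i, j] is the
    number of times tool i is preferred over tool j (ties counted as 0.5
    for each tool). The resulting scores sum to 1.
    """
    wins = apcm.sum(axis=1)        # accumulated preferences per tool
    return wins / wins.sum()

# Example with the APCM_TA reported above (rows/columns: FV-RISE, CorrRISE):
apcm_ta = np.array([[0.0, 310.5],
                    [349.5, 0.0]])
print(explainability_performance_scores(apcm_ta))  # ~[0.470, 0.530], Fig. 7a
```

Applying the same computation to the remaining matrices yields the scores reported in Fig. 7.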

The FV-RISE and CorrRISE explainability performance scores for the different types of FV decisions, i.e., acceptance (TA and FA) and rejection (TR and FR), as well as the overall explainability performance scores considering all cases together, are depicted in Fig. 7. A higher explainability performance score for one tool indicates a superior explainability performance relative to the other tool for the relevant type of FV decision.

Fig. 7 Explainability performance scores for FV-RISE and CorrRISE FV explainability tools: a for acceptance decisions; b for rejection decisions; c jointly for acceptance and rejection decisions; and d overall explainability performance scores

The results in Fig. 7 allow deriving the following observations and conclusions:

  • True acceptance decisions—CorrRISE outperforms FV-RISE by achieving an explainability performance score of 0.530 compared to 0.470 for FV-RISE, see left side of Fig. 7a, which indicates the ability of CorrRISE to provide better visual explanations when genuine faces are processed and acceptance decisions are taken.

  • False acceptance decisions—The right side of Fig. 7a shows that CorrRISE and FV-RISE offer equivalent explainability performance for false acceptance decisions, notably 0.506 versus 0.494, which indicates that both explainability tools offer nearly equivalent visual explanations when an impostor is erroneously accepted as a genuine user.

  • True rejection decisions—FV-RISE and CorrRISE achieve equivalent explainability performance scores for true rejection decisions, see left side of Fig. 7b. This behavior indicates that both explainability tools generate nearly similar dissimilarity visual explanations to explain rejection decisions when impostor faces are processed.

  • False rejection decisions—For false rejections, the two tools offer almost equal explainability performance scores, see the right side of Fig. 7b. This reinforces the conclusion that both explainability tools generate nearly similar dissimilarity visual explanations, but this time when genuine faces are processed and rejection decisions are made.

To gain a better understanding of the assessed explainability tools’ performance, the comparison of the tools may also be carried out with respect to the type of FV decision, i.e., acceptance or rejection, regardless of whether genuine or impostor faces are processed, as well as for all decisions together, leading to the following conclusions:

  • Acceptance decisions—The explainability scores reveal that CorrRISE performs slightly better than FV-RISE when acceptance decisions are made, exhibiting an explainability performance score of 0.52 compared to 0.48 for FV-RISE, see the left side of Fig. 7c.

  • Rejection decisions—For rejection decisions, FV-RISE exhibits slightly superior performance to CorrRISE, with an explainability performance score of 0.504 versus 0.496 for CorrRISE. This result reinforces the observation from Fig. 7b, indicating that both explainability tools offer nearly equivalent dissimilarity visual explanations to explain FV rejection decisions.

  • Overall decisions—The explainability performance scores are also estimated for FV-RISE and CorrRISE considering all the cases, regardless of whether genuine or impostor faces are processed and irrespective of acceptance or rejection decisions. Figure 7d allows concluding that FV-RISE and CorrRISE offer a rather similar explainability performance, with CorrRISE showing a very slight advantage over FV-RISE. This similar performance was somewhat expected since the two explainability tools are technically rather similar, as both adopt the RISE random-masking technique to produce similarity/dissimilarity visual explanations and thus tend to highlight rather similar face regions in their explanations. Unfortunately, it was not possible to perform this FVX-PAP proof-of-concept experiment with other, technically more different, FV explainability tools, since no other available tools generate heatmap-based visual explanations for all four types of FV decisions, i.e., TA, FA, TR, and FR, notably for the false FV decisions (FA and FR), while the target was also to validate the proposed protocol across the four FV decision types.

In summary, the proposed FVX-PAP protocol offers a comprehensive approach for evaluating and comparing FV explainability tools generating visual explanations, as demonstrated here with an experiment involving heatmap-based explanations. The FVX-PAP protocol quantifies human subjective evaluations of these tools by comparing their visual explanation outputs in pairs. Through this structured approach, the protocol enables the estimation of explainability performance scores, thereby facilitating a relative assessment of the explainability performance offered by the evaluated FV explainability tools.

6 Final remarks and future work

In recent years, FV explainability tools have gained attention due to the increasing use of face recognition technology and the growing concerns related to its potential biases and lack of transparency. These tools are especially critical for complex DL-based FV systems/models, as they enable the end-user to understand how these systems arrive at their decisions. Despite the remarkable strides made by explainability tools in enhancing the transparency of FV systems, a significant gap remains in the comprehensive assessment of their performance. The critical importance of assessing the performance of these tools cannot be overstated, as it directly impacts the transparency and trustworthiness of FV systems in real-world applications.

In this context, this paper proposed, for the first time, a subjective performance assessment protocol for FV explainability tools generating visual explanations. The key objective of the proposed protocol is to evaluate the explainability performance of FV explainability tools that generate any type of visual explanations, such as saliency maps, heatmaps, contour-based visualization maps, and face segmentation maps, offering a systematic approach for assessing the relative performance of two or more FV explainability tools that generate the same type of visual explanations. The evaluation is accomplished by adopting the pairwise comparison method to compare the visual explanations through a group of subjects, expressing their preferences regarding the pairs of visual explanations. The subjects’ preferences are collected and statistically analyzed to estimate the relative explainability performance for the assessed FV explainability tools.

The efficacy of the proposed protocol was validated using two heatmap-based FV explainability tools, namely FV-RISE and CorrRISE, as instances of FV explainability tools generating visual explanations, across the four types of FV decisions, notably TA, FA, TR, and FR. The estimated explainability performance scores revealed that both evaluated explainability tools exhibit comparable behavior across the FA, TR, and FR decisions, generating nearly similar visual explanations; however, the CorrRISE tool performs slightly better than FV-RISE for TA decisions. Future work will use the FVX-PAP protocol to assess the explainability performance of other tools, notably tools not relying on heatmap-based visual explanations, and also to compare explainability tools using different types of visual explanations. Moreover, the proposed protocol will be used to examine the influence of the gender and ethnicity of the test data and subjects on the explainability performance of FV systems.

Data availability

The data and source code of the software application used for the proposed protocol validation are available in the GitHub repository, https://github.com/NaimaBousnina/Subjective_FV_Explainability_Performance_Assessment/tree/main.

Notes

  1. The custom-built software application follows the proposed FVX-PAP protocol; for more details, please access the software implementation at https://github.com/NaimaBousnina/Subjective_FV_Explainability_Performance_Assessment/tree/main.

Abbreviations

APCM: Accumulated pairwise comparison matrix
BT: Bradley–Terry
CFP: Celebrities in frontal-profile
CorrRISE: Correlation-based randomized input sampling for explanation
DL: Deep learning
FA: False acceptance
FR: False rejection
FV: Face verification
FV-RISE: Face verification-based randomized input sampling for explanation
HSV: Hue, saturation, value
LFW: Labeled faces in the wild
ResNet-100: Residual network-100
RISE: Randomized input sampling for explanation
TA: True acceptance
TR: True rejection
XPS: eXplainability performance score

References

  1. A.K. Jain, K. Nandakumar, A. Nagar, Biometric template security. EURASIP J. Adv. Signal Process. 2008(113), 1–17 (2008). https://doi.org/10.1155/2008/579416


  2. A.K. Jain, A. Ross, S. Prabhakar, An introduction to biometric recognition. IEEE Trans. Circuits Syst. Video Technol. 14(1), 4–20 (2004). https://doi.org/10.1109/TCSVT.2003.818349


  3. B. Yalavarthi et al., Enhancing privacy in face analytics using fully homomorphic encryption (2024). arXiv:2404.16255v1

  4. M. Huber, A. T. Luu, P. Terhörst, N. Damer, Efficient explainable face verification based on similarity score argument backpropagation (2023). arXiv:2304.13409v2

  5. D. Almeida, K. Shmarko, E. Lomas, The ethics of facial recognition technologies, surveillance, and accountability in an age of artificial intelligence: a comparative analysis of US, EU, and UK Regulatory Frameworks. AI Ethics. 2, 377–387 (2022). https://doi.org/10.1007/s43681-021-00077-w


  6. P. C. Neto et al., Causality-inspired taxonomy for explainable artificial intelligence (2024). arXiv:2208.09500v2

  7. M. Knoche, T. Teepe, S. Hormann, G. Rigoll, Explainable model-agnostic similarity and confidence in face verification. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 2023

  8. D. Mery, B. Morris, On black-box explanation for face verification. In: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2022

  9. N. Bousnina, J. Ascenso, P. L. Correia, F. Pereira, A RISE-based explainability method for genuine and impostor face verification. In: International Conference of the Biometrics Special Interest Group (BIOSIG), Darmstadt, Germany, 2023

  10. X. Bai et al., Explainable deep learning for efficient and robust pattern recognition: a survey of recent developments. Pattern Recognit. (2021). https://doi.org/10.1016/j.patcog.2021.108102


  11. A. Adadi, M. Berrada, Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access. 6, 52138–52160 (2018). https://doi.org/10.1109/ACCESS.2018.2870052


  12. D.V. Carvalho, E.M. Pereira, J.S. Cardoso, Machine learning interpretability: a survey on methods and metrics. Electronics 8(8), 1–34 (2019). https://doi.org/10.3390/electronics8080832


  13. M. Nauta et al., From anecdotal evidence to quantitative evaluation methods: a systematic review on evaluating Explainable AI. ACM Comput. Surv. 55(13s), 1–34 (2023). https://doi.org/10.1145/3583558


  14. J. R. Williford, B. B. May, J. Byrne, Explainable face recognition. In: Computer Vision—ECCV 2020, Glasgow, UK, August 2020, ed. by A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm

  15. Y.-S. Lin et al., xCos: an explainable cosine metric for face verification task. ACM Trans. Multimedia Comput. Commun. Appl. 17(3s), 1–16 (2021). https://doi.org/10.1145/3469288


  16. Y. Lu, Z. Xu, T. Ebrahimi, Towards visual saliency explanations of face verification (2023). arXiv:2305.08546v4

  17. B. Yin et al., Towards interpretable face recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019

  18. M. Winter, W. Bailer, G. Thallinger, Demystifying face-recognition with locally interpretable boosted features (LIBF). In: 10th European Workshop on Visual Information Processing (EUVIP), Lisbon, Portugal, 2022

  19. Y. Lu, T. Ebrahimi, Explanation of face recognition via saliency maps. In: Applications of Digital Image Processing XLVI, ed. by A. G. Tescher and T. Ebrahimi, San Diego, CA, United States, 2023

  20. G. Vilone, L. Longo, Notions of explainability and evaluation approaches for explainable artificial intelligence. Inf. Fusion 76, 89–106 (2021). https://doi.org/10.1016/j.inffus.2021.05.009


  21. F. Doshi-Velez, B. Kim, Considerations for evaluation and generalization in interpretable machine learning, in Explainable and interpretable models in computer vision and machine learning. ed. by H.J. Escalante, S. Escalera, I. Guyon, X. Baró, Y. Güçlütürk, U. Güçlü, M. Van Gerven (Springer International Publishing, Cham, 2018), pp.3–17


  22. W. Yang et al., Survey on explainable AI: from approaches, limitations and applications aspects. Hum.-Cent. Intell. Syst. 3, 161–188 (2023). https://doi.org/10.1007/s44230-023-00038-y


  23. J. Colin, T. FEL, R. Cadene, T. Serre, What I cannot predict, I do not understand: A human-centered evaluation framework for explainability methods. In Advances in Neural Information Processing Systems, edited by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Vol. 35 (Curran Associates, Inc., 2022), pp. 2832–2845

  24. K. Sokol, J. E. Vogt, What does evaluation of explainable artificial intelligence actually tell us? A case for compositional and contextual validation of XAI building blocks. In: Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, Honolulu HI USA, 2024

  25. A. Barredo Arrieta et al., Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 58, 82–115 (2020). https://doi.org/10.1016/j.inffus.2019.12.012


  26. Z.C. Lipton, The mythos of model interpretability. Commun. ACM 61(10), 36–43 (2018). https://doi.org/10.1145/323323


  27. G. Castanon, J. Byrne, Visualizing and quantifying discriminative features for face recognition. In: 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 2018

  28. Z. Xu, Y. Lu, T. Ebrahimi, Discriminative deep feature visualization for explainable face recognition. In: IEEE 25th International Workshop on Multimedia Signal Processing (MMSP), Poitiers, France, 2023

  29. D. Mery, True black-box explanation in facial analysis. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 2022

  30. A. Rajpal, K. Sehra, R. Bagri, P. Sikka, XAI-FR: explainable AI-based face recognition using deep neural networks. Wirel. Pers. Commun. 129, 663–680 (2023). https://doi.org/10.1007/s11277-022-10127-z


  31. H. Jiang, D. Zeng, Explainable face recognition based on accurate facial compositions. In: IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 2021

  32. R.K. Mantiuk, A. Tomaszewska, R. Mantiuk, Comparison of four subjective methods for image quality assessment. Comput. Graph. Forum 31(8), 2478–2491 (2012). https://doi.org/10.1111/j.1467-8659.2012.03188.x


  33. G. B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In: Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, Marseille, France, 2008

  34. R. Correia, P. Correia, F. Pereira, Face verification explainability heatmap generation. In: International Conference of the Biometrics Special Interest Group (BIOSIG), Darmstadt, Germany, 2023

  35. O. M. Parkhi, A. Vedaldi, A. Zisserman, Deep face recognition. In: Proceedings of the British Machine Vision Conference, Swansea, UK, 2015

  36. A. Chattopadhay, A. Sarkar, P. Howlader, V. N. Balasubramanian, Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In: IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 2018

  37. K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: visualising image classification models and saliency maps. In: Proceedings of the 2nd International Conference on Learning Representations, Banff, AB, Canada, 2013

  38. A. Stylianou, R. Souvenir, R. Pless, Visualizing deep similarity networks. In: IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 2019

  39. S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, S. Zafeiriou, AgeDB: The first manually collected, in-the-wild age database. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 2017

  40. S. Sengupta et al., Frontal to profile face verification in the wild. In: IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 2016

  41. Standard ISO/IEC 29170-2:2015, Information technology—Advanced image coding and evaluation—Part 2: Evaluation procedure for nearly lossless coding. (Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), 2015), https://www.iso.org/standard/66094.html. Accessed 23 February 2024

  42. Recommendation ITU-T P.910, Subjective video quality assessment methods for multimedia applications. (ITU-T Telecommunication Standardization Sector of ITU, 2022), https://www.itu.int/rec/T-REC-P.910-202207-S. Accessed 21 February 2024

  43. S. Bel, Color blindness test: Color deficiency testing plates, (Independently published, 2021), pp. 1–50

  44. Z. Zhang et al., An improved pairwise comparison scaling method for subjective image quality assessment. In: IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), Cagliari, Italy, 2017

  45. M.E. Glickman, Parameter estimation in large dynamic paired comparison experiments. J. R. Stat. Soc. C Appl. Stat. 48(3), 377–394 (1999). https://doi.org/10.1111/1467-9876.00159


  46. R.A. Bradley, M.E. Terry, Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39, 324–345 (1952). https://doi.org/10.2307/2334029


  47. L.L. Thurstone, A law of comparative judgment. Psychol. Rev. 34(4), 273–286 (1927). https://doi.org/10.1037/h0070288


  48. J. Deng, J. Guo, N. Xue, S. Zafeiriou, ArcFace: additive angular margin loss for deep face recognition. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019

  49. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016

  50. Distributed Arcface Training in Pytorch, https://github.com/deepinsight/insightface/tree/master/recognition/arcface_torch. Accessed 11 July 2024

  51. J. Deng, J. Guo, E. Ververas, I. Kotsia, S. Zafeiriou, RetinaFace: single-shot multi-level face localisation in the wild. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020

  52. V. Petsiuk, A. Das, K. Saenko, RISE: Randomized input sampling for explanation of black-box models (2018). arXiv:1806.07421v3

  53. T. Hoßfeld et al., Best practices and recommendations for crowdsourced QoE lessons learned from the Qualinet WG2 task force “Crowdsourcing”. (COST Action IC1003 European Network on Quality of Experience in Multimedia Systems and Services (QUALINET), 2014), https://infoscience.epfl.ch/record/204797?ln=en. Accessed 23 February 2024

  54. OpenJS Foundation and Node.js contributors, Node.Js. https://nodejs.org/en. Accessed 23 February 2024

  55. MongoDB, Inc., MongoDB. https://www.mongodb.com/. Accessed 23 February 2024

  56. Recommendation ITU-R BT.500-10, Methodology for the subjective assessment of the quality of television pictures. (ITU Radiocommunication Sector, 2000), https://www.itu.int/rec/R-REC-BT.500-10-200003-S/en. Accessed 23 February 2024


Acknowledgements

This work has been partially supported by the European CHIST-ERA program via the French National Research Agency (ANR) within the XAIface project (grant agreement CHIST-ERA-19-XAI-011) and Fundação para a Ciência e a Tecnologia in Portugal under the project CHIST-ERA/0003/2019 with https://doi.org/10.54499/CHIST-ERA/0003/2019.

Funding

This study was supported by Fundação para a Ciência e a Tecnologia, https://doi.org/10.54499/CHIST-ERA/0003/2019.

Author information


Contributions

Not applicable.

Corresponding author

Correspondence to Naima Bousnina.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Bousnina, N., Ascenso, J., Correia, P.L. et al. Subjective performance assessment protocol for visual explanations-based face verification explainability. J Image Video Proc. 2024, 33 (2024). https://doi.org/10.1186/s13640-024-00645-0


  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s13640-024-00645-0

Keywords