Benchmark for Anonymous Video Analytics

Out-of-home audience measurement aims to count and characterize the people exposed to advertising content in the physical world. While audience measurement solutions based on computer vision are of increasing interest, no commonly accepted benchmark exists to evaluate and compare their performance. In this paper, we propose the first benchmark for digital out-of-home audience measurement that evaluates the vision-based tasks of audience localization and counting, and audience demographics. The benchmark is composed of a novel dataset captured at multiple locations and a set of performance measures. Using the benchmark, we present an in-depth comparison of eight open-source algorithms on four hardware platforms with GPU and CPU-optimized inference, and of two commercial off-the-shelf solutions for localization, count, age, and gender estimation. This benchmark and related open-source codes are available at http://ava.eecs.qmul.ac.uk.


Introduction
Digital out-of-home advertisement is rapidly growing thanks to the availability of affordable, internet-connected smart screens. Anonymous Video Analytics (AVA) aims to enable real-time understanding of audiences exposed to advertisements in order to estimate the reach and effectiveness of each advertisement. AVA ensures the preservation of the privacy of audience members by performing inferences and aggregating them directly on edge systems, without recording or streaming raw data (Figure 1).
AVA relies on person detectors or trackers to localize people and to enable the estimation of audience attributes, such as their demographics. AVA should produce accurate results and be robust to environmental variations and varying illumination. While well-established performance measures exist to evaluate generic computer vision algorithms [1][2][3][4], these measures do not take into account the desirable features of AVA for out-of-home advertisement, such as the opportunity for a person to see the advertisement. Multiple datasets exist to benchmark detection [5,6], tracking [7][8][9][10][11], and re-identification algorithms [12]. Further datasets include KITTI [5], which focuses on autonomous driving, and Common Objects in Context (COCO) [6], which covers object detection, segmentation, and captioning. Moreover, the Robust Vision Challenge [13] evaluates scene reconstruction, optical flow, semantic and instance segmentation, and depth prediction; the Visual Object Tracking Benchmark [10,11] compares single-object tracking algorithms; and the Multiple Object Tracking Benchmark [7][8][9] compares multiple-object trackers. These benchmarks and datasets are designed for scenarios that differ from those of digital out-of-home advertisement, and their annotations lack information, such as demographics and attention of the audience, that is relevant for assessing AVA algorithms.

Because of the growing importance of digital out-of-home advertisement and the lack of a standard evaluation protocol, in this paper we present the Benchmark for Anonymous Video Analytics. This work is the first publicly available benchmark specifically designed to evaluate AVA solutions. The benchmark includes a set of performance measures specifically designed for audience measurement, an online evaluation tool, a novel fully-annotated dataset for digital out-of-home AVA, and open-source baseline algorithms and evaluation codes.
The dataset annotations include over a million localization bounding boxes, and age, gender, attention, pose, and occlusion information. We also benchmark eight baseline algorithms: two face detectors, two person trackers, two age estimators, two gender estimators; and two commercial off-the-shelf solutions.
The paper is structured as follows. Section 1.1 introduces the main definitions of the work and describes the proposed analytics for AVA; Section 1.2 presents the performance measures used for benchmarking; Section 2 describes the detection, tracking, age, and gender estimation algorithms; Section 3 introduces the  proposed dataset and its annotation; Section 4 presents the benchmarking results; and Section 5 summarizes the findings of the work.

Analytics
Let a digital signage be equipped with a system and a camera. The system is a computer that manages the advertisement playback and processes the video of the surroundings of the signage captured by the camera. Let the video V = {I_t}_{t=1}^{T} be composed of T frames I_t, each with frame index t. We consider the following attributes of the people in the video: count, age, and gender (see Figure 2). To enable the estimation of the above attributes, we also consider the localization of people in V, namely their position and dimensions in I_t. We consider a person in I_t to have Opportunity to See (OTS) the signage when their face is visible from its left profile to its right profile and the person is not heading opposite to the location of the camera, as shown in Figure 3. We consider only the attributes of people with OTS.
Let the estimated location and dimensions of person j ∈ N with OTS in I_t be represented with a bounding box d̂^j_t = [x, y, w, h], where N is the set of natural numbers. The bounding box is defined by the horizontal, x, and vertical, y, image coordinates of its top-left corner, and by its width, w, and height, h. The location of person j, d̂^j_t, may be represented by their face or their body, and can be estimated with a detection or a tracking algorithm. While with a detector the index j may change over time (i.e. the index j is not related to the identity of the person), trackers aim to keep the index j consistent over time.
Localization algorithms enable the estimation of the number of people with OTS (counting) at time t, n_t ∈ N, and trackers enable the estimation of the cumulative number of unique people with OTS within a time window between t_1 and t_2, n_{t1:t2} ∈ N. If A is a set of age ranges (e.g. [19, 34]), an age estimation algorithm is expected to determine the age, â^j_t ∈ A, of a person with OTS, d̂^j_t; these age ranges have been selected as they are commonly used in audience analytics. A gender estimation algorithm determines the gender, ĝ^j_t ∈ G, of each detected person with OTS, d̂^j_t, where G is the set of possible gender classes. In summary, for each person j with OTS, an AVA solution is expected to produce at each time t: j, the person index (for trackers, the tracking identity, consistent throughout V); d̂^j_t, the estimated location of the face and/or body; â^j_t ∈ A, the estimated age; and ĝ^j_t ∈ G, the estimated gender.
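The per-person outputs listed above can be sketched as a simple record type. This is a minimal illustration only; the class and field names are ours, not from the benchmark's codebase.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class AVAOutput:
    """One per-frame estimate for a person with opportunity to see (OTS)."""
    t: int                            # frame index
    j: int                            # person index (consistent over time for trackers)
    box: Tuple[int, int, int, int]    # (x, y, w, h): top-left corner, width, height
    age_range: Optional[str] = None   # one of the age classes in A, e.g. "[19, 34]"
    gender: Optional[str] = None      # one of the gender classes in G


out = AVAOutput(t=0, j=1, box=(120, 40, 64, 80), age_range="[19, 34]", gender="female")
```

A detector would emit such records with a fresh `j` per detection, while a tracker reuses `j` across frames for the same person.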

Performance measures
We introduce a set of performance measures for assessing the accuracy of localization, counting, age and gender estimation. These measures, which are concise and easy to understand by a broad community, enable the evaluation and comparison of AVA algorithms.

Localization
We evaluate the localization performance based on precision (P), recall (R), and F1-Score (F) [4], which are defined based on true positives (TP), false positives (FP), and false negatives (FN). Precision (P) is the ratio between correct estimations and the total number of estimations. Recall (R) is the ratio between correct estimations and the total number of actual occurrences. F1-Score (F) is the harmonic mean of P and R.
For localization, we define TP, FP, and FN based on the intersection over union (IOU) between estimations, d̂^j_t, and annotations, d^j_t:

IOU(d̂^j_t, d^j_t) = A(d̂^j_t ∩ d^j_t) / A(d̂^j_t ∪ d^j_t),

where ∩ and ∪ are the intersection and union operators, A(·) computes the area in pixels, and IOU ∈ [0, 1]. An example of face localization with highlighted TP, FP, and FN detections is given in Figure 5.
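A minimal sketch of these quantities for axis-aligned [x, y, w, h] boxes follows. The function names are illustrative, and a match threshold on IOU (e.g. IOU ≥ 0.5) would be applied on top of this to declare a detection a TP.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    ix = max(0, min(xa + wa, xb + wb) - max(xa, xb))   # overlap width
    iy = max(0, min(ya + ha, yb + hb) - max(ya, yb))   # overlap height
    inter = ix * iy
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(tp, fp, fn):
    """P, R, and F1 from true/false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```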
We consider the variation of R for different person-signage distances. Let A(·) be a function that computes the area in pixels of a bounding box d^j_t, and let p^a_dist be the a-th percentile of the areas of all the bounding boxes in the video. We define two bands for the person-signage distance, namely close, when A(d^j_t) ≥ p^50_dist, and far, when A(d^j_t) < p^50_dist. We assume that people closer to the
camera are annotated by a larger bounding box than those who are farther away. We also consider the variation of R in the presence of occlusions. Let o^j_t ∈ {non occluded, partially occluded, heavily occluded} be the annotated occlusion level. The three occlusion bands are: non occluded, when the annotation is not occluded; partially occluded, when the annotated area is occluded by less than 50%; and heavily occluded, when the annotated area is occluded by 50% or more.
Note that we only report R, and not P, for person-signage distance and occlusion, as false positives (necessary to compute P) cannot be unequivocally determined when the annotations are divided into bands such as far/close (for distance) or non occluded/partially occluded/heavily occluded (for occlusion). For instance, a far (close) estimation might not match any far (close) annotation, but this is not necessarily a false positive, as it might match a close (far) annotation.

Counting

We quantify the performance of the localization algorithms for the task of people counting with the following performance measures: Mean Opportunity Error (MOE), Cumulative Opportunity Error (COE), and Temporal Cumulative Opportunity Error (TCOE).
The Mean Opportunity Error (MOE) quantifies the ability of an algorithm to count people with OTS at a specific time t, n̂_t, and it is calculated with respect to the actual number of people with OTS at t, n_t:

MOE = (1/T) Σ_{t=1}^{T} |n̂_t − n_t|.

MOE ≥ 0 (Figure 4c) and its optimal value is MOE = 0. We analyze how MOE varies with the person-signage distance, as in localization, and when the input video frame rate is reduced. We show a visual sample of MOE in Figure 5.

The Cumulative Opportunity Error (COE) quantifies the ability of an algorithm to count unique people with OTS, and it is calculated with respect to the actual cumulative number of people with OTS for the whole video, n_{1:T}:

COE = |n̂_{1:T} − n_{1:T}| / max(n_{1:T}, 1),

where max(·) is the max operation. COE ≥ 0 and its optimal value is COE = 0. This performance measure is normalized with respect to the actual cumulative number of people; therefore, COE indicates the ratio of error with respect to the actual cumulative number of people.

The Temporal COE (TCOE) quantifies the ability of an algorithm to count unique people with OTS over temporal segments of generic duration (e.g. 10 seconds), and it is calculated with respect to the cumulative number of unique people with OTS:

TCOE_{D,T} = (1/|T_{D,T}|) Σ_{t ∈ T_{D,T}} |n̂_{t:t+D} − n_{t:t+D}|,

where T_{D,T} = {1, 2, 3, ..., T − D} is the set of initial frames of the segments, D < T is the duration in frames of the segments, and |T_{D,T}| is the total number of segments. We consider values that correspond to the typical duration of digital out-of-home advertisements, D = {10, 20, 30, 60, 90, 120} · γ, where γ is the video frame rate in fps (i.e. segments of 10, 20, 30, 60, 90, and 120 seconds). TCOE considers all possible D-frame segments within the video. We show an example for 10-second segments in Figure 6. TCOE_{D,T} ≥ 0 and its optimal value is TCOE_{D,T} = 0. Note that when D = T, TCOE equals COE.
To quantify whether algorithms estimate all people in the field of view, rather than only those with OTS, we define two accessory measures, Mean People Error (MPE) and Cumulative Person Error (CPE). If p_t is the number of people at t and p_{1:T} is the cumulative number of people in the whole video, then

MPE = (1/T) Σ_{t=1}^{T} |n̂_t − p_t| and CPE = |n̂_{1:T} − p_{1:T}| / max(p_{1:T}, 1).
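The counting measures can be sketched as follows, under the interpretation that MOE averages the per-frame absolute count error, COE normalizes by the actual cumulative count, and TCOE averages the unnormalized error over all D-frame segments. Variable and function names are illustrative.

```python
def moe(est_counts, gt_counts):
    """Mean Opportunity Error: mean absolute per-frame count error."""
    assert len(est_counts) == len(gt_counts)
    return sum(abs(e, ) if False else abs(e - g) for e, g in zip(est_counts, gt_counts)) / len(gt_counts)


def coe(est_cum, gt_cum):
    """Cumulative Opportunity Error, normalized by the actual cumulative count."""
    return abs(est_cum - gt_cum) / max(gt_cum, 1)


def tcoe(est_ids, gt_ids, d):
    """Temporal COE over all segments of d frames (d < number of frames).

    est_ids[t] / gt_ids[t] are the sets of identities present at frame t,
    so the unique count over a segment is the size of the union."""
    errs = []
    for t in range(len(gt_ids) - d):
        est_unique = len(set().union(*est_ids[t:t + d]))
        gt_unique = len(set().union(*gt_ids[t:t + d]))
        errs.append(abs(est_unique - gt_unique))
    return sum(errs) / len(errs)
```

MPE and CPE follow the same shapes as `moe` and `coe`, with the total number of people p_t in place of the number of people with OTS n_t.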

Attributes
We compute the per-class precision, recall, and F1-Score for age and gender estimation. We consider the variation of F for different person-signage distances and for different occlusion levels.
We now give a few examples of the definitions of TP, FP, and FN for each attribute. For age estimation and the class [19, 34], a TP is a correct age estimation. To relax the hard boundaries between age ranges, we consider overlapping age ranges of ±2 years, as shown in Figure 7 (e.g. an estimation for a person of 17 years is a true positive if the actual age of the person is in [0, 18] or in [19, 34]). A FP is an incorrect age estimation for a person from another age class, and a FN is an incorrect estimation of the age of a person that belongs to the class. For gender estimation and the class female, a TP is a female estimated as female; a FP is a male estimated as female; and a FN is a female estimated as male.
For algorithms able to output unknown as a possible class, the corresponding estimations will contribute neither as TP, FP nor FN. We show an example of attribute evaluation in Table 1.
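The ±2-year relaxation can be sketched as a membership test on widened ranges. The class labels and bounds below are an illustrative subset of A, not the full set of classes used in the benchmark.

```python
# Illustrative subset of the age classes in A, mapped to (low, high) bounds.
AGE_RANGES = {"[0, 18]": (0, 18), "[19, 34]": (19, 34)}


def is_age_tp(estimated_class, true_age, margin=2):
    """True positive if the true age falls inside the estimated range
    widened by +/- `margin` years (the overlapping-range relaxation)."""
    lo, hi = AGE_RANGES[estimated_class]
    return lo - margin <= true_age <= hi + margin
```

For example, an estimate in class "[0, 18]" for a 19-year-old is a TP, since 19 falls within the widened range [-2, 20].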

Methods
Algorithms must be accurate and run in real time to be suitable for generating reliable and useful AVA. Therefore, we select algorithms for benchmarking that obtain close to state-of-the-art results, that are causal (i.e. they only need past and present information), and that are able to run close to real time in the defined settings. We select algorithms that are compatible with GPU and OpenVINO optimization for fast CPU computation.
We use RetinaFace [14] and Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks (MTCNN) [15] as detection algorithms; Simple Online Real-time Tracking with a Deep Association Metric (DeepSORT) [16], and Towards Real-Time Multi-Object Tracking (TRMOT) [17] as trackers; and FaceLib [18] and Deep EXpectation of apparent age from a single image (DEX) [19] as age and gender estimators. In addition to the above baseline algorithms, we benchmark two commercial solutions, Commercial-1 (C1) and Commercial-2 (C2), which we maintain anonymous. Localization, age, and gender estimation algorithms for AVA are described next and summarized in Table 2.
Algorithm 1 (A1), RetinaFace [14], is a face detector with single-stage pixel-wise dense localization at multiple scales that uses joint extra-supervised and self-supervised multi-task learning. The algorithm predicts a face score, face box, five facial landmarks, and their relative 3D position, using input images resized to a resolution of 640×640 pixels. The algorithm is trained on the WIDER FACE dataset [20].
Algorithm 2 (A2), MTCNN [15], is a face detector with a cascaded structure and three stages of deep convolutional networks that use the correlation between face bounding boxes and landmark localization to perform both tasks in a coarse-to-fine manner. The first stage is a shallow network that generates candidate windows. The second stage rejects false positive candidate windows. The third stage is a deeper network that outputs the locations and facial landmarks. The learning process uses an online hard sample mining strategy that improves performance in an unsupervised manner. The algorithm is trained on the WIDER FACE dataset [20].
Algorithm 3 (A3), DeepSORT [16], is a multi-object tracker that combines the detector YOLOv3 [21] and the tracker SORT [22]. YOLOv3 detects people's bodies with a single neural network that divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities, and the predictions are informed by the global context of the image. As for the tracking module, DeepSORT uses a Kalman filter and the Hungarian algorithm to associate detections over time. In addition, DeepSORT employs a convolutional neural network, trained to discriminate people, that combines appearance and motion information.
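The association step can be illustrated with a toy min-cost assignment. This is an exhaustive stand-in for the Hungarian algorithm, workable only for small numbers of tracks and detections; in DeepSORT the cost matrix would combine IOU-based and appearance distances.

```python
from itertools import permutations


def associate(cost, gate=0.7):
    """Exhaustive min-cost one-to-one assignment of detections to tracks.

    cost[i][j] is the distance between track i and detection j (e.g. 1 - IOU
    or an appearance distance). Pairs with a cost above `gate` are left
    unmatched. Assumes at least as many detections as tracks."""
    n_tracks, n_dets = len(cost), len(cost[0])
    best, best_total = (), float("inf")
    for perm in permutations(range(n_dets), n_tracks):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best_total, best = total, perm
    return [(i, j) for i, j in enumerate(best) if cost[i][j] <= gate]
```

Unmatched detections would then spawn new tracks, and unmatched tracks age out after a few frames.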
Algorithm 4 (A4), TRMOT [17], is a multi-object tracker based on the Joint Detection and Embedding (JDE) framework. JDE is a single-shot shared deep neural network that simultaneously learns detections and the appearance features of the predictions. The algorithm is based on a Feature Pyramid Network [23], which makes predictions at multiple scales. Embedding features are then up-sampled and fused, via skip connections, with feature maps from higher levels to improve the tracking accuracy for people far from the camera (i.e. small bounding boxes).
Algorithm 5 (A5), FaceLib [18], is an open-source repository for face detection, facial expression recognition, and age and gender estimation. The age and gender estimation modules use as input the true positive detections generated by RetinaFace (A1), and they use a ShuffleNet V2 with 1.0x output channels [24] as architecture. FaceLib is trained on the UTKFace dataset [25].
Algorithm 6 (A6), DEX [19], estimates the apparent age and gender using as input the true positive detections generated by RetinaFace (A1). DEX uses an ensemble of 20 networks on the detected face and it does not require the use of facial landmarks. This work introduces and uses the IMDB-WIKI, a large public dataset of face images with age and gender annotations.
Commercial off-the-shelf 1 (C1) is composed of a frontal face detector based on a cascade detector that uses a variety of gray-scale local features, a tracker designed for frontal, eye-level images; and a gender and age classifier based on regression trees.
Commercial off-the-shelf 2 (C2) uses a tracker based on a Kalman Filter for tracking people specifically designed to be robust to occlusions; and a gender and age classifier based on deep learning.
We employ A1-4 and C1-2 to estimate the instantaneous and cumulative number of people with OTS. The instantaneous number of people at t is estimated as the number of detections/tracks at the current time. The cumulative number of people between any two instants, t_1 and t_2, is estimated as the number of unique identities j during the considered time segment. Detectors (A1-2) simply assign a new identity to each detection without considering temporal relationships, thus producing an overcount. As trackers require detectors plus a mechanism for the temporal association of detections to prevent multiple counts of the same person over time, trackers generally require more computational resources than detectors.

We make the codes and pre-trained model weights of all baseline algorithms available at the project website. Without altering the core of the original algorithms, we integrate changes to ensure that every algorithm processes the videos and generates the outputs following the same procedure. The codes for all baseline algorithms have been modified to enable both GPU (NVIDIA inference) and CPU (OpenVINO inference) computation in PyTorch (Table 2).
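The instantaneous and cumulative counts can be derived directly from per-frame identity sets, as sketched below with illustrative names. With a detector, each detection carries a fresh identity, so the cumulative count degenerates to the total number of detections and overcounts.

```python
def instantaneous_count(ids_per_frame, t):
    """People with OTS at frame t: the number of detections/tracks at t."""
    return len(ids_per_frame[t])


def cumulative_count(ids_per_frame, t1, t2):
    """Unique identities between frames t1 and t2 (inclusive).

    With a tracker, identities persist over time, so re-detections of the
    same person are counted once."""
    unique = set()
    for ids in ids_per_frame[t1:t2 + 1]:
        unique.update(ids)
    return len(unique)
```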

Videos
The dataset was collected in settings that mimic real-world signage-camera setups used for AVA. The dataset is composed of 16 videos recorded at different locations such as airports, malls, subway stations, and pedestrian areas. Outdoor videos are recorded at different times of the day, such as morning, afternoon, and evening. The dataset is recorded with Internet Protocol or USB fixed cameras with wide and narrow lenses to mimic real-world use cases. Videos are recorded at 1920 × 1080 resolution and 30 fps. The dataset includes videos of duration between 2 minutes 30 seconds and 6 minutes 26 seconds, totaling over 78 minutes and over 141,000 frames. The videos feature 34 professional actors of multiple ethnicities, with ages from 10 to 80, and including male and female genders. People have been recorded with varied emotions while looking at the signage. A sample frame of each location is shown in Figure 8. We show sample frames with a reduced number of people to facilitate the visualization of the background. For the mall location, videos were recorded at different times both indoors (Mall-1/2) and outdoors (Mall-3/4).

Annotations
A professional team of annotators used the Intel Computer Vision Annotation Tool [26] from the OpenVINO ecosystem to fully annotate all videos with the following attributes (Figure 9): bounding boxes for the face and body of each person, identity, age, gender, attention, pose, orientation, and occlusions. Annotations were generated for every key-frame. The key-frames were selected depending on the behavior of the person to be annotated and on their location with respect to the camera: people closer to the camera (or moving faster) were annotated more often than people farther away (or moving slower). Inter-key-frame annotations were generated using linear interpolation. The annotations were validated by expert annotators who checked the consistency of person-face groups and of person identities within and across videos, and performed an individual visual inspection of each annotated attribute.

To prevent the analytics from focusing on very small (far from signage) people, who are unlikely to have OTS, and to simplify the annotation process, in some scenarios we define a region where people are omitted, and thus not annotated. We refer to these regions as ignore areas, shown as a white shading in Figure 8. Estimations within the ignore areas are also omitted. The annotations maintain the identity of each person throughout the same video, even if the person exits and re-enters the field of view, and even across videos. However, for the purpose of AVA benchmarking, when an actor exits and re-enters the field of view of a camera within the same video after more than 10 seconds, we consider the actor to have a new identity. Each video includes between 11 and 158 unique people. The dataset annotation includes a total of 785 unique people and over a million annotations (Table 3).

Table 5: Input frame rate for each localization algorithm and system when using CPU (GPU) inference. The selected frame rate allows the algorithm-system combination to achieve (near) real-time processing speed.
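The inter-key-frame interpolation can be sketched as below. This is an illustrative helper, not the annotators' actual tooling.

```python
def interpolate_box(key_t0, box0, key_t1, box1, t):
    """Linearly interpolate an (x, y, w, h) box between two key-frames,
    as done for the inter-key-frame annotations."""
    assert key_t0 <= t <= key_t1
    alpha = (t - key_t0) / (key_t1 - key_t0) if key_t1 != key_t0 else 0.0
    return tuple(a + alpha * (b - a) for a, b in zip(box0, box1))
```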

Benchmark: results and discussion

Experimental setup
We evaluate the performance of the algorithms on four systems that enable on-the-edge AVA processing. The systems' properties are summarized in Table 4. The algorithms are executed in both GPU and CPU (one core), separately. GPU inference is used for systems with an integrated NVIDIA GPU. CPU inference is used for all systems, and it can be native (i.e. without optimization) or optimized using OpenVINO. The use of OpenVINO optimization depends on the system and the algorithm: systems must be equipped with an Intel processor, and algorithms must be compatible with the OpenVINO optimization. Among the baseline algorithms (A1-6), all but A4 are OpenVINO compatible. We define as real-time processing the capability of a system-algorithm pair to complete the analytics for I_t before the new data frame I_{t+1} is available. When a system-algorithm pair does not achieve real-time processing (e.g. input data arrives at 30 fps and the processing speed is 1 fps), one can reduce the frame rate and/or resolution of the videos. We decide to maintain the resolution of the videos and intentionally reduce the frame rate (i.e. drop frames) of the input videos to ensure that every system-algorithm pair performs in (near) real time. Table 5 shows the input frame rate of the videos that we use in all experiments for every system-algorithm pair. For System 1, all algorithms run with 30 fps videos with both GPU and CPU inference. For System 2, all algorithms run with 30 fps videos with GPU and 3 fps videos with CPU, except for A4 with CPU, which runs with 1 fps videos. For Systems 3 and 4, all algorithms run with 3 fps videos, except for A4, which runs with 1 fps videos.
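The frame-dropping scheme can be sketched as an index subsampler. This is illustrative; the benchmark's exact frame-selection policy may differ.

```python
def subsample_indices(n_frames, native_fps=30, target_fps=3):
    """Indices of the frames kept when dropping frames to reach target_fps.

    Keeps roughly every (native_fps / target_fps)-th frame, so a
    system-algorithm pair that cannot process the native rate in real
    time sees fewer frames at the original resolution."""
    step = native_fps / target_fps
    return [int(round(i * step)) for i in range(int(n_frames / step))]
```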
Most of the results shown next are represented by box plots. The horizontal line within the box shows the median; the lower and upper edges of the box are the 25th and 75th percentiles; and the lower and upper whiskers show the minimum and maximum values.
We provide an online evaluation tool that allows one to effortlessly assess the performance of AVA algorithms on the proposed dataset; further information regarding the requested data format and the use of the tool is available at the project website.

Localization

Figure 15 shows the evaluation results of A1-2 on face localization and of A3-4 on person localization. Regarding the detectors, A2 obtains a higher precision but a lower recall than A1, indicating that A1 generates a larger number of false positives than A2. However, the median F1-Score results show that A1 outperforms A2 by a small margin. Results are similar across inference units (i.e. GPU vs CPU), except for A1, which has a smaller error range when executed on CPU.

When the task is to count the cumulative number of unique people with OTS, the Cumulative Opportunity Error (COE) indicates that tracking algorithms obtain a more accurate count than detection algorithms. With System 1, the tracking algorithms A3-4 obtain a median COE under 1.02 and 4.20, respectively, while the detection algorithms A1-2 obtain a median COE at least two orders of magnitude higher. In this case, algorithms using the GPU for inference generally obtain a higher error than when using the CPU. This is because algorithms that process more frames (i.e. GPU) are more prone to overcounting the cumulative number of people than algorithms that process fewer frames (i.e. CPU). Algorithms using CPU inference take as input videos with a lower frame rate than when using GPU inference (e.g. System 2 uses 3 fps videos with CPU inference and 30 fps videos with GPU inference), thus the overcount is likely to be reduced as the number of processed frames is reduced. This effect can also be seen in the results with System 1, where CPU and GPU use the same video frame rate and therefore obtain very similar results regardless of the inference type.
When the task is to estimate the cumulative number of people over a specific segment of time, the Temporal Cumulative Opportunity Error (TCOE) results show that the error increases monotonically with the duration of the segment, D (Figure 11). Also, detectors (A1-2) obtain a TCOE several orders of magnitude higher than trackers (A3-4). With System 1 GPU, trackers obtain a TCOE of 1.5-19 (A3) and 6-70 (A4). The most accurate algorithm for this task is A3, which can estimate the cumulative number of people with OTS for a 10 (120)-second segment with a median TCOE of ±2.20 (±18.81), with System 1 and GPU.

Counting
Results indicate that the input frame rate of the videos has an important effect on the performance of
the algorithms. Thus, we analyze the trade-off between the error obtained by the algorithms and the input frame rate of the videos. This analysis is divided into two experiments.

In the first experiment, we intentionally modify the input frame rate of the videos to γ = {0.25, 0.5, 1, 2, 6, 7.5, 10, 15, 30} fps and compare the performance of the localization algorithms in terms of MOE, MPE, and COE. Figure 12 shows the results. For MOE and MPE, in the first row, the errors are mostly flat. This indicates that these performance measures are not affected by the input frame rate of the videos, as expected, since both are computed on a frame-by-frame basis: reducing the frame rate reduces the number of frames that the algorithms compute, but this does not affect their performance, as non-seen frames are used neither for inference nor for evaluation. However, when computing cumulative counts across time, the performance of the algorithms can be affected. In fact, in the second row of Figure 12, we observe that COE varies for A3-4 when the input frame rate of the videos changes. COE monotonically increases with the frame rate for the detection algorithms (A1-2). The performance obtained by A3 does not change substantially for different input frame rates, and its minimum occurs around 6 fps. Surprisingly, COE also monotonically increases for A4 when the frame rate increases, although, in absolute terms, its error is considerably lower than that obtained by the detection algorithms. This indicates that higher frame rates do not necessarily ensure higher counting performance.

In the second experiment, we apply the same frame rate reduction as in the previous experiment and compute TCOE for segment durations of D = {10, 20, 30, 60, 90, 120} · γ frames (i.e. 10, 20, 30, 60, 90, and 120 seconds). The results are in Figure 13. The results in the first two rows indicate that TCOE monotonically increases for the detection algorithms (A1-2) when the frame rate increases, and also when the duration of the segment increases. As mentioned before, this occurs because detectors have no mechanism for counting the cumulative number of people: when the segment duration increases, the overcount accumulates further, and thus TCOE also increases. Unlike detectors, trackers (A3-4), shown in the last two rows of Figure 13, obtain a more stable TCOE for any segment duration and frame rate, with segments of larger duration producing a slight increase in TCOE. Note that, as A4 cannot be optimized using OpenVINO for CPU computation, we limit the input frame rate of its videos to at most 1 fps, as videos with a higher frame rate make the algorithm very slow. In this experiment, A3 is the most accurate baseline algorithm under comparison. Visual and detailed per-video results obtained by each algorithm are available at the project website.
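The frame-rate sweep of the first experiment can be sketched as follows. Here `run_counter` is a hypothetical callable that returns an algorithm's cumulative count on the video subsampled to a given frame rate, and the error follows the normalized COE form used in this paper.

```python
def sweep_frame_rates(run_counter, video, gt_cum,
                      rates=(0.25, 0.5, 1, 2, 6, 7.5, 10, 15, 30)):
    """COE as a function of the input frame rate.

    run_counter(video, fps) -> estimated cumulative count at that frame rate;
    gt_cum is the actual cumulative number of people with OTS."""
    results = {}
    for fps in rates:
        est_cum = run_counter(video, fps)
        results[fps] = abs(est_cum - gt_cum) / max(gt_cum, 1)
    return results
```

Plotting `results` against `rates`, one would look for the behavior reported above, e.g. a COE minimum around 6 fps for A3.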

Age and gender estimation
Age and gender estimation algorithms require as input a crop of the person's face. We employ as face detector the algorithm that obtains the highest recall (A1).
To fairly compare the age and gender estimation algorithms and disregard the localization performance, we compute the estimates for correctly detected (i.e. true positive) faces only. The results for A5-6 are shown in Figure 14: the first three columns report age estimation, and the last three columns gender estimation.
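Restricting the evaluation to true-positive faces requires matching each detected face box to an annotated one, typically via an intersection-over-union (IoU) threshold. A minimal sketch of such a matching step (our own greedy illustration, not the benchmark's released matching code; the 0.5 threshold is an assumption):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def true_positive_faces(detections, annotations, thresh=0.5):
    """Greedily match detected face boxes to annotations; only the
    matched (true-positive) detections are passed on to the age and
    gender estimators."""
    matched, used = [], set()
    for det in detections:
        best, best_iou = None, thresh
        for i, ann in enumerate(annotations):
            if i not in used and iou(det, ann) >= best_iou:
                best, best_iou = i, iou(det, ann)
        if best is not None:
            used.add(best)
            matched.append((det, annotations[best]))
    return matched
```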
Age estimation results indicate that this task is challenging, as the highest median F1-Score obtained is below 0.7. The highest results are obtained for the middle age ranges (e.g. [19,34]), whereas performance degrades for the younger and older age ranges. Gender estimation results indicate that A5-6 obtain a similar F1-Score for both classes (male, female). For the male class, both algorithms obtain a precision between 0.4 and 0.8 with a recall between 0.6 and 1.0; the opposite behavior is observed for the female class, with higher precision than recall. For this task, results indicate that all systems and inference types behave similarly. Note that A6 is not able to run in System 4 due to memory limitations.
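The per-class precision, recall, and F1-Score reported above follow the standard definitions, computed only over the matched (true-positive) faces. A self-contained sketch of this computation (illustrative only; not the benchmark's released evaluation code):

```python
def per_class_f1(gt_labels, pred_labels, classes):
    """Per-class precision, recall, and F1-Score over paired ground-truth
    and predicted labels (one pair per true-positive face)."""
    scores = {}
    for c in classes:
        tp = sum(1 for g, p in zip(gt_labels, pred_labels) if g == c and p == c)
        fp = sum(1 for g, p in zip(gt_labels, pred_labels) if g != c and p == c)
        fn = sum(1 for g, p in zip(gt_labels, pred_labels) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = (prec, rec, f1)
    return scores
```

For instance, a gender estimator that turns one of two annotated males into a female obtains, for the female class, higher precision than recall would suggest on its own: the asymmetry between the two classes discussed above is exactly this effect.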

Commercial solutions
As we did not have access to the source codes, the commercial solutions (C1-2) were executed on two external systems equipped with an Intel i7 CPU. To preserve the integrity of the complete dataset, we shared only a subset of it with the creators of C1-2 until submission time. Thus, for this comparison, we use a subset of the dataset composed of the following videos: Airport-1, Airport-2, Mall-3, Mall-4, Pedestrian-2, and Pedestrian-3. As a reference, we also report the results obtained by the baseline algorithms (A1-6) with System 1 GPU on the same subset of videos.
The results of this experiment are reported in Figure 16, where all algorithms are executed with the same video inputs (at 30 frames per second), A1-6 run in System 1 with GPU, and C1-2 run in CPU. For localization, A1-2 and C1-2 are evaluated for face localization, and A3-4 are evaluated for person localization. Results show that both commercial solutions (C1-2) perform similarly to A2 for all performance measures, while A1 outperforms the other face detectors, obtaining a recall higher than 0.4. Note that A1-2 obtain a COE several orders of magnitude larger than the rest of the algorithms; these results are therefore outside the chart ranges to ease visualization. The company developing C2 defines OTS as any person within the field of view, differently from the definition proposed in this paper. To take this into consideration, Figure 16 shows two bar plots for C2: the one with the highest error uses the definition proposed in this paper, and the other uses the definition proposed by C2 (i.e. CPE, Eq. 8, is reported). When we consider the ability of the algorithms to count the cumulative number of people for different segment durations, the algorithms obtain an average TCOE between 1 and 50 (Figure 17). C1 is again the algorithm with the lowest error for all segment durations, achieving a TCOE between 2 and 4.
In addition to the results with the proposed performance measures, we show the raw estimated and annotated counts for a set of selected videos in Figure 19. We can observe that, for the instantaneous count, most of the algorithms often estimate a number of people close to the annotated value. An exception is A1, which often generates multiple false positives, thus overestimating the actual number of people with OTS at a given instant. Regarding the cumulative count (second column of Figure 19), and as done for COE, we show the estimated cumulative count of unique people with OTS from the initial frame. The charts indicate that the most accurate algorithms in this task are C1 and A3, and that the cumulative count is a challenging task: even state-of-the-art trackers (i.e. A4) and commercial solutions (i.e. C2) might drift from the actual number of people. The detailed per-video results are available on the project website.

Figure 18 shows the statistics of the execution speed of the baseline algorithms on the four systems, with both CPU and GPU, and of the commercial solutions. Among the baseline algorithms (A1-6), the fastest are the age and gender estimation algorithms (A5-6), followed by the detection algorithms (A1-2); the tracking algorithms (A3-4) are the slowest. All six baseline algorithms run close to real time (i.e. 30 fps), indicated in the charts by the black dashed line, on Systems 1-2 with GPU. In general, the chart confirms that GPU inference is the fastest, followed by OpenVINO-optimized CPU inference, with non-optimized CPU inference the slowest. For instance, in System 1, A1 with OpenVINO optimization increases its average execution time by 3.6 times with respect to GPU (from 0.10 s per frame to 0.36 s per frame). In the same system, A4 (without OpenVINO optimization) increases its average execution time by 19 times (from 0.08 s per frame in GPU to 1.52 s per frame in CPU).
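The per-frame execution times reported above can be gathered with a simple wall-clock harness around the per-frame inference call. A minimal sketch (our own illustration; `infer` stands in for any of the baseline algorithms' per-frame inference functions):

```python
import time
import statistics

def time_per_frame(infer, frames):
    """Run an inference callable on each frame, returning the mean
    seconds per frame and the effective processing rate in fps."""
    durations = []
    for frame in frames:
        start = time.perf_counter()
        infer(frame)
        durations.append(time.perf_counter() - start)
    mean_s = statistics.mean(durations)
    return mean_s, (1.0 / mean_s if mean_s > 0 else float("inf"))
```

An algorithm averaging 0.10 s per frame processes 10 fps, so a 30 fps real-time budget (the dashed line in Figure 18) leaves roughly 0.033 s per frame.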

Conclusion
We proposed an open-source benchmark for the evaluation of Anonymous Video Analytics (AVA) for audience measurement and released to the research community the first fully-annotated dataset that enables the evaluation of AVA algorithms. Using this benchmark, we conducted a set of experiments with eight baseline algorithms and two commercial off-the-shelf solutions for the tasks of localization, counting, and age and gender estimation. All tasks were evaluated on four systems, with CPU and GPU inference. Results showed that trackers perform better than detectors in all scenarios, that localization algorithms degrade when objects are far or occluded (Figures 15 and 10), and that the use of higher input frame rates does not ensure better performance. Further efforts should be made towards the design of holistic tracking solutions that synergistically consider body and face to account for robustness and attention attributes. The performance of age estimation algorithms is limited, and results suggest that performance degrades when estimating the age of younger, [0,18], and older, [65+], people. This might be due to an existing bias in the datasets used for training age estimation algorithms. Based on the outcomes of the benchmark, future work could explore the design of improved AVA algorithms for age estimation and for cumulative counting based on multiple object tracking, as well as for attention, to evaluate audience responsiveness to an advertisement.

Availability of data and materials
The dataset, benchmark and related open-source codes are available at http://ava.eecs.qmul.ac.uk.