Remote expert viewing, laboratory tests or objective metrics: which one(s) to trust?

Wien, Mathias; Jung, Joel

doi:10.1186/s13640-024-00630-7

Research
Open access
Published: 17 June 2024

EURASIP Journal on Image and Video Processing volume 2024, Article number: 16 (2024)

Abstract

We present a study on the validity of quality assessment in the context of the development of visual media coding schemes. The work is motivated by the need for reliable means for decision-taking in standardization efforts of MPEG and JVET, i.e., the adoption or rejection of coding tools during the development process of the coding standard. The study includes results considering three means: objective quality metrics, remote expert viewing, which is a method designed in the context of MPEG standardization, and formal laboratory visual evaluation. The focus of this work is on the comparison of pairs of coded video sequences, e.g., a proposed change and an anchor scheme at a given rate point. An aggregation of performance measurements across multiple rate points, such as the Bjøntegaard Delta rate, is out of the scope of this paper. The paper details the test setup for the subjective assessment methods and the objective quality metrics under consideration. The results of the three approaches are reviewed, analyzed, and compared with respect to their suitability for the decision-taking task. The study indicates that, subject to the chosen test content and test protocols, the results of remote expert viewing using a forced-choice scale can be considered more discriminatory than the results of naïve viewers in the laboratory tests. The results further indicate that, in general, the well-established quality metrics, such as PSNR, SSIM, or MS-SSIM, exhibit a high rate of correct decision-making when their results are compared with both types of viewing tests. Among the learning-based metrics, VMAF and AVQT appear to be most robust. For the development process of a coding standard, the selection of the most suitable means must be guided by the context, where a small number of carefully selected objective metrics, in combination with viewing tests for unclear cases, appears recommendable.

1 Introduction

In the context of the development of compression algorithms for visual media, the determination of compression efficiency and quality improvements plays a crucial role. For conventional 2D video, the standardization groups MPEG, VCEG, and JVET have developed common testing procedures to allow for a fair comparison between a so-called “anchor”, which represents the performance of a test model, which in turn implements the draft standard at a certain point in time, and the so-called “proposal”, which attempts to improve the performance of this test model. These testing procedures include defined test sets and encoder configurations for assessing the compression performance, known as Common Test Conditions, e.g., [1]. Since these testing conditions typically remain stable over a long period in the development process, this also enables the groups to assess the improvement in performance over time. It is remarkable that for 2D video, while being systematically questioned by the expert community, the assessment typically relies on the Peak Signal-to-Noise Ratio (PSNR), i.e., the pixel-based Euclidean distance between the source and the reconstructed compressed signal, with elaborate methods and procedures for evaluation [2]. While this approach does not necessarily imply decisions towards best visual performance, it has been observed that verification testing consistently indicates substantial visual gains, see, e.g., the verification tests for Versatile Video Coding (VVC) [3,4,5]. Such verification testing is usually performed at the end of the standardization process, with laboratories conducting a formal subjective assessment of the compression performance of the coding [6]. Similarly, in the contexts of 3D and immersive video standardization, objective metrics are considered.
In MPEG Immersive Video (MIV) [7], which deals with the coding of multiple views (textures and depth maps) to enable free navigation in the scene with six degrees of freedom (called 6DoF), common test conditions are similarly derived to guide the adoption process [8]. Given the challenge of quality assessment of immersive video, no unique metric was considered suitable for the development process. Instead, a combination of metrics is employed. The list of considered metrics has evolved over time: from an initial set of five metrics (Video Multimethod Assessment Fusion (VMAF), Structural Similarity Index Measure (SSIM), PSNR, Weighted Spherical PSNR (WS-PSNR), Immersive Video PSNR (IV-PSNR)), only two of them are currently reported (PSNR, IV-PSNR). In case of debatable or contradicting results, visual checks or remote expert viewing sessions are performed. In MPEG V-PCC and V-Mesh activities (video-based point cloud compression and video-based mesh compression) [9] the list of considered metrics includes dedicated metrics, such as point-to-point D1, point-to-plane D2, yuvPSNR applied on texture maps, and uvPSNR applied on the uv-coordinates, i.e., the position of each texture coordinate vertex [10].
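As a hedged illustration of the PSNR-based pairwise comparison mentioned above, the following Python sketch computes the PSNR from the mean squared error between source and reconstructed samples and then compares an anchor with a proposal. The function names, the flat-list sample representation, and the peak value of 2^B - 1 for bit depth B are assumptions made for this example; they are not taken from the common test conditions [1, 2].

```python
import math


def psnr(reference, distorted, bit_depth=8):
    """PSNR in dB between two equally long sequences of sample values."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all samples.
    sse = sum((r - d) ** 2 for r, d in zip(reference, distorted))
    mse = sse / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    # Assumption: peak value 2^B - 1; other conventions exist.
    peak = (1 << bit_depth) - 1
    return 10.0 * math.log10(peak * peak / mse)


def metric_prefers_proposal(source, anchor, proposal, bit_depth=8):
    """True if the proposal has the higher PSNR at the given rate point."""
    return psnr(source, proposal, bit_depth) > psnr(source, anchor, bit_depth)
```

Such a per-pair decision is only meaningful if anchor and proposal are compared at a matching rate point, as is the case for the test pairs considered in this paper.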

In the context of standards development for visual media, frequent subjective evaluation would be very helpful in assessing the progress in terms of visual quality improvement. At the same time, the related effort is quite high, both in terms of human resources and time. Furthermore, the question remains unresolved as to what extent the employed objective metrics may be considered reliable when it comes to the visual quality impact of specific coding tools. This question motivates the investigation and development of suitable means for decision-taking for any type of visual media (2D, 360$^\circ$, immersive videos, point clouds, meshes, etc.). The focus in this effort is on reliable means for decision-taking, i.e., the assessment of often very tiny compression improvements of a proposed change to an anchor. The evaluation method to be established (by subjective testing and/or using objective metrics) must be reproducible, and understood and accepted by the standardization group to serve the intended decision-taking usage. The problem scales with the increase in the degree of freedom of user interaction and assessment. While the viewing conditions for conventional 2D video may be inherently defined by presenting the compressed video sequence on a suitable device, the coding of immersive video implies the choice of an individual viewing perspective by the user. The intended use of immersive visual media relies significantly on interactivity and may allow for an assessment from virtually any viewing direction and viewing path. However, this increased degree of freedom further implies the use of an extended processing chain operating between the decoding of the compressed visual media signal and the chosen assessment device, such as conventional monoscopic or stereoscopic displays, head-mounted displays, or mobile devices which, e.g., allow for navigation in the scene by movement of the device.
Hence, multiple additional aspects such as mono- or stereoscopic rendering or user interaction for view path selection arise and the testing task becomes even more challenging.

In ISO/IEC JTC 1/SC 29, the Advisory Group 5 MPEG Visual Quality Assessment (AG 5) is tasked with the investigation, development, and recommendation of tools and methods for this purpose [11]. AG 5 has developed a remote expert viewing (REV) protocol [12] with the goal of a) providing a reliable means for ranking the visual quality of the proposal and the anchor in the development process, and of b) testing objective metrics for their suitability for doing the same task. The method originally relied on remote assessments under the conditions of the COVID pandemic in the years 2020-2022. It is similarly applicable for on-site use, e.g., with experts attending a standardization meeting in presence. The REV scheme has been adopted by the MPEG working groups on Video and 3D Graphics, as well as JVET, for various purposes such as tool development or the preparation of calls for proposals or verification tests, e.g., [13].

Numerous studies have analyzed the correlation between objective metrics and MOS. Similarly, comparisons of subjective methods have also been widely researched. For instance, studies [14] and [15] evaluated subjective methods in the context of mobile video and 3D video, respectively, and more recently, in the context of virtual reality as seen in [16]. Due to the extensive literature on this topic, we focus our description on the most recent studies. [17] has explored the feasibility of performing subjective tests with a limited number of viewers but with repetitions. Study [18] compares the DCR (Double Stimulus Continuous Quality Scale) with the EVP (Enhanced Video Perception) rating scale to the traditional ACR-HR (Absolute Category Rating with Hidden Reference) approach. Finally, in 2024, a study was published comparing omnidirectional video and spatial audio conditions in terms of subjective quality and the corresponding impact on the resolving power of metrics [19].

In this paper, a study on the problem of quality assessment and decision-taking is presented considering three perspectives: objective metrics, remote expert viewing, and formal visual evaluation under laboratory conditions. The focus is on the evaluation of pairs of coded video sequences, which typically comprise a proposed change to a video coding scheme and the unmodified version. The original and the changed version are compared at one or more selected rate points. No further aggregation, e.g., as provided by the Bjøntegaard Delta rate, is regarded here. To lower the dimensionality of the general problem, the focus is on test material from 2D video compression which has been assessed by the REV method in JVET. Most of the test content represents tool on/off tests, i.e., the comparison of a proposed change in the coding scheme to the anchor represented by the unchanged reference software. An extension to immersive visual media compression is future work, in which aspects such as user interaction and different playout devices may be considered.

The main questions to be addressed can be phrased as follows: Is the REV method functional? Are objective metrics able to indicate the correct decision? And, considering the standardization scope of this work, is it possible to detect difficult cases, e.g., by objective metrics, such that an indication of the need for some type of subjective quality assessment can be drawn?

The paper is organized as follows: in Sect. 2, the data set, the REV method, and laboratory test are presented. Section 3 details the assessment of the data set by objective metrics and presents the results of the objective evaluation of the data set. In Sect. 4, the results of the REV tests and the laboratory test are analyzed with respect to consistency, reliability, and potential challenges. In Sect. 5, an overall discussion of the objective and subjective results is presented and potential answers to the question in the title are provided. We conclude the paper in Sect. 6.

2 Data set and visual test methodologies

2.1 Content description

The data set used in this study for comparison of Remote Expert Viewing (REV) tests and laboratory viewing (LAB) tests comprises test results acquired in a series of six JVET meetings during the COVID period from 2021 to 2022. It comprises a total of 232 test points, including trapping sequences inserted into the test sessions for control purposes. All test points have been evaluated by JVET video coding experts. The results are reported in [20,21,22,23,24,25]. The REV tests were mostly conducted in the context of an exploration experiment called EE1, investigating the compression efficiency improvement of neural network-based (NN) coding tools. Furthermore, modifications to the deblocking filter applied to both JVET test models, the VVC Test Model (VTM) and the Enhanced Compression Model (ECM), were investigated, and new test sequences were evaluated. The full data set, called DS, is presented in Tables 14 and 15 in the Appendix of this paper. In the REV tests, each test point is compared to its corresponding anchor (VTM in the case of EE1, VTM and ECM, respectively, for the deblocking filter tests, and the HEVC test model (HM) for the exploration of new test sequences). For this study, the data set has been divided into six categories listed below:

Loop filters (LF): This category includes all NN-based proposals for in-loop enhancement filters, NN-based deblocking filters, combinations of these, as well as tests for modifications of the conventional deblocking filter.
DNN super-resolution (DNN-SR): This category includes all proposals for NN-based re-scaling, where the test sequence is coded at a lower resolution (subsampled by a factor of two in both horizontal and vertical directions) and subsequently up-sampled with a proposed NN-based method.
DNN decomposition - compression - synthesis (DNN-DCS): This category includes proposals with a modified coding loop where texture detail is represented at full spatial resolution while temporal changes are encoded at a lower spatial resolution.
Reference picture resampling (RPR): RPR is a coding tool in VVC enabling the change of picture resolution within a coded video sequence. It can be used for coding a sequence at a lower resolution and upscaling it with the standardized RPR method. Since it is readily available with VVC, it is used as a reference point for proposals in the DNN super-resolution category in the context of the JVET exploration experiments.
Comparing HM and VTM (HM-VTM): A comparison of the HM and the VTM was performed in JVET for studying properties of new test sequences which were considered for potential inclusion in the set of test sequences of the common test conditions.
Trap (TRP): This category includes control test points inserted into the REV test sessions to verify the validity and consistency of the rating of the participants. Such traps could, e.g., consist of a sequence coded at two clearly different qualities, or of a comparison of a compressed sequence to an uncompressed one. In either case, the incorrect scoring of a participant (e.g., rating the compressed version over the uncompressed original) indicates either a lack of attention, or issues with the setup at the remote participant's site, or other problems.

2.2 Content selection for laboratory tests

Due to the size of the full data set, a formal subjective evaluation of all test points in a laboratory test was not possible. Therefore, a subset was created, referred to as DS-LAB in the following. It was created by manually selecting test points according to the criteria listed below.

“INC”, which shows inconsistent or unclear results for the objective metrics under evaluation. Test points in this class, e.g., show diverging results among objective metrics or in comparison to the subjective scores from the REV tests.
“SIG”, where the REV revealed a Differential Mean Opinion Score (DMOS) close to zero and a confidence interval overlapping or touching the zero line, i.e., a result that is not significant or barely significant.
“LC”, which shows a large confidence interval in the REV but is clearly removed from DMOS = 0. The large size of the CI is taken as an indication that participating experts scored differently, which indicates the potential occurrence of artifacts which might be difficult to rate, either subjectively or objectively. Such cases may occur, e.g., if a proposal shows more details, yet also more artifacts than the anchor.
“OPP”, which shows opposite results at two tested rate points, e.g., the proposal is better at the low rate but worse at the high rate.
“DIF”, which shows a clear difference between the anchor and the proposal under test in terms of DMOS for the REV. The confidence interval does not include DMOS = 0. These test points are considered clear cases.

Based on these criteria, 52 pairs of test sequences were identified as candidates for the LAB test. The selected points are marked in bold font in Tables 14 and 15 in the Appendix. They are further tagged with bold “INC”, “SIG”, “LC”, “OPP” or “DIF” to indicate the applicable criterion for the respective test points. By design, the 52 pairs of test sequences in DS-LAB are considered to be difficult for the expert viewers to score. Hence, they are also expected to be difficult to score for the naïve viewers and for the objective metrics.
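The selection for DS-LAB was performed manually. Purely as an illustration of how the score-based criteria relate to the DMOS and its confidence interval, the following Python sketch tags a test point as “SIG”, “LC”, or “DIF” and detects the “OPP” case for two rate points. The thresholds near_zero and wide_ci, the function names, and the order in which the criteria are checked are hypothetical choices for this example and were not used in the study. The “INC” criterion depends on the objective metrics and is not covered.

```python
def tag_test_point(dmos, ci_half_width, near_zero=0.25, wide_ci=0.5):
    """Return "SIG", "LC", "DIF", or None for a single REV test point.

    near_zero and wide_ci are hypothetical thresholds for this sketch.
    """
    lower = dmos - ci_half_width
    upper = dmos + ci_half_width
    # A confidence interval overlapping or touching the zero line.
    includes_zero = lower <= 0.0 <= upper
    if includes_zero:
        return "SIG" if abs(dmos) <= near_zero else None
    # From here on the result is significantly different from zero.
    if ci_half_width >= wide_ci:
        return "LC"   # clear difference, but viewers scored differently
    return "DIF"      # clear case


def is_opposite(dmos_low_rate, dmos_high_rate):
    """True if the preference flips between two tested rate points (OPP)."""
    return dmos_low_rate * dmos_high_rate < 0
```

For example, a test point with a DMOS of 0.1 and a confidence interval half-width of 0.3 would be tagged “SIG” under these assumed thresholds.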

2.3 REV methodology

ISO/IEC JTC 1/SC 29 Advisory Group 5 MPEG Visual Quality Assessment has developed guidelines for Remote expert viewing (REV) [12]. The guidelines have been developed for the purpose of enabling visual quality assessment during online standardization meetings. They are based on established test protocols, such as ITU-R BT.500, ITU-T P.808, and ITU-T P.910 [26,27,28], and provide recommended steps in terms of the preparation of the video sequences to be tested, as well as preparation and implementation procedures. The REV method has been applied in the context of JVET exploration experiments for 2D video, for core experiments in the development of MPEG Immersive Video (MIV), and in the preparation of verification tests for video-based point cloud coding (V-PCC). In most cases, the REV is used for the comparison of a proposed technology to the previously established anchor. Due to its high discriminatory power, Comparison Category Rating (CCR) [27] is recommended for this purpose, and used in the presented study. Other protocols can also be applied. REV sessions using Absolute Category Rating (ACR) and Degradation Category Rating (DCR) methods [28] were also conducted [13, 29]. To enable easy application, the guidelines rely on the use of widely available open-source software (ffmpeg [30], VideoLAN VLC [31]) for preparation and viewing. The REV method is briefly presented in the following.

2.3.1 REV procedure

For conducting a REV session, the group appoints a test coordinator and selects the content to be visually evaluated. The coordinator leads the test effort and reports the results to the group. For immersive video content, one or several camera paths (sometimes called viewport or pose trace) are defined for each sequence under test. The decoded video sequences, or the rendered camera paths, of all rate points are converted into mp4 files for playout with VLC on the computers of the participants. The conversion is made via ffmpeg with a high-quality setting (constant rate factor parameter) to prevent the introduction of visible artifacts. The recommended duration of the test sequences is in the range of 5 s to 10 s. The group appoints one or more cross-checkers to verify that the converted mp4 files match the intended content under test. The verified set of test sequences is provided to the test coordinator.
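The conversion step can be sketched as follows. This Python example only assembles an ffmpeg command line for converting a decoded raw YUV file into an mp4 file with a low constant rate factor. The choice of the libx264 encoder, the CRF value of 10, the 8-bit 4:2:0 input format, and the file names are assumptions for illustration; the guidelines [12] may prescribe different settings.

```python
def ffmpeg_conversion_command(src_yuv, dst_mp4, width, height, fps,
                              crf=10, pixel_format="yuv420p"):
    """Build an ffmpeg argument list for a high-quality mp4 conversion.

    Assumptions of this sketch: raw planar YUV input, libx264 encoding,
    and a CRF of 10. Input with a higher bit depth would need a different
    pixel format (e.g. yuv420p10le).
    """
    return [
        "ffmpeg",
        "-f", "rawvideo",                     # headerless raw video input
        "-pixel_format", pixel_format,        # layout of the raw samples
        "-video_size", f"{width}x{height}",   # raw files carry no size info
        "-framerate", str(fps),
        "-i", src_yuv,
        "-c:v", "libx264",
        "-crf", str(crf),                     # lower CRF means higher quality
        "-pix_fmt", "yuv420p",                # widely supported for playout
        dst_mp4,
    ]


cmd = ffmpeg_conversion_command("decoded_rate1.yuv", "rate1.mp4",
                                1920, 1080, 50)
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where ffmpeg is installed. The cross-check described above remains necessary, since the command itself cannot verify that the converted file matches the intended content.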

Volunteers from the group are selected as viewers for the REV sessions. They are expected to report any potential issues with visual acuity or color vision to the test coordinator. If the REV is conducted for the purpose of decision-taking in the adoption process for a proposal, then experts from the proposing institution should not participate to avoid potential bias. The viewers must confirm having suitable equipment and setups available as defined by the test coordinator. This includes a computer capable of smooth playout for the test sequences shown in the REV sessions, a suitable display, and reasonable viewing conditions, such as a calm room with indirect light not reflecting on the screen and a setup with the recommended viewing distance. The suitability of the technical setup is defined by the recommendations of the test coordinator and tested by the participating experts using a demonstration playlist with high-bit rate mp4-files.

Within a REV session, the viewers may be presented with multiple test sessions. The test coordinator provides the anonymized test sequences and the playlists in a zipped and password-protected package to the viewers. The viewers are required to have this data set downloaded and available at the time of the REV session. Further, the test coordinator provides the viewers with scoring sheets formatted to support the voting during the test sessions. The viewers are then instructed to note their scores on printouts of these sheets to minimize their effort during voting. In the REV session, the test coordinator first provides final instructions to the viewers. Furthermore, a training session for the viewers is conducted to verify that the test procedure and the rating scale are properly understood. Based on the training session scores provided to the test coordinator, the password of the package is disclosed, and the participants run the test sessions. The viewers must run each session without interruption and execute their voting during the voting periods of the Basic Test Cells (BTCs). Any operations such as pausing, re-play, or speed manipulation are not permitted. To prevent viewers from taking the test twice, they are requested to provide their scores to the test coordinator immediately after finishing the sessions. In practice, this is handled via a web interface with personalized access for each viewer.

2.3.2 REV rating scales

In the CCR scenario, two rating scales are suggested, depending on the tested material. For MIV, the 7-grade scale of ITU-T P.808 is employed. In JVET, a 4-grade scale has been established for expert viewing in the development phase of the VVC standard [32,33,34]. This scale applies a forced choice. The two scales are presented in Fig. 1. In this paper, results from REV sessions using the 4-grade scale are further studied.

The choice of a 4-grade scale is motivated by two considerations: (a) the two variants of the coded sequence under test are actually different, and (b) the experts participating in the tests as viewers are expected to be able to express an opinion on the observed differences.

2.3.3 Test session design

The test coordinator takes the complete set of provided test sequences and splits them into multiple test sessions. Each session is constructed out of a series of BTCs containing test sequences of the same resolution. A session should not exceed a duration of 15 min. The test sequences are renamed for the purpose of anonymity. For the CCR method, each BTC consists of the uncompressed source sequence, a consecutive playout of versions A and B of the sequence, and a 5 s voting time. The presentation of the uncompressed source may be skipped in cases where no original sequence is available, such as immersive video. To increase the discrimination of small impairments, the A/B pair is shown twice, as suggested in variant II of double stimulus tests in ITU-R BT.500 [26].

Each test session includes a stabilization phase of two to four BTCs to allow for an adaptation phase for the viewers. In each BTC, the A/B playing order of the anchor and the proposal is randomized such that the viewers cannot guess the variant from the playing position. Furthermore, each test session includes at least one trapping BTC where the attention of the viewers is tested. This may include displaying the same sequence as variant A and B within a BTC or displaying two variants with a clearly known quality relation (e.g., two different quantizer settings of the same coding scheme).
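The session design described above can be sketched in Python as follows. The sketch places the stabilization BTCs first, randomizes the A/B order within each BTC, mixes the trapping BTCs into the remaining test points, and estimates the BTC duration. The class and function names, the tuple layout (name, anchor file, proposal file), and the duration estimate that ignores any announcement or gray frames are assumptions of this example.

```python
import random
from dataclasses import dataclass


@dataclass
class BTC:
    name: str
    first: str            # file shown as variant A
    second: str           # file shown as variant B
    proposal_is_b: bool   # kept by the coordinator to revert the order later
    kind: str             # "stabilization", "test", or "trap"


def btc_duration(clip_s, vote_s=5.0, show_source=True, repeats=2):
    """Source (optional), A/B pair shown `repeats` times, then voting."""
    source = clip_s if show_source else 0.0
    return source + repeats * 2 * clip_s + vote_s


def build_session(test_pairs, trap_pairs, stabilization_pairs, seed=None):
    """Each pair is a tuple (name, anchor_file, proposal_file)."""
    if not 2 <= len(stabilization_pairs) <= 4:
        raise ValueError("two to four stabilization BTCs are expected")
    if not trap_pairs:
        raise ValueError("at least one trapping BTC is required")
    rng = random.Random(seed)

    def make(pair, kind):
        name, anchor, proposal = pair
        proposal_is_b = rng.random() < 0.5   # randomized playing order
        first, second = ((anchor, proposal) if proposal_is_b
                         else (proposal, anchor))
        return BTC(name, first, second, proposal_is_b, kind)

    head = [make(p, "stabilization") for p in stabilization_pairs]
    body = ([make(p, "test") for p in test_pairs]
            + [make(p, "trap") for p in trap_pairs])
    rng.shuffle(body)                        # hide the trap positions
    return head + body
```

With 10 s sequences, btc_duration(10) yields 55 s per BTC, so that a session of up to 16 BTCs stays below the 15 min limit under these assumptions.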

2.3.4 Processing of REV results

After the results of the viewers have been received, they are processed by the test coordinator. The A/B randomization of the BTCs is reverted, leading to a consistent mapping of negative scores for the anchor and positive scores for the proposal. The scores of participants failing to vote correctly on the trapping BTCs of a single session are discarded for that session. If a participant fails for multiple sessions, the scores for all test sessions are discarded completely. In the case of a low correlation of a viewer's scores with the overall Comparison MOS (CMOS), further participants' scores may be discarded. The applied criteria and the number of affected viewers are reported by the test coordinator in the REV report documents [20,21,22,23,24,25]. The number of participants and viewer rejections in the test sessions are reported in Table 1.

Table 1 Viewer rejection in the REV tests
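The processing steps for a single session can be sketched in Python as follows. The sketch reverts the A/B randomization, discards viewers who fail a trapping BTC, and computes the CMOS with a confidence interval per test point. Several details are assumptions of this example: votes are taken to be numeric with positive values meaning that variant B was preferred, each trap is taken to have a known expected sign, and the 95% confidence interval uses a normal approximation with the factor 1.96. The correlation-based rejection is not covered.

```python
import math
import statistics


def revert_order(score, proposal_is_b):
    """Map a vote so that positive values always favor the proposal."""
    return score if proposal_is_b else -score


def passes_traps(aligned_votes, traps):
    """traps maps a BTC id to the expected sign (+1 or -1) of the vote."""
    return all(aligned_votes[btc] * sign > 0 for btc, sign in traps.items())


def cmos_with_ci(scores, z=1.96):
    """Mean score and half-width of the confidence interval."""
    if len(scores) < 2:
        raise ValueError("at least two valid scores are required")
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return statistics.fmean(scores), half_width


def process_session(raw_votes, proposal_is_b, traps):
    """raw_votes: viewer -> {btc_id: vote, positive meaning B preferred}."""
    kept = []
    for votes in raw_votes.values():
        aligned = {btc: revert_order(score, proposal_is_b[btc])
                   for btc, score in votes.items()}
        if passes_traps(aligned, traps):
            kept.append(aligned)   # viewers failing a trap are discarded
    test_ids = [btc for btc in proposal_is_b if btc not in traps]
    return {btc: cmos_with_ci([v[btc] for v in kept]) for btc in test_ids}
```

A trapping BTC that shows the same sequence as variant A and B has no expected sign on a forced-choice scale and would need a separate check, which is omitted here.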
