
Quantifying the importance of cyclopean view and binocular rivalry-related features for objective quality assessment of mobile 3D video


3D video is expected to provide an enhanced user experience by using the impression of depth to bring greater realism to the user. Quality assessment plays an important role in the design and optimization of 3D video processing systems. In this paper, a new 3D image quality model that is specifically tailored for mobile 3D video is proposed. The model adopts three quality components, called the cyclopean view, binocular rivalry, and the scene geometry, in which the quality must be quantified. The cyclopean view formation process is simulated and its quality is evaluated using the three proposed approaches. Binocular rivalry is quantified over the distorted stereo pairs, and the scene quality is quantified over the disparity map. Based on the model, the 3D image quality can then be assessed using state-of-the-art 2D quality measures selected appropriately through a machine learning approach. To make the metric simple, fast, and efficient, final selection of the quality features is accomplished by also considering the computational complexity and the CPU running time. The metric is compared with several currently available 2D and 3D metrics. Experimental results show that the compound metric gives a significantly high correlation with the mean opinion scores that were collected through large-scale subjective tests run on mobile 3D video content.

1 Introduction

Recently, with the rapid advances being made in 3D video technologies, mobile 3D video has become a subject of interest for both the entertainment and consumer electronics industries. Mobile 3D video offers a number of challenges, because it is expected to deliver a high-quality experience to the mobile users while using limited resources, including lower bandwidths and error-prone wireless channels. One of the greatest challenges is the evaluation of 3D video quality in a perceptual manner. Normally, a 3D video system includes several signal processing stages, e.g., scene capture and content creation, video format conversion, encoding, transmission, possible post-processing at the receiver side, and rendering and display of the image. Each stage may contribute to the degradation of the 3D visual quality, and the errors that occur at certain steps may propagate through the chain. Therefore, quality assessment (QA) plays an important role in the design and optimization of the system in relation to the prospective users, systems, and services.

QA of any multimedia content is best performed subjectively, i.e., by asking test participants to give their opinions on different aspects of the quality of the content that they experienced. While it is highly informative in that it directly reflects human perception, subjective evaluation has many limitations. It is a time-consuming and expensive process and is not suitable for real-time quality monitoring and adjustment of the systems. Therefore, research on objective QA usually follows the subjective studies to design algorithms that can automatically assess multimedia quality in a perceptually consistent manner. Consider, for example, a wireless multimedia network system: a server can be dedicated to the evaluation of the delivered content quality using objective QA measures, and the results can be used to control and allocate the streaming resources. At the encoding and decoding stages, objective QA can also be used to optimize the encoding and rendering algorithms. Objective QA of conventional (i.e., 2D) images and video has been an active research topic for several decades, but the research work on QA for 3D images and video is relatively young and less mature.

A 3D video can be defined as time-varying imagery that supports the binocular visual cue, which, in combination with other 3D visual cues, delivers a realistic perception of depth. In its simplest form, 3D video is formed using two separate video channels (i.e., left and right) in which the time-synchronized frames form stereo pairs. Early attempts to objectively quantify 3D video images have applied 2D metrics to each frame of the stereo pair. Each frame is viewed as a single image for which the quality is measured separately, and then the overall 3D quality is calculated by averaging over time and space (i.e., the mean of the left and right channel quality values). This approach, however, hardly corresponds to the actual binocular mechanisms of the human visual system (HVS) and, thus, hardly correlates with the subjective quality scores. Recently, the inclusion of some 3D factors as part of the quality evaluation process has been attempted [1]. In [2], a 3D discrete cosine transform (DCT)-based stereo QA method was proposed for mobile 3D video. The method attempts to model the mechanisms of binocular correspondence formation, using the information in the neighboring blocks and contrast masking by grouping similarly sized 4 × 4 blocks of pixels in the left and right channels for joint analysis in the 3D DCT domain. In [3], the local depth variance for each reference block is used to weigh the quality metric proposed in [2] appropriately. In [4], a monoscopic quality component and a stereoscopic quality component for measurement of stereoscopic image quality have been combined. The former component assesses the monoscopically perceived distortions caused by phenomena such as blurring, noise, and contrast change, while the latter assesses the perceived degradation of the binocular depth cues only. In [5], an overall stereo quality metric was proposed through the combination of image quality with disparity quality using a nonlinear function. 
In [6], the 3D video quality was analyzed on the basis of being composed of two parts: the stereoscopic 2D video quality and the depth map quality. In [7], a quality metric for color stereo images was proposed based on the use of the binocular energy contained in the left and right retinal images, which was calculated using the complex wavelet transform (CWT) and the bandelet transform. The authors of [8] proposed two approaches based on depth image-based rendering to compare synthesized views and occlusions. The authors of [9] proposed an objective model for evaluation of the depth quality using subjective results. In [10], the performances of several state-of-the-art 2D quality metrics were compared for quantification of the quality of stereo pairs formed from two synthesized views. In [11], the authors studied the perception of stereoscopic crosstalk and performed a set of subjective tests to obtain mean opinion scores (MOS) of stereoscopic videos. They attempted to predict the MOS by combining a structural similarity index (SSIM) map with a pre-filtered dense disparity map. The quality metric proposed in [12] attempts to predict the perceived quality of color stereo video by a combination of contrast sensitivity function (CSF) filters with rational thresholds.

In [1], an analysis of the factors that influence the 3D quality of experience has been conducted. According to that analysis, the following HVS properties should be taken into account in the design of 3D quality metrics [13]. First, the HVS perceives ‘2D’ types of degradation after they are combined in the cyclopean view and not individually in the left and right channels. Therefore, it is meaningful to measure 2D artifacts on the cyclopean view. The forms of degradation related to the 3D geometry and perceived through disparity are characterized as ‘3D’ artifacts. Thus, the cyclopean image of both the degraded and the reference video streams should be extracted and compared, along with the binocular disparity that is presented in the degraded stream. Second, while the 2D and 3D artifacts can be assessed separately, the content in one visual path may influence the other. The binocular perception of depth is influenced by pictorial depth cues. It is possible that there may be masking or facilitation between the depth cues that come from the two visual paths. Consequently, the 3D quality is influenced by the 2D content. The perception of the asymmetric quality depends on the scene depth. Artifacts in the cyclopean view may be masked by the convergence process. Consequently, the 2D quality is then influenced by the 3D content. The overall quality of a 3D scene is therefore a combination of the ‘cyclopean’ and ‘binocular’ perceptual qualities.

Based on the above analysis, a new model for the assessment of 3D image quality is investigated in this paper. The model considers three components: the cyclopean view, binocular rivalry, and the depth presence. This general model aims to reflect the peculiarities of 3D scene perception. These peculiarities include the fusion of the left and right (stereo) images into a single (cyclopean) image and its 2D quality, the possible influence of binocular rivalry on visual comfort, and the influence of the depth presence on correct perception of the 3D scene geometry. The investigation aims to find suitable features to quantify the qualities of these three components in a 3D image to enable their combination, leading to an objective metric that is in accordance with the subjective opinion. An abundant set of features that are used in the state-of-the-art 2D QA metrics is adopted, and a machine learning approach is applied to find the best combinations of these features. With regard to the formation of the cyclopean view, three different quality models are investigated that depend on whether the image fusion process is simulated at pixel level or at block level. The binocular disparity, i.e., the differences between the images seen in each eye, is an important cue that the HVS uses to perceive 3D scenes. However, artifacts in a stereoscopic pair may introduce unnatural stereoscopic correspondences that cannot be interpreted by the binocular HVS. These effects are perceived as a binocular rivalry, and this binocular rivalry must be quantified. The binocular suppression theory states that masking and facilitation effects exist between the images that are perceived by each eye [14]. It is anticipated that the masking between the eyes works in a similar manner to the masking effects between the different spatial orientations.
In this paper, a local method for binocular rivalry evaluation is proposed that quantifies the quality of the binocular rivalry between the viewed left channel and right channel. Also, the depth presence is quantified using the disparity map, which gives the apparent motion between corresponding pixels in the left and right images.

To fuse the three proposed components in a perceptually driven manner, two mobile 3D video databases and related subjective tests are used [15, 16]. Earlier subjective studies aimed to set more precise limits for acceptance of the quality experienced when both the compression artifacts using different 3D video coding methods and varying amounts of depth are presented. They have also taken a more systematic approach to the examination of depth versus compression artifacts by varying a dense set of parameters that influence quality. In the first mobile 3D video database, the number of compression artifacts has been varied by selecting five quantization parameters (QPs) and the strength of the depth effect was varied by selecting two camera baseline ranges. The video sequences in the second 3D video database have been encoded using four different coding methods, including H.264/AVC Simulcast, H.264/AVC multiview video coding (MVC), mixed resolution stereo coding (MRSC), and video plus depth (V + D). The encoding parameters have been chosen in accordance with the settings of the prospective system for mobile 3D video delivery [15] to evaluate the perceived quality provided by each type of content. The combinations of the quality features according to our model, leading to the quality metric, are tested on both databases. The results show that this metric outperforms the current popular metrics over different 3D video formats and compression methods.

2 Image processing channel in stereo vision

A simplified model of the stereoscopic HVS is presented in Figure 1. The model follows the main functional stages of binocular vision, as discussed in [1]. In the first stage, the light captured by the eyes is processed separately in each eye. A set of perceptual HVS properties are produced by this processing, including light adaptation, contrast sensitivity, and low chromatic resolution. These properties can be modeled by luminance masking, conversion to a perceptual color space, and CSF-based masking, as shown in Figure 2. In the next stage, the visual information passes through the lateral geniculate nucleus (LGN), where the inputs from both eyes are processed together. It is assumed that the LGN decorrelates the stereoscopic signal and then forms the so-called cyclopean view [17]. The visual information is then fed to the V1 brain center, which is sensitive to patches with different spatial frequencies and orientations. The processes in the LGN and the V1 center can be modeled as multichannel decomposition, followed by binocular, spatial and temporal masking, as shown in Figure 2[17].

Figure 1

Model of the optical path.

Figure 2

Image processing channel for realization of the model.

The perceptual properties of the binocular vision suggest that the visual information is simultaneously processed in two different pathways, as shown in Figure 3. One pathway performs a fusion process using the binocular information to form a cyclopean view, which is a 2D representation of the scene as if it was observed from a virtual point that appears between the eyes [1]. During fusion, the HVS attempts to reconstruct details that are available to one eye only, which allows the observer to reconstruct any partially occluded details of the scene. The other pathway compares the images that have been projected onto each retina and extracts the distance information (also known as binocular depth cues [17]). Larger differences between the retinal images result in a more pronounced binocular depth. However, if these differences are too large, the images from the two eyes cannot be fused, and instead of the cyclopean view, the HVS perceives binocular rivalry [18]. Binocular rivalry is one of the major sources of visual discomfort in 3D video. This phenomenon can be caused by several factors, including physical misalignments, luminance, color, reflection, hyperconvergence, hyperdivergence, and ghosting [19].

Figure 3

Model of the binocular fusion and depth extraction process.

Based on this model, we assume that the quality of a 3D image is perceived as a combination of two components: the quality of the cyclopean view, and the quality of the binocular image. The subjective experiments in [15] show that the presence of depth influences the perceived quality, and this influence can be either positive or negative, depending on the content. As described in [1], the same amount of blockiness is graded differently in scenes with differently pronounced depths. The presence of stereoscopic depth also affects the perceived overall quality. Larger binocular differences will increase the perceived binocular depth but may also reduce the quality of the cyclopean view. This effect is not monotonic, which indicates that there might be an ‘optimal’ global depth for a 3D scene on portable autostereoscopic displays, at which the HVS has the lowest sensitivity to any cyclopean image degradation.

3 Feature-based quality estimation

In this section, we propose a new 3D QA model composed of three components: the quality of the cyclopean view, the prominence of the binocular depth, and the presence of binocular rivalry. The block diagram of our model is shown in Figure 4. We select a set of features that (potentially) quantify each quality component. Combinations of these features are then matched against the MOS that were obtained from subjective quality tests.

Figure 4

Block diagram of proposed 3D metric.

3.1 Cyclopean view assessment

The quality of the cyclopean view can be measured in a full-reference setting. The first step is to create the cyclopean views of the reference and the distorted stereo pairs. When both cyclopean views are available, we can compare the structural differences between the two cyclopean views using an ordinary full-reference 2D quality metric.

One way to create the cyclopean view is to generate a dense disparity map of the stereo pair and reconstruct the view from an intermediate observation position. In case the corresponding pixels in the two observations have different colors or intensities, the mean values of both properties are taken. This roughly corresponds to the way that the cyclopean view is fused by the HVS in the absence of stereoscopic rivalry [1]. However, rendering of the intermediate camera involves the interpolation of pixels from both views. To reduce the influence of any interpolation errors, we can fuse the two views and reconstruct an observation that matches one of the views. This can be done by warping one of the views onto the other - for example, by rendering the right view using the left view pixels and the disparity map - and then fusing the two views. Because we aim to assess the structural differences between the two cyclopean views, we assume that this transformation would still allow any distortions in either view to be quantified. Wherever occlusions occur, the available pixels from the opposite view are used. In our approach, we calculate a dense disparity map and an occlusion map between the left and right images using a color-weighted local search method [20].

Using the disparity map, the pixels in the right channel are then mapped to their positions in the left channel, which is denoted here as a ‘right to left’ mapping, i.e., R2L:

$$\mathrm{R2L}(x, y) = I_R\big(x + \Delta(x, y),\, y\big), \quad x = 1, \dots, N, \quad y = 1, \dots, M,$$

where $(x, y)$ indicates the pixel location, $M \times N$ is the image size of one channel, $I_R$ is the image from the right channel, and $\Delta(x, y)$ is the pixel shift for the pixel at position $(x, y)$. Occluded pixels are handled by replacing them with the corresponding pixels from the left image:

$$\widetilde{\mathrm{R2L}}(x, y) = \begin{cases} I_L(x, y), & \text{if } \Omega(x, y) = 1 \\ \mathrm{R2L}(x, y), & \text{otherwise}, \end{cases}$$

where $I_L$ is the left image, $\Omega$ is the binary occlusion map, and $\Omega(x, y) = 1$ marks the occluded pixels. The final cyclopean view, $I_{\mathrm{cyc}}$, is generated as the mean of the left image and the mapped image from the right image:

$$I_{\mathrm{cyc}} = \frac{I_L + \widetilde{\mathrm{R2L}}}{2}.$$

The cyclopean view formation process is shown in Figure 5, and an example of the cyclopean view obtained is given in Figure 6.

Figure 5

Flow chart for generation of the cyclopean view.

Figure 6

An illustration example of forming a cyclopean view. (a) Left. (b) Right. (c) Disparity map. (d) Right to left. (e) Occlusion map. (f) Updated right to left. (g) Cyclopean view.
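The fusion steps described above (warp the right view onto the left one, patch occlusions from the left view, then average) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `cyclopean_view`, the grayscale input, the integer-valued disparity, and the clipping of shifted coordinates at the image border are all assumptions that the text does not specify.

```python
import numpy as np

def cyclopean_view(left, right, disparity, occlusion):
    """Fuse a stereo pair into a cyclopean view aligned with the left image.

    left, right : 2-D float arrays of shape (M, N) (grayscale is an assumption)
    disparity   : integer horizontal shift for every left-image pixel, shape (M, N)
    occlusion   : boolean map, True where a left-image pixel is occluded
    """
    rows, cols = left.shape
    y, x = np.indices((rows, cols))
    # R2L(x, y) = I_R(x + disparity(x, y), y). Clipping at the border is an
    # assumption; the text does not say how out-of-range shifts are handled.
    x_src = np.clip(x + disparity, 0, cols - 1)
    r2l = right[y, x_src]
    # Occluded pixels are replaced by the co-located pixels of the left image.
    r2l = np.where(occlusion, left, r2l)
    # The cyclopean view is the mean of the left image and the warped right image.
    return (left + r2l) / 2.0
```

When the right view is an exact shifted copy of the left view and the occlusion map covers the pixels without a match, the fused result equals the left image, which is a convenient sanity check.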

When the cyclopean view is obtained, we then apply three quality evaluation models. Hereafter, we use the notation QA to denote any quality assessment measure that compares the similarity (dissimilarity) between images or image blocks. Specific QAs are discussed in Section 3.4, where they are indexed (e.g., QA1, QA2, …) to denote the particular assessment measure.

The first model assumes QA on a global basis:

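$$Q_1^{\mathrm{CV}} = \mathrm{QA}\big(I_{\mathrm{ref}}^{\mathrm{cyc}},\, I_{\mathrm{dis}}^{\mathrm{cyc}}\big),$$

where $I_{\mathrm{ref}}^{\mathrm{cyc}}$ and $I_{\mathrm{dis}}^{\mathrm{cyc}}$ are the cyclopean images that were obtained from the reference and distorted stereo pairs, respectively.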

The second model evaluates the cyclopean view in a block-by-block fashion, as shown in Figure 7. In the left channel, an 8 × 8-sized reference block A starting at coordinates (i, j) is selected. The corresponding block in the disparity map is denoted by $\Delta_{ij}$. In the right channel, the block with the same coordinates (i, j) is marked B′. Using the disparity map, the corresponding block B is then found by taking the median of the disparity values in the disparity patch $\Delta_{ij}$:

$$\hat{d} = \operatorname{median}\big(\{\Delta_{ij}\}_{8 \times 8}\big),$$

where $\Delta_{ij}$ is the disparity patch with coordinates (i, j), $\{\cdot\}_{8 \times 8}$ indicates an 8 × 8 block, and $\hat{d}$ can be a positive value, zero, or a negative value. The model assumes that the quality of the block is represented by the quality of the better channel of the two,

$$Q_2^{\mathrm{CV}} = \frac{\sum_{i=1}^{N_{\mathrm{blk}}} \max\big(q_i^L,\, q_i^R\big)}{N_{\mathrm{blk}}},$$

where $N_{\mathrm{blk}}$ is the number of blocks, and $q_i^L$ and $q_i^R$ are the quality scores of the left and right channels, respectively,

$$q_i^L = \mathrm{QA}\big(A_{\mathrm{ref}}, A_{\mathrm{dis}}\big), \qquad q_i^R = \mathrm{QA}\big(B_{\mathrm{ref}}, B_{\mathrm{dis}}\big),$$

where $A_{\mathrm{ref}}$ is the reference block in the original left image $I_{\mathrm{ref}}^L$, $A_{\mathrm{dis}}$ is the corresponding block in the distorted left image $I_{\mathrm{dis}}^L$, and $B_{\mathrm{ref}}$ and $B_{\mathrm{dis}}$ are the corresponding blocks in $I_{\mathrm{ref}}^R$ and $I_{\mathrm{dis}}^R$, respectively.

Figure 7

Location of similar blocks for the second binocular rivalry quality model.

The third model closely follows the second model but assumes that the block quality is represented by the average of the quality values measured in the left and right channels:

$$Q_3^{\mathrm{CV}} = \frac{\sum_{i=1}^{N_{\mathrm{blk}}} \big(q_i^L + q_i^R\big)/2}{N_{\mathrm{blk}}}.$$
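The second and third models differ only in how the per-block scores of the two channels are pooled, so they can be computed in one pass. The following is a minimal sketch under several assumptions that are not stated in the text: blocks tile the image without overlap, the right-channel block is found by shifting the column index by the rounded median disparity (clamped to the image), and the placeholder measure `neg_mse` is a similarity-type QA (larger is better), which is what taking the maximum as "the better channel" presupposes. The names `block_cv_scores` and `neg_mse` are hypothetical.

```python
import numpy as np

def neg_mse(a, b):
    # Placeholder QA: negated MSE so that a larger score means better quality.
    return -float(np.mean((a - b) ** 2))

def block_cv_scores(ref_l, dis_l, ref_r, dis_r, disparity, qa=neg_mse, bs=8):
    """Return (Q2, Q3): max-pooled and mean-pooled block quality scores."""
    rows, cols = ref_l.shape
    best, avg = [], []
    for i in range(0, rows - bs + 1, bs):
        for j in range(0, cols - bs + 1, bs):
            # Median disparity of the patch gives the shift of block B.
            d = int(round(float(np.median(disparity[i:i + bs, j:j + bs]))))
            jr = min(max(j + d, 0), cols - bs)  # clamp (assumption)
            q_l = qa(ref_l[i:i + bs, j:j + bs], dis_l[i:i + bs, j:j + bs])
            q_r = qa(ref_r[i:i + bs, jr:jr + bs], dis_r[i:i + bs, jr:jr + bs])
            best.append(max(q_l, q_r))    # second model: better channel
            avg.append((q_l + q_r) / 2)   # third model: channel average
    return float(np.mean(best)), float(np.mean(avg))
```

If a dissimilarity-type measure such as the plain MSE were plugged in instead, the "better channel" would correspond to the minimum rather than the maximum.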

3.2 Binocular rivalry assessment

Binocular rivalry occurs when the eyes attempt to converge on a single point in a scene, but the images seen by the two eyes are not sufficiently similar. Binocular rivalry can occur naturally in a complex 3D scene with numerous occlusions. However, the presence of severe artifacts in only one of the channels can cause unnatural binocular rivalry, which is perceived as a severe stereoscopic artifact. Binocular rivalry can be measured in a non-reference setting, i.e., by analyzing the distorted pair only. We assume that regardless of whether or not the rivalry is present in the original pair, its presence in the distorted pair would be equally disturbing. In our approach, we use the dense depth map to find the corresponding blocks in the two channels and measure the cumulative difference between the corresponding blocks, as follows:

$$Q^{\mathrm{BR}} = \frac{\sum_{i=1}^{N_{\mathrm{blk}}} \mathrm{QA}\big(A_{\mathrm{dis}}, B_{\mathrm{dis}}\big)}{N_{\mathrm{blk}}}.$$
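As a rough illustration of the rivalry measure, the sketch below averages a block-wise comparison between the distorted left channel and the disparity-compensated distorted right channel. The non-overlapping tiling, the rounding and clamping of the median disparity, and the use of the MSE as the comparison function are assumptions made for the example; the function names are hypothetical.

```python
import numpy as np

def mse(a, b):
    # Example comparison function (a dissimilarity: 0 means identical blocks).
    return float(np.mean((a - b) ** 2))

def binocular_rivalry(dis_l, dis_r, disparity, qa=mse, bs=8):
    """Average QA(A_dis, B_dis) over corresponding blocks of a distorted pair."""
    rows, cols = dis_l.shape
    scores = []
    for i in range(0, rows - bs + 1, bs):
        for j in range(0, cols - bs + 1, bs):
            # Locate block B in the right channel via the median disparity.
            d = int(round(float(np.median(disparity[i:i + bs, j:j + bs]))))
            jr = min(max(j + d, 0), cols - bs)  # clamp (assumption)
            scores.append(qa(dis_l[i:i + bs, j:j + bs],
                             dis_r[i:i + bs, jr:jr + bs]))
    return float(np.mean(scores))
```

No reference pair is needed: the score depends only on how well the two distorted channels agree after disparity compensation.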

3.3 Binocular depth assessment

In this paper, we evaluate the presence of the binocular depth by estimation of the dense depth map for the stereo pair. We calculate a dense disparity map using the color-weighted local-window method described in [20]. The quality of depth $Q^{\mathrm{DQ}}$ is then studied as follows:

$$Q_1^{\mathrm{DQ}} = \mathrm{QA}\big(\Delta_{\mathrm{ref}}, \Delta_{\mathrm{dis}}\big),$$

where $\Delta_{\mathrm{ref}}$ is the disparity map from the original stereoscopic image, and $\Delta_{\mathrm{dis}}$ is the disparity map from the distorted stereoscopic image. Here, QA denotes a QA function that uses one of the candidate features, as described in Section 3.4.

3.4 Candidate features

Each of the three quality components described above relies on a comparison function denoted by QA. However, the data that are compared are not in the same modality in each case; in one case, we measure the similarity between the images, while in another we compare disparity maps. These cases are interpreted in different ways by the HVS, and the optimum similarity measure would be different for each case. To determine the most suitable measure in each case, we have selected and tested ten state-of-the-art QA methods.

We denote the original input image (block) by u and the distorted image (block) by v. The first quality feature is calculated based on the mean squared error (MSE), which is the most popular difference metric used in image and video processing:

$$\mathrm{QA}_1(u, v) = \frac{1}{MN} \sum_{i} \sum_{j} \big(u_{ij} - v_{ij}\big)^2.$$

The MSE is chosen because it is simple to calculate, has clear physical meaning, and is mathematically convenient in the context of optimization.

The second quality feature is the gradient-normalized sum-of-squared-difference (SSD) [21]. The result is normalized with reference to the gradient map and is calculated as the mean of the SSD. Any local intensity variations in the textured areas between u and v are thus penalized:

$$\mathrm{QA}_2(u, v) = \frac{1}{MN} \sum_{i} \sum_{j} \frac{\big(u_{ij} - v_{ij}\big)^2}{\big(\nabla u_{ij}\big)^2 + 1},$$

where $\nabla u_{ij}$ is the gradient value of input signal u.
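The first two candidate features are simple enough to write down directly. In the sketch below, the gradient is approximated with `numpy.gradient` (central differences) and its squared magnitude is used in the denominator; the gradient operator actually used in [21] is not given in the text, so this choice is an assumption, as are the function names.

```python
import numpy as np

def qa1_mse(u, v):
    """Mean squared error between reference u and distorted v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.mean((u - v) ** 2))

def qa2_gradient_normalized_ssd(u, v):
    """Squared differences divided by (squared gradient magnitude of u + 1)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    gy, gx = np.gradient(u)        # derivatives along rows and columns
    grad_sq = gx ** 2 + gy ** 2    # squared gradient magnitude (assumption)
    return float(np.mean((u - v) ** 2 / (grad_sq + 1.0)))
```

On a flat reference the gradient vanishes and both measures coincide; on a ramp with slope 3 and a constant offset of 2, the second measure gives 4 / (9 + 1) = 0.4.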

Many studies have confirmed that the HVS is more sensitive to low-frequency distortions than to high-frequency ones. It is also very sensitive to contrast changes and noise. Therefore, the third measure aims to remove the mean shifting and contrast stretching in the manner shown in [22]. The measure is calculated in 8 × 8 blocks and uses the decorrelation properties of the block DCT and the effects of the individual DCT coefficients on the overall perception:

$$\mathrm{QA}_3 = \frac{1}{N_{\mathrm{blk}}} \sum_{i=1}^{M-7} \sum_{j=1}^{N-7} E_w(u - v),$$

$$E_w(u) = \frac{1}{64} \sum_{i=1}^{8} \sum_{j=1}^{8} \mathrm{DCT}(u)_{ij}^2 \, (T_c)_{ij},$$

where $T_c$ is the matrix of correction factors for each of the 8 × 8 DCT coefficients, which was normalized based on the JPEG quantization table in [22].

The fourth quality measure is inspired by [23], which extends [22] by taking the CSF and the between-coefficient contrast masking of the DCT basis functions into account. As in [22], the measure operates on the DCT coefficients of 8 × 8 pixel blocks. For each DCT coefficient, the model computes the maximum distortion that remains invisible because of the between-coefficient masking. It is assumed that the masking degree of each coefficient $\mathrm{DCT}(u)_{ij}$ depends on its squared value (power) and on the sensitivity of the human eye to this DCT basis function, as determined using the CSF. Several basis functions can jointly mask one or several other basis functions, in which case their masking effect depends on the sum of their weighted powers [23]. The final formula is expressed as follows:

$$\mathrm{QA}_4 = \frac{1}{N_{\mathrm{blk}}} \sum_{i=1}^{M-7} \sum_{j=1}^{N-7} E_w(u - v)\, \mathrm{MaskEff},$$

where MaskEff is the reduction of the masking and contrast operation given in [23].

The fifth measure is based on the feature similarity index (FSIM) method proposed in [24]. FSIM was designed to compare the low-level feature sets of the reference image and the distorted image. Phase congruency (PC) and the gradient magnitude (GM) are used in FSIM and play complementary roles in the characterization of the local image quality. The measure is defined as

$$\mathrm{QA}_5 = \frac{\sum_{i} \sum_{j} S_L(u_{ij}, v_{ij}) \, PC_m(u_{ij}, v_{ij})}{\sum_{i} \sum_{j} PC_m(u_{ij}, v_{ij})},$$

$$PC_m(x, y) = \max\big(PC(x), PC(y)\big),$$

where PC is the phase congruency operation of [25], and $S_L(u, v)$ is the similarity map formed by combination of the similarities of PC and GM as $S_L = S_{PC} \times S_{GM}$. $S_{PC}$ and $S_{GM}$ are calculated as

$$S_{PC}(u, v) = \frac{2\, PC(u)\, PC(v) + T_1}{PC^2(u) + PC^2(v) + T_1},$$

$$S_{GM}(u, v) = \frac{2\, GM(u)\, GM(v) + T_2}{GM^2(u) + GM^2(v) + T_2},$$

where $T_1$ and $T_2$ are positive constants. In our work, in addition to the compound measure QA5, we also consider the individual components, i.e., the PC and the GM, separately, and thus form the sixth and seventh measures, respectively:

$$\mathrm{QA}_6 = \frac{\sum_{i} \sum_{j} S_{PC}(u_{ij}, v_{ij}) \, PC_m(u_{ij}, v_{ij})}{\sum_{i} \sum_{j} PC_m(u_{ij}, v_{ij})},$$

$$\mathrm{QA}_7 = S_{GM}(u, v).$$
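To make the gradient-magnitude component concrete, the following sketch evaluates the GM similarity map and pools it with a plain mean. Several choices here are assumptions rather than details taken from the text: the gradient is computed with `numpy.gradient` instead of the operator used in [24], the default `t2=160.0` is an illustrative constant, and mean pooling is used because the text does not say how the map is reduced to a single score. The function names are hypothetical.

```python
import numpy as np

def gradient_magnitude(img):
    """Gradient magnitude from central differences (illustrative operator)."""
    gy, gx = np.gradient(np.asarray(img, dtype=float))
    return np.sqrt(gx ** 2 + gy ** 2)

def qa7_gm_similarity(u, v, t2=160.0):
    """Mean of the gradient-magnitude similarity map S_GM(u, v)."""
    gm_u = gradient_magnitude(u)
    gm_v = gradient_magnitude(v)
    # S_GM = (2 GM(u) GM(v) + T2) / (GM(u)^2 + GM(v)^2 + T2), per pixel
    s_gm = (2.0 * gm_u * gm_v + t2) / (gm_u ** 2 + gm_v ** 2 + t2)
    return float(s_gm.mean())  # mean pooling is an assumption
```

The map equals 1 wherever the two gradient magnitudes agree and drops below 1 as they diverge, with the constant keeping the ratio stable in flat regions.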

The SSIM metric proposed in [26] is considered in the formation of the eighth candidate quality measure. The measure is composed using the luminance comparison l(u,v), the contrast comparison c(u, v) and the structure comparison s(u,v), as follows:

$$\mathrm{QA}_8 = l(u, v)\, c(u, v)\, s(u, v),$$

$$l(u, v) = \frac{2 \mu_u \mu_v + c_1}{\mu_u^2 + \mu_v^2 + c_1},$$

$$c(u, v) = \frac{2 \sigma_u \sigma_v + c_2}{\sigma_u^2 + \sigma_v^2 + c_2},$$

$$s(u, v) = \frac{\sigma_{uv} + c_3}{\sigma_u \sigma_v + c_3},$$

where $\mu_u$ and $\mu_v$ are the means of u and v, respectively, $\sigma_u^2$ and $\sigma_v^2$ are the variances of u and v, respectively, $\sigma_{uv}$ is the covariance of u and v, $c_1$ and $c_2$ are two constants used to stabilize the division with a weak denominator, and $c_3 = c_2 / 2$. In this paper, QA9 is defined as the luminance comparison, $\mathrm{QA}_9 = l(u, v)$, and

$$\mathrm{QA}_{10} = \frac{2 \sigma_{uv} + c_2}{\sigma_u^2 + \sigma_v^2 + c_2},$$

which is a simplified formula for $c(u, v) \times s(u, v)$, as shown in [26].
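The three SSIM-derived features can be illustrated with global image statistics. The sketch below uses the standard SSIM contrast term, with the product of the standard deviations in the numerator, which is the form for which the product of the contrast and structure terms collapses to the simplified expression when c3 = c2 / 2. The constants default to the values commonly used for 8-bit images, (0.01 · 255)² and (0.03 · 255)²; whether the paper uses these values, and whether it computes the statistics globally or in local windows, is not stated, so both are assumptions. The function name is hypothetical.

```python
import numpy as np

def ssim_features(u, v, c1=6.5025, c2=58.5225):
    """Return QA8, QA9 and QA10 computed from global statistics of u and v."""
    u = np.asarray(u, dtype=float).ravel()
    v = np.asarray(v, dtype=float).ravel()
    mu_u, mu_v = u.mean(), v.mean()
    var_u, var_v = u.var(), v.var()
    cov_uv = float(np.mean((u - mu_u) * (v - mu_v)))
    sd_prod = float(np.sqrt(var_u * var_v))
    c3 = c2 / 2.0
    lum = (2 * mu_u * mu_v + c1) / (mu_u ** 2 + mu_v ** 2 + c1)
    con = (2 * sd_prod + c2) / (var_u + var_v + c2)
    struct = (cov_uv + c3) / (sd_prod + c3)
    qa10 = (2 * cov_uv + c2) / (var_u + var_v + c2)  # equals con * struct
    return {"QA8": lum * con * struct, "QA9": lum, "QA10": qa10}
```

For identical inputs all three features equal 1, and for any pair of inputs QA8 equals the product of QA9 and QA10 up to rounding.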

3.5 Machine learning methods for feature fusion

As described in the previous sections, the proposed quality approach aims to combine three different measures, by separately measuring the quality of the cyclopean view, the binocular rivalry, and the presence of depth. The limited knowledge of the subjective quality perception of 3D images means that it is not possible to predict which of the QA models will produce the best correlation with the subjective scores. Therefore, to find the best combination of quality measures and image features, we adopt a machine learning approach.

We assume that the best combination of features can be found by linear regression. Given a set of quality measures φ(k,l), the MOS over a set of test videos Θ k are predicted using linear combinations where

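$$\Theta_k = \hat{\theta}_0 + \sum_{l=1}^{L} \varphi(k, l)\, \hat{\theta}_l,$$

or, in matrix form,

$$\Theta = \begin{bmatrix} \Theta_1 \\ \Theta_2 \\ \vdots \\ \Theta_K \end{bmatrix} = \begin{bmatrix} 1 & \varphi(1,1) & \varphi(1,2) & \cdots & \varphi(1,L) \\ 1 & \varphi(2,1) & \varphi(2,2) & \cdots & \varphi(2,L) \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & \varphi(K,1) & \varphi(K,2) & \cdots & \varphi(K,L) \end{bmatrix} \begin{bmatrix} \hat{\theta}_0 \\ \hat{\theta}_1 \\ \vdots \\ \hat{\theta}_L \end{bmatrix},$$

where the vector $\Theta$ represents the subjective scores, L is the number of quality measures, K is the number of test stimuli (videos), and $\hat{\theta}_{0, 1, 2, \dots, L}$ are the parameters of the model. The above linear model in vector form can also be rewritten as an inner product:

$$\Theta = \varphi^T \hat{\theta}.$$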

To fit the linear model to a set of training data, θ ^ is normally determined using the least squares method [27]:

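$$f(\theta) = \sum_{i=1}^{K} \big(\Theta_i - \varphi_i^T \theta\big)^2 = \big(\Theta - \varphi^T \theta\big)^T \big(\Theta - \varphi^T \theta\big),$$

where $f(\theta)$ is the cost function, and $\theta$ can be chosen to minimize $f(\theta)$ by setting its gradient to zero:

$$\nabla_\theta f(\theta) = \nabla_\theta \big(\Theta - \varphi^T \theta\big)^T \big(\Theta - \varphi^T \theta\big) = \varphi^T \Theta - \varphi^T \varphi\, \theta = 0.$$

Solving for $\theta$ gives

$$\theta = \big(\varphi^T \varphi\big)^{-1} \varphi^T \Theta,$$

where the array of quality measures $\varphi$ is formed by the proposed 3D quality models, $\varphi = [Q^{\mathrm{CV}}, Q^{\mathrm{BR}}, Q^{\mathrm{DQ}}]$. In the last two expressions, the constant factor −2 of the gradient is dropped, and $\varphi$ is read as the K × (L + 1) matrix of the matrix form above, whose k-th row is $(1, \varphi(k,1), \dots, \varphi(k,L))$.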
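The least squares fit can be sketched with NumPy as follows. The design matrix is built with one row per stimulus and a leading column of ones for the intercept, which is an interpretation of the matrix form given earlier. The solver `numpy.linalg.lstsq` is used instead of explicitly inverting the normal equations because it is numerically more robust; both give the same parameters when the matrix has full column rank. The function names and the synthetic data in the usage lines are purely illustrative.

```python
import numpy as np

def design_matrix(features):
    """Prepend a column of ones to a (K, L) array of quality measures."""
    features = np.asarray(features, dtype=float)
    return np.column_stack([np.ones(len(features)), features])

def fit_linear_model(features, mos):
    """Least squares estimate of (theta_0, ..., theta_L)."""
    phi = design_matrix(features)
    theta, *_ = np.linalg.lstsq(phi, np.asarray(mos, dtype=float), rcond=None)
    return theta

def predict_mos(features, theta):
    """Predicted scores for a (K, L) array of quality measures."""
    return design_matrix(features) @ theta

# Usage with synthetic numbers (not data from the paper):
rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 3))  # columns stand in for Q_CV, Q_BR, Q_DQ
scores = 2.0 + feats @ np.array([0.5, -1.0, 3.0])
theta_hat = fit_linear_model(feats, scores)
```

With noise-free synthetic scores, the fit recovers the generating parameters, and the result matches the closed-form normal-equation solution.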

Efficient solution of the linear system in matrix form (Equation 28) using the least squares estimate (Equation 32) requires a simple, reasonable, and efficient quality measure array and the use of subjective scores from properly conducted subjective experiments. The subjective experiments are described in Section 4.

4 Mobile 3D video test content and related subjective tests

Two mobile 3D video databases and their corresponding subjective tests have been used for this study. The first database, denoted by ‘3D database I’, contains four stereoscopic videos, called Akko&Kayo, Champagne Tower, Pantomime, and LoveBirds1, with varying levels of compression artifacts and depth presence [16]. Thumbnails of the videos in this database are shown in Figure 8. The database has 60 videos and consists of four scenes; each scene is captured in stereo using three different baselines, and each captured video is compressed by an H.264 encoder using five different values for the QP.

Figure 8

Contents of 3D video database I.

The original videos are high-resolution multiview videos. They have been converted into stereo videos with lower resolution by suitable rescaling. To maintain the variable depth levels, each video sequence has been retargeted by selecting different camera pairs from the available multiview video sequences. For all sequences, the left camera has been retained, while the right camera was selected at two different depth levels called the short baseline and the wide baseline. In addition, a monoscopic video sequence was generated by repeating the left channel sequence in the place of the right channel sequence. This would effectively present a 2D view with no depth effects on the 3D display. The short baseline produces a 3D scene within a limited disparity range but with visible 3D effects. The wide baseline provides an optimal disparity range for the mobile stereoscopic display by setting the right camera position. All sequences were then downscaled to lower resolutions for the target display device. After that, each video was encoded using the H.264/AVC Simulcast method in intra-frame mode. The QP was selected in the [25, 30, 35, 40, 45] range and compression was independently applied to the left and right channels.

Thirty-two observers were involved in the subjective tests and were equally distributed in terms of gender with an age range between 18 and 37. The test materials were presented one by one in a pseudo-random order. The display device used was an autostereoscopic screen with a resolution of 428 × 240 pixels per view. After each clip, the test participants were asked to provide overall quality scores on a scale from 0 to 10 and indicate the acceptability of the quality for viewing the mobile 3D video on a yes/no scale. At the beginning of each session, a training set of seven clips was shown. Each test stimulus was shown twice during the test. A set of dummy videos was also shown at the beginning and in the middle of each test session. A total of 164 video clips were shown to each observer [15]. The overall ratings of the stereoscopic videos were finally ranked in terms of their MOS.

The second database contains six different videos spanning different genres of mobile 3DTV and video: these videos are Bullinger, Butterfly, Car, Horse, Mountain, and Soccer2, as shown in Figure 9. This set of videos is intended to represent a range of stereoscopic videos with different content properties, including varying spatial details, temporal changes, and depth complexity. Each video sequence lasts 10 s.

Figure 9 Contents of 3D video database II.

The sequences were encoded using four different methods: H.264/AVC Simulcast, H.264/AVC MVC, MRSC, and V + D. The encoding parameters were chosen as shown in Table 1 [15]. Coding was carried out using two codec profiles: the baseline profile and the high profile. The simple baseline profile uses an IPPP prediction structure and context-adaptive variable-length coding (CAVLC) [28] for entropy coding. The group of pictures (GOP) size was set at 1. This profile represents the low-complexity encoder for mobile devices. The more complex high profile enables hierarchical B-frames with a GOP size of 8 and context-based adaptive binary arithmetic coding (CABAC) for entropy coding. Because of the variable compressibilities of the different sequences, individual bit rate points were determined for each sequence [15]. The QP of the codec was set at 30 for high quality and 37 for low quality. In total, the database has six reference sequences and 96 distorted 3D video sequences.

Table 1 Codec settings of the two profiles

Subjective tests were carried out with 87 test participants who were evenly divided in terms of gender, with ages ranging between 16 and 37 years. The visualization process followed the same test procedure and used the same autostereoscopic display as the tests with 3D database I. The MOS for both tests are on the same scale.

5 Feature selection

Both subjective experiments were performed following the same protocol and using the same device and the same quality evaluation scale. Therefore, we were able to combine the entries from the two databases into a single group of opinion scores within the same scale. We picked 70% of the entries by random selection to form a training set. The rest of the entries were included in a test set. We measured the prediction performances of the different feature groups using the Spearman rank-order correlation coefficient (SROCC). The SROCC output is in the [-1, 1] range, where a higher absolute value of SROCC indicates a stronger monotonic relationship between the MOS and the values that were predicted using the metric.
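As a minimal illustration of the split and of the correlation measure, the following Python sketch draws a random 70% training subset and computes the SROCC as the Pearson correlation of tie-averaged ranks. The function names (`split_entries`, `average_ranks`, `pearson`, `srocc`) and the fixed random seed are assumptions made for this example; they are not taken from the original experiments.

```python
import math
import random


def split_entries(entries, train_fraction=0.7, seed=0):
    """Randomly split entries into a training set and a test set."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary choice
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences (non-constant input)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def srocc(predicted, mos):
    """Spearman rank-order correlation between predictions and MOS."""
    return pearson(average_ranks(predicted), average_ranks(mos))
```

Because only ranks enter the computation, any strictly increasing relation between the predicted values and the MOS yields an SROCC of 1, however nonlinear that relation is.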

The set of feature candidates consists of 50 items, numbered from 1 to 50. There are three quality components: the cyclopean view (denoted by QCV), the binocular rivalry (QBR), and the depth quality (QDQ). The quality of the cyclopean view is assessed using three alternative approaches: global comparison ($Q^{1}_{CV}$), block-wise selection of the better channel ($Q^{2}_{CV}$), and the block-wise average ($Q^{3}_{CV}$). A set of ten measures was applied to each quality component, which gives five groups of ten candidates. The feature candidates are listed in the first row of Table 2. The measures are listed in the first column of the same table. For example, 1 indicates the quality assessment QA1 under cyclopean view model 1, i.e., {$Q^{1}_{CV}$, QA1}; 33 indicates the quality assessment QA3 under the binocular rivalry model, i.e., {QBR, QA3}. The quality measures that are not relevant to the comparison of the depth maps are excluded from the experiments. These combinations are marked with a dash in Table 2.
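The numbering scheme can be made explicit with a small helper. The sketch below assumes that the 50 candidates are laid out as five consecutive groups of ten (three cyclopean view models, then binocular rivalry, then depth quality), which is consistent with the examples given for candidates 1 and 33; the labels and the function name `decode_feature` are illustrative only.

```python
# Assumed group order, inferred from the examples for candidates 1 and 33.
COMPONENTS = ["Q1_CV", "Q2_CV", "Q3_CV", "Q_BR", "Q_DQ"]
MEASURES_PER_COMPONENT = 10


def decode_feature(index):
    """Map a candidate number (1..50) to (component label, QA measure)."""
    total = len(COMPONENTS) * MEASURES_PER_COMPONENT
    if not 1 <= index <= total:
        raise ValueError(f"feature index must be in 1..{total}")
    component = COMPONENTS[(index - 1) // MEASURES_PER_COMPONENT]
    measure = (index - 1) % MEASURES_PER_COMPONENT + 1
    return component, f"QA{measure}"
```

Under this assumption, `decode_feature(33)` gives the binocular rivalry component with QA3, in line with the example in the text.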

Table 2 Spearman correlations of each quality feature and each quality component

We use regression fitting to measure the performances of the individual features. First, the output of each candidate feature listed in Section 3.4 was normalized to the range [-10, 10], using logistic fitting as follows:

$$f(x) = \beta_1 \left( 1 + \frac{\beta_2 - \beta_3}{\beta_3 + e^{-x/\beta_4}} \right) \qquad (33)$$

The parameters β1, β2, β3, and β4 have been selected in each individual case so that the output of each feature fits into the desired range. Then we evaluate the performance of each individual feature in terms of the Spearman correlation. The results of this evaluation are given in Table 2 in columns 2 to 6. The combined performances of all quality measurements applied to a given component are shown on the bottom row of the table, and this measure is denoted by SROCC1. The results in this row indicate the applicability of a single component for use in subjective quality prediction. The combined performance values for a single quality measure when applied to all components are given in the last column of the table, which is labeled SROCC2. These results indicate the applicability of a given quality measure.

From these results, we can see that the use of a single quality component is insufficient because the quality values predicted by a single component do not correlate well with the subjective scores. The best correlation is achieved when using feature 15, i.e., {$Q^{2}_{CV}$, QA5}. Using the cyclopean view components (e.g., {$Q^{i}_{CV}$, QAj} with i = 1, 2, 3 and j = 1, ..., 10), we can achieve SROCC values of more than 0.9. This result can be interpreted as evidence that the 2D quality of the cyclopean view is a major component of the overall perceived quality.

In the next experiment, we attempted to find a combination of features and quality measures that produced a good trade-off between prediction accuracy and computational complexity. We performed a sequential feature search, looking for the best combination of n + 1 features by starting from the best combination of n features and adding one feature at a time. In this manner, we extracted 45 features before reaching an SROCC value of 0.97, as shown in Figure 10. By studying the performance improvement introduced by each added feature, we see that a combination of four or five features results in a good accuracy vs. complexity trade-off. The difference in performance between each two consecutive numbers of features is given in Figure 11, and the difference between the performances for four and five features is marked with a red circle. The first five features in the sequential search are {15,18,30,34,47}, where {15,18,30} evaluate the cyclopean view, 34 evaluates the binocular rivalry, and 47 evaluates the depth quality.
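The sequential search described above is a greedy forward selection, which can be sketched as follows. The `score` callable is a placeholder for whatever procedure fits the combined features on the training set and returns the SROCC; how that fit is performed is not specified here, so the function name and its signature are assumptions.

```python
def sequential_forward_selection(candidates, score, max_features):
    """Greedy forward search: grow the subset by the single best feature.

    candidates   -- iterable of feature numbers
    score        -- hypothetical callable mapping a list of features to a
                    performance value such as the SROCC on the training set
    max_features -- stop after this many features have been selected
    Returns a list of (selected subset, score) pairs, one per step.
    """
    selected = []
    remaining = list(candidates)
    history = []
    while remaining and len(selected) < max_features:
        # Try each remaining feature on top of the current subset.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), score(selected)))
    return history
```

The per-step scores in the returned history correspond to the kind of curve plotted against the number of features, and the differences between consecutive steps show where adding a further feature stops paying off. A greedy search of this kind is not guaranteed to find the best subset of a given size, which is one reason to cross-check it with a full search.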

Figure 10 Quality performances with sequential feature selection.

Figure 11 Performance improvements of different numbers of selected features.

The computational complexities of the best performing combinations of four or five features are shown in Table 3 and in Figure 12. The Big O complexity, the McCabe complexity, and the CPU running time for each combination are shown in Table 3. The Big O notation is used here to describe the worst-case scenario. The McCabe complexity was proposed in [29] and is also called the cyclomatic complexity or the conditional complexity. McCabe represents the control flow of the source code as a directed graph, and the complexity is calculated from the cyclomatic number of this graph, i.e., the number of linearly independent paths through the code [29]. The CPU time listed in Table 3 is the time taken to process ten images with each QAi using MATLAB 2012b on the Win64 OS with the Intel Core 2 Duo E8400 CPU. For comparison, the last row of Table 3 contains the complexity of dense depth estimation and the time it needs to calculate the disparity maps of the ten images using a search window of 50 pixels on the same computer. Disparity estimation is a step that is required for the calculation of all considered features (see Figure 4), and its computational complexity is in the same range as the complexity of the features. The McCabe complexity and the CPU time of all candidates are compared with the complexity of disparity estimation in Figure 12.
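For reference, the cyclomatic number defined by McCabe [29] for a control-flow graph G can be written as follows. The symbols E (number of edges), N (number of nodes), and P (number of connected components) are notation introduced here for illustration and do not appear in Table 3.

```latex
V(G) = E - N + 2P
```

As a small worked case, a single routine (P = 1) that contains one if-else branch has four nodes (condition, two branches, exit) and four edges, so V(G) = 4 - 4 + 2 = 2, i.e., two linearly independent paths.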

Table 3 QA computational complexity
Figure 12 QA computational complexity comparison.

To find an optimal group of features, we estimated the performances of all combinations of five and six features. Since the computational overhead for deriving the dense disparity map is the same in each case, we did not take it into account in the feature selection process. We found that 5 groups of five features and 18 groups of six features had SROCC scores that were higher than 0.93. The best performing groups of five features are listed in Table 4, and the best performing groups of six features are listed in Table 5. The complexity level of each group was calculated based on the McCabe complexities and the CPU times shown in Table 3 and Figure 12. From Table 4, we see that the previously found feature group, {15,18,30,34,47}, is not the group with the lowest complexity, with a McCabe complexity of 108 and a running time for a single image set of 6.11 s. The fastest feature group, {25,28,30,41,48}, does not contain a component that is sensitive to binocular rivalry. Therefore, by considering the complexity, the correlation performance, and the sensitivity of the metric to different artifacts, we selected the second-fastest feature group, {25,28,33,41,48}, for the final quality metric. The output of each feature was normalized according to formula (33). The weighting and normalization coefficients used for each feature are given in Table 6.
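One way to express this selection rule is as a full search that keeps only the groups above the SROCC threshold that cover all three quality components, and then takes the cheapest of them. The sketch below is an interpretation under stated assumptions: the mapping of feature numbers to components (1 to 30 cyclopean view, 31 to 40 binocular rivalry, 41 to 50 depth quality), the callables `srocc_of` and `cost_of`, and all function names are hypothetical stand-ins for the measured values in Tables 3 to 5.

```python
from itertools import combinations


def component_of(index):
    """Assumed mapping of a feature number to its quality component."""
    if index <= 30:
        return "CV"  # cyclopean view
    if index <= 40:
        return "BR"  # binocular rivalry
    return "DQ"      # depth quality


def select_group(candidates, group_size, srocc_of, cost_of, min_srocc):
    """Return the cheapest group that exceeds min_srocc and covers CV, BR, DQ.

    srocc_of and cost_of are hypothetical callables that take a tuple of
    feature numbers and return its correlation and its computational cost.
    Returns None when no group qualifies.
    """
    eligible = [
        group
        for group in combinations(candidates, group_size)
        if srocc_of(group) > min_srocc
        and {component_of(f) for f in group} == {"CV", "BR", "DQ"}
    ]
    return min(eligible, key=cost_of) if eligible else None
```

The coverage constraint is what excludes a group that is fastest overall but contains no binocular rivalry feature.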

Table 4 Computational comparisons for five quality features (full search)
Table 5 Computational comparisons for six quality features (full search)
Table 6 Linearization and weighting coefficients used in the final quality metric

This selection is also confirmed by the results of the full searches over six features. These combinations reach correlation performances of 0.93, but at considerably higher computational costs. However, we can see that the feature evaluation components from the first two groups (CV and BR) tend to dominate the best performing combinations. It should be noted that the performance is calculated based only on the training subset of test videos, and by selecting a three-component combination, we aim to provide a balanced combination for a wider, and possibly more diverse, set of videos.

6 Comparative results

The prediction performance of an objective quality metric can be evaluated in terms of accuracy, monotonicity, and association. We use the normalized root mean squared error (RMSE), the SROCC, and the Pearson linear correlation coefficient (PLCC) to quantify the corresponding performance properties of our metric. Before calculation of the correlation performance, we apply a logistic fitting function to all quality metrics under comparison.

The subjective experiments performed on the two sets of test sequences have been analyzed in [15, 16]. Some findings relevant to our current work are summarized here. The results of the subjective experiments involving 3D database I were interpreted in [15] as showing that both the artifact level and the presence of stereoscopic depth affect user acceptance of and satisfaction with the 3D video sequences. Also, according to the subjective test results for database II [16], MVC and the V + D approach provide the best subjective quality for all compression levels. We believe that a well-performing 3D quality metric should be able to predict these subjective preferences.

We compared the feature group proposed in Section 5 (i.e., {25,28,33,41,48}) with several state-of-the-art quality metrics. The results are shown in Tables 7 and 8. The metrics that were intended for 2D image quality [i.e., peak signal-to-noise ratio (PSNR), SSIM, normalized root mean squared error (NRMSE), and PSNR-HVS] have been applied separately to the left and right channels, and the final results have been averaged. In the PSNR case, the MSE values derived in each channel were averaged before the PSNR was calculated. Four metrics that predict the quality of the stereoscopic content have been included in the comparison: PHVS3D [2], PHSD [3], 3DBE [7], and the stereo metric described in [5]. The quality values for 3DBE were kindly provided by the authors of the metric in [7]. All metrics work on the luminance components of the images.
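The two averaging schemes for the 2D metrics can be sketched as follows. This is a minimal Python illustration that operates on luminance images given as nested lists; the peak value of 255 assumes 8-bit luminance, and the function names are chosen for this example only.

```python
import math


def mse(reference, distorted):
    """Mean squared error between two equally sized 2D luminance arrays."""
    total, count = 0.0, 0
    for ref_row, dist_row in zip(reference, distorted):
        for r, d in zip(ref_row, dist_row):
            total += (r - d) ** 2
            count += 1
    return total / count


def stereo_psnr(ref_left, dist_left, ref_right, dist_right, peak=255.0):
    """PSNR computed from the mean of the left and right channel MSEs."""
    mean_mse = 0.5 * (mse(ref_left, dist_left) + mse(ref_right, dist_right))
    if mean_mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mean_mse)


def averaged_metric(metric, ref_left, dist_left, ref_right, dist_right):
    """Apply a 2D metric to each channel and average the two scores."""
    return 0.5 * (metric(ref_left, dist_left) + metric(ref_right, dist_right))
```

Averaging the MSE before taking the logarithm differs from averaging two per-channel PSNR values: the former is dominated by the more distorted channel, which matters when compression is applied asymmetrically to the two views.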

Table 7 Spearman and Pearson correlations of compared metrics on 3D video database I
Table 8 Spearman and Pearson correlations of compared metrics on 3D video database II

The SROCC, PLCC, and normalized RMSE values for each compared QA on 3D databases I and II can be seen in Tables 7 and 8, respectively. For visual comparison, prediction results for databases I and II are shown in Figures 13 and 14, respectively. To quantify the performance in terms of their different aspects, the videos from both databases are grouped into several subsets. Test sequences in 3D database I are classified into three subsets based on the depth levels in Table 7: ‘mono’ is used for monoscopic sequences, and short and wide are used for stereoscopic sequences. 3D database II is grouped into four subsets based on the encoding methods used, i.e., MRSC, MVC, SIM, and V + D. The two algorithms with the best performance levels are marked in bold.

Figure 13 Logistic fitting of the compared metrics and the proposed metric on 3D video database I.

Figure 14 Logistic fitting of the compared metrics and the proposed metric on 3D video database II.

The predominant distortions in 3D database I are caused by DCT-based compression and are manifested as the blocking and smearing artifacts characteristic of harsh quantization levels. These distortions affect the cyclopean view quality and can be detected by quality metrics that are sensitive to texture degradation. PSNR-HVS produces the third best performance on the mono set, where the SROCC, PLCC, and RMSE values are 0.921, 0.917, and 0.716, respectively. PHSD and PHVS3D also correlate well with the MOS in that database. PHSD is an improved version of PHVS3D, in which the disparity errors are considered. The SSIM metric, if used separately in each channel, does not correlate well with the subjective scores of 3D database I. The proposed combination of five quality features has the best correlation with the MOS, whether compared using SROCC or PLCC. The overall SROCC, PLCC, and RMSE values reach 0.935, 0.924, and 0.684, respectively. Most QA metrics fail on the wide baseline sets. The proposed metric shows higher correlations on all subsets, and the SROCC values of the short and wide baseline subsets are quite consistent at 0.956 and 0.952, respectively.

3D database II contains a wider range of video distortions, most notably some cases of severe binocular rivalry. Such distortions do not affect large areas of the image but are immediately visible to the observer. As a result, quality metrics assessing texture quality tend to grade such cases as being of high quality, while observers grade them as having annoying artifacts. Most of the metrics included in our comparison fail on the ‘V + D’ set, particularly PSNR, PSNR-HVS, PHVS3D, and SSIM, for which the PLCC values are less than 0.1. This can be attributed to the presence of binocular rivalry artifacts which are caused by view rendering based on the estimated depths. For most videos exhibiting stereoscopic distortions, 2D metrics fail to predict the subjective scores. The overall SROCC values of PSNR and PSNR-HVS are only 0.254 and 0.227, respectively. Although the results for SSIM and NRMSE are slightly improved, their overall SROCC values are still very low. Among the 3D quality metrics, the PHVS3D metric does not perform well, but the improved PHSD version has the second best correlation with all the MOS in the database. Finally, in Table 8, we see that the metric proposed in this paper shows better performance because it is sensitive to a wider range of stereoscopic distortions.

7 Conclusions

One of the biggest challenges in 3D QA is the calculation of the QA metric in a perceptual manner. In this paper, a novel full-reference stereoscopic quality metric that is applicable to mobile 3D video has been proposed. First, we built two 3D quality databases that were annotated with subjective test results in terms of their MOS. The databases include not only compression distortions but also differently pronounced depth and 3D format conversion distortions. According to the results of subjective tests and interviews with the test participants [15], the amount of compression artifacts is the dominant factor in the evaluation of the content quality, whereas the presence of depth enhances the user experience. The viewers were very critical of the spatial quality and accepted only a low level of artifacts in the content. The 3D effect enhances the user satisfaction and acceptance of the content; however, if the content is not presented with high spatial quality, then it is declared to be less acceptable or completely unacceptable, regardless of the 3D effect.

Motivated by these results, we modeled the 3D quality using three components: the cyclopean view, binocular rivalry, and the depth quality. The cyclopean view is simulated using three models. The first model generates a single cyclopean image by globally fusing the left and right views of a scene based on the properties of human stereo vision. The second and third models are based on local fusion methods, which calculate the quality on the block level between the left and right channels using a disparity map. Dissimilar visual stimuli between the two eyes cause binocular rivalry. In our approach, the amount of binocular rivalry is quantified by comparison of only the corresponding blocks in the distorted stereoscopic pair, using the disparity map that is provided by the reference pair. The differences between the images of a scene as seen by each eye are also used to form the perceived depth. The geometrical distortions are measured directly on the disparity map (and are called the depth quality).

Several QA methods are used to assess each quality component, with tests conducted using a training set that was extracted from the two available databases. To make the quality metric simple, fast, and efficient, the feature selection for all considered QAs also takes into account their computational complexity and CPU run times. Finally, five features are selected for the three components. The cyclopean view is measured by two quality assessment methods, i.e., QA5 and QA8, which are both under the third (local) cyclopean view model; binocular rivalry is evaluated using QA3; and the depth quality is measured using the disparity map with QA1 and QA8. The experimental results have shown that the proposed metric significantly outperforms the current state-of-the-art quality metrics. We must note that our implementation does not take masking effects created by motion into account. This will be studied in our future investigations. However, our experiments to date have shown that this masking plays a minor role in estimation of the quality. This observation has been confirmed by subjective tests on still images and videos with the same content, which resulted in very similar MOS.



Abbreviations

CSF: contrast sensitivity function
CWT: complex wavelet transform
DCT: discrete cosine transform
FSIM: feature similarity index
GM: gradient magnitude
HVS: human visual system
IQA: image quality assessment
LGN: lateral geniculate nucleus
MOS: mean opinion scores
MRSC: mixed resolution stereo coding
MSE: mean squared error
MVC: multiview video coding
PC: phase congruency
PLCC: Pearson linear correlation coefficient
QA: quality assessment
QP: quantization parameters
SSIM: structural similarity index
SROCC: Spearman rank-order correlation coefficient
V + D: video plus depth


References

  1. Boev A, Poikela M, Gotchev A, Aksay A: Modeling of the stereoscopic HVS, Mobile3DTV. Technical Report D5.3. 2010. Accessed on 13 January 2014

  2. Jin L, Boev A, Gotchev A, Egiazarian K: 3D-DCT based perceptual quality assessment of stereo video. IEEE 18th International Conference on Image Processing (IEEE ICIP2011) Brussels, 11–14 September 2011

  3. Jin L, Boev A, Gotchev A, Egiazarian K: Validation of a new full reference metric for quality assessment of mobile 3DTV content. The 19th European Signal Processing Conference (EUSIPCO-2011) Barcelona, 29 August to 2 September 2011

  4. Boev A, Gotchev A, Egiazarian K, Aksay A, Akar GB: Towards compound stereo-video quality metric: a specific encoder-based framework. IEEE Southwest Symposium on Image Analysis and Interpretation 218-222. Denver, June 2006

  5. You J, Xing L, Perkis A, Wang X: Perceptual quality assessment for stereoscopic images based on 2D image quality metrics and disparity analysis. International Workshop on Video Processing and Quality Metrics for Consumer Electronics - VPQM Scottsdale, 13–15 January 2010

  6. Wang K, Brunnström K, Barkowsky M, Urvoy M, Sjöström M, Le Callet P, Tourancheau S, Andrén B: Stereoscopic 3D video coding quality evaluation with 2D objective metrics. Proc. SPIE Electronic Imaging 2013, 8648. Stereoscopic Displays and Applications XXIV, 86481L, San Francisco, March 12, 2013. doi: 10.1117/12.2003664

  7. Bensalma R, Larabi MC: Towards a perceptual quality metric for color stereo images. IEEE 17th International Conference on Image Processing Hong Kong, 26–29 September 2010

  8. Bosc E, Pepion R, Le Callet P, Koppel M, Ndjiki-Nya P, Pressigout M, Morin L: Towards a new quality metric for 3-D synthesized view assessment. IEEE J. Sel. Top. Sign. Proces. 2011, 5(7):1332-1343.

  9. Lebreton PR, Raake A, Barkowsky M, Le Callet P: Evaluating depth perception of 3D stereoscopic videos. IEEE J. Sel. Top. Sign. Proces. 2012, 6(6):710-720.

  10. Hanhart P, Ebrahimi T: Quality assessment of a stereo pair formed from two synthesized views using objective metrics. Proceedings of Seventh International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM 2013) Scottsdale, 30 January to 1 February 2013

  11. Xing L, You J, Ebrahimi T, Perkis A: Assessment of stereoscopic crosstalk perception. IEEE Trans. Multimedia 2012, 14(2):326-337.

  12. Maalouf A, Larabi MC: CYCLOP: a stereo color image quality assessment metric. 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1161-1164. Prague, 22–27 May 2011

  13. Seuntiëns P: Visual Experience of 3D TV, Thesis. 2006.

  14. Lambooij MTM, Ijsselsteijn WA, Heynderickx I: Visual discomfort in stereoscopic displays: a review. Proceedings of SPIE, Stereoscopic Displays and Virtual Reality Systems XIV San Jose, 1–13 January 2007

  15. Strohmeier D, Jumisko-Pyykkö S, Kunze K, Tech G, Bugdayci D, Oguz Bici M: Results of quality attributes of coding, transmission and their combinations, Mobile 3DTV Technical Report D4.3. 2010. Accessed on 13 January 2014

  16. Jin L, Boev A, Jumisko-Pyykkö S, Haustola T, Gotchev A: Novel stereo quality metrics, Mobile 3DTV Technical Report D5.5. 2011. Accessed on 14 January 2014

  17. Wandell BA: Foundations of Vision. Sunderland: Sinauer Associates; 1995.

  18. Blake R: A primer on binocular rivalry, including current controversies. Brain Mind 2011, 2(1):5-38.

  19. Knorr S, Ide K, Kunter M, Sikora T: Basic rules for good 3D and avoidance of visual discomfort. International Broadcasting Convention (IBC) Amsterdam, 8–13 September 2011

  20. Smirnov S, Gotchev A, Hannuksela M: Comparative analysis of local binocular and trinocular depth estimation approaches. Proc. of SPIE 7724 (2010). doi:10.1117/12.854765

  21. Baker S, Scharstein D, Lewis J, Roth S, Black M, Szeliski R: A database and evaluation methodology for optical flow. Proceedings of the IEEE International Conference on Computer Vision 243-246. Crete, Greece, 14–21 October 2007

  22. Egiazarian K, Astola J, Ponomarenko N, Lukin V, Battisti F, Carli M: New full-reference quality metrics based on HVS. International Workshop on Video Processing and Quality Metrics 4. Scottsdale, January 2006

  23. Ponomarenko N, Silvestri F, Egiazarian K, Carli M, Astola J, Lukin V: On between-coefficient contrast masking of DCT basis functions. International Workshop on Video Processing and Quality Metrics for Consumer Electronics 25-26. Scottsdale, January 2007

  24. Zhang L, Zhang L, Mou X, Zhang D: FSIM: a feature similarity index for image quality assessment. IEEE Trans. Image Process. 2011, 20(8):2378-2386.

  25. Kovesi P: Image features from phase congruency. Videre: J Comput Vision Res 1999. MIT Press. Volume 1, Number 3

  26. Wang Z, Bovik A, Sheikh H, Simoncelli E: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13(4):600-612. 10.1109/TIP.2003.819861

  27. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition. Heidelberg: Springer; 2009.

  28. Wiegand T, Sullivan GJ, Bjøntegaard G, Luthra A: Overview of the H.264/AVC video coding standard. IEEE Trans. Circ. Syst. Video Tech. 2003, 13: 560.

  29. McCabe TJ: A complexity measure. IEEE Trans. Soft. Eng. 1976, SE-2(4):308.

Author information

Corresponding author

Correspondence to Lina Jin.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Jin, L., Boev, A., Egiazarian, K. et al. Quantifying the importance of cyclopean view and binocular rivalry-related features for objective quality assessment of mobile 3D video. J Image Video Proc 2014, 6 (2014).
