Stereoscopic visual saliency prediction based on stereo contrast and stereo focus

In this paper, we exploit two characteristics of stereoscopic vision: the pop-out effect and the comfort zone. We propose a visual saliency prediction model for stereoscopic images based on stereo contrast and stereo focus models. The stereo contrast model measures stereo saliency based on the color/depth contrast and the pop-out effect. The stereo focus model describes the degree of focus based on monocular focus and the comfort zone. After obtaining the values of the stereo contrast and stereo focus models in parallel, an enhancement based on clustering is performed on both values. We then apply a multi-scale fusion to form the respective maps of the two models. Last, we use a Bayesian integration scheme to integrate the two maps (the stereo contrast and stereo focus maps) into the stereo saliency map. Experimental results on two eye-tracking databases show that our proposed method outperforms the state-of-the-art saliency models.


Introduction
Visual attention is a very important research topic in computer vision, as it is widely used in the field for many tasks, such as object detection [1] and video/image retrieval [2,3]. Computational models of visual attention, which simulate the attention mechanism of humans, have been built by researchers in many fields, such as visual neuroscience, computer vision, and multimedia processing [4]. Visual attention enables the discovery of an object or region that efficiently represents a scene and, thus, harnesses complex vision problems, such as scene understanding.
The models of visual attention are usually divided into two categories: bottom-up and top-down [5]. The bottom-up model is a rapid, data-driven, task-independent process and is usually feed-forward. A prototypical example of bottom-up attention is looking at a scene that has only one horizontal bar among several vertical bars, in which attention is immediately drawn to the horizontal bar [6]. The top-down model considers high-level cognitive features to quantify visual saliency, such as human faces [7] and prior knowledge about the target [8]. Of these top-down features, prior knowledge about the target is difficult to model. Recently, a number of saliency models have incorporated both top-down and bottom-up feature detection in an effort to improve prediction accuracy [9]. Wei et al. [10] turned to background priors to guide generic object-level saliency detection. Goferman et al. [11] and Judd et al. [7] integrated high-level information, making their methods potentially suitable for specific tasks.
These models are mainly designed for 2D images. With the rapid development of 3D technology, many devices for stereoscopic capture have appeared. For example, the Panasonic 3D camera captures stereoscopic images and video for 3D movies. The Kinect-1 device by Microsoft for the Xbox captures both the color map and the depth map at the same time, from which stereoscopic images can be generated (the depth map of the Kinect-1 may have holes that need to be smoothed [12], which may cause noise). These devices enable a number of applications for 3D images or videos, such as 3D rendering [13], 3D visual quality assessment [14], and 3D video detection [15]. These 3D applications increase the need for saliency modeling for 3D visual content.
Stereo saliency models can be classified into two categories according to the way they use the depth factor: stereo-vision models and depth-saliency models.
Stereo-vision models take into account the mechanisms of stereoscopic perception in the human visual system (HVS). This type of model considers the characteristics of depth factors and color information. Bruce and Tsotsos extended the 2D model, which uses a visual pyramid processing architecture [16], by adding neuronal units to model stereo vision; however, they did not propose a computational model in that study. To our knowledge, designing a stereo-vision model is very difficult, and we found only two such models, in [17], because the mechanisms of stereo vision still pose several research challenges, such as how to build and then apply a model of the stereoscopic vision mechanism.
Depth-saliency models take depth saliency as a feature of saliency measurement, and methods of formulating and using depth saliency fall into two further categories. One category relies on a depth-saliency map (DSM) [17,18]. The depth saliency is extracted from the depth map or disparity map (usually based on depth contrast or the depth pop-out effect) to create an additional depth-saliency map. The final result combines the 2D saliency maps (from 2D saliency models usually using color contrast, intensity, or image texture) and the depth-saliency maps (DSM). The other category builds the model directly. In other words, it builds the stereoscopic visual saliency prediction model by taking the mechanisms of stereoscopic perception in the HVS into account. It designs the model by fusing the depth and 2D features into the saliency measurement, based on the mechanisms of the HVS [19].
Kim et al. [15] designed a stereoscopic visual attention algorithm for 3D video based on multiple perceptual stimuli, which assumes that pixels closer to observers and at the front of the screen are more salient. Niu et al. [20] explored stereo saliency by analyzing the characteristics of stereo vision and proposed a depth saliency model for a depth map that would expand the 2D saliency model for stereo saliency analysis. However, the proposed model does not fully explore the relationship between the depth model and the 2D saliency model. Fan et al. [19] proposed a stereo saliency model based on region-level depth, color, and spatial information. Wang et al. [17] proposed a computational model that takes the depth factors as an additional visual dimension and provides a public database with a ground truth of eye-tracking data. Fang et al. [21] proposed a visual attention model for stereoscopic images based on the contrast between low-level features. However, they did not consider the characteristics of human stereo vision, such as the pop-out effect or 3D fatigue.
According to the above analysis, the key issue for a 3D visual saliency prediction model is how to adopt the depth factor and how to combine the depth factor with 2D information based on the mechanisms of the HVS. In our earlier work [22], a novel saliency model for stereoscopic images was proposed. However, this model did not fully exploit the HVS characteristics of the pop-out effect and the comfort zone and only treated the depth information as a weight. In this paper, we analyze in depth two characteristics of stereoscopic vision: the pop-out effect and the comfort zone. Based on these characteristics, we design two stereo-vision models for visual saliency prediction: one based on stereo contrast and the other based on stereo focus. We enhance these two models by clustering and then integrate them into the final stereoscopic saliency map.
The main contributions of this paper are as follows: 1. We propose a stereo contrast model for detecting stereo saliency. This model detects saliency based on color and depth contrast and the pop-out effect. 2. We propose a stereo focus model for detecting stereo saliency. This model detects the degree of focus via monocular focus and the comfort zone. 3. We propose an enhancement to increase the performance of the stereo contrast and stereo focus models.
The rest of the paper is organized as follows: In Section 2, we introduce the two mechanisms of stereo human vision for stereo saliency analysis. Section 3 proposes a new stereo visual saliency prediction method based on the stereo contrast and stereo focus models. Section 4 describes a quantitative comparison of the proposed model and state-of-the-art algorithms. Section 5 provides the research outcomes and future work.

Methodology
When watching a stereoscopic image, people experience different effects, such as the pop-out effect and the deep-in effect [23]. The pop-out effect occurs when an object looks like it is going to pop out of the screen, and the deep-in effect occurs when an object looks like it is behind the screen. These two effects can be produced by controlling the parallax of objects, using negative or positive parallax as shown in Fig. 1. This finding is based on recent research on human stereo vision [24]. These effects cause viewers to feel immersed in the image, which is the most attractive aspect of stereoscopic images. Moreover, studies show that an object with the pop-out effect often catches a viewer's attention [25]. This phenomenon provides a useful depth cue for stereo saliency analysis, since objects with a pop-out effect are usually more salient than objects that have a deep-in effect. We therefore assume that an object with the pop-out effect tends to be more salient than the other objects. In addition, we use color/depth contrast for the stereo saliency analysis. Hence, we propose a stereo contrast model to simulate the pop-out effect by combining the color/depth contrast and a pop-out value.
Another property of stereo vision is the viewing comfort zone based on the binocular information. Viewers may experience fatigue when they spend a long time watching stereoscopic images or video. The reason for this may be accommodation-vergence conflict or too much divergence [26,27]. A good stereoscopic image needs to minimize 3D viewer fatigue. This conflict increases as the perceived depth of an object becomes further away from the screen, as shown in Fig. 2. The zone close to the screen plane is called the comfort zone. Photographers usually make sure the more important objects are in the comfort zone when they capture a stereoscopic image or video. This is another depth cue for saliency analysis: the object in the comfort zone tends to be more salient than other zones. Studies show that the object near the zero disparity plane is more salient than those which are away from the zero disparity plane, which can be described by the linear formulation [20]. When a person watches one salient object, this object should be in the focus region [9]. According to the above phenomenon, in the perspective of the comfort zone, this object should meet two conditions: one is that it is located in or near the comfort zone and the second is that it is in the focus region. Therefore, we use monocular focus and comfort zone to analyze stereo saliency. The monocular focus assumes that the salient object is usually located in the focus region. The comfort zone is treated as a weight to adjust the importance of the object located in the focus region. The proposed stereo focus model is based on the comfort zone and monocular focus.
To describe the two mechanisms of the human visual system, the pop-out effect and the comfort zone, we build our proposed stereo-vision model on a combination of the stereo contrast and stereo focus models. The stereo saliency of an object can be determined by the values calculated from the stereo contrast and stereo focus models. However, in some cases, the values obtained by these two models can be substantially different. For example, if an object has negative parallax and is far from the comfort zone, or if the object has zero parallax, the two values are quite different. To obtain the benefits of both models and detect the saliency of different stereoscopic content, our stereo visual saliency prediction model considers both the stereo contrast model and the stereo focus model.

Proposed stereoscopic visual saliency prediction model
The proposed stereoscopic visual saliency prediction framework is shown in Fig. 3. To capture the structural information of the stereoscopic image, we first adopt the simple linear iterative clustering (SLIC) algorithm [28] for segmentation. The SLIC algorithm segments an input image (the left image) into multiple uniform and compact superpixels. By controlling the number of superpixels in the SLIC algorithm, the image is segmented at multiple scales. Then, we calculate the saliency values individually by applying the stereo contrast and stereo focus models to each superpixel, based on the left image and the disparity map. An enhancement based on clustering is then applied, which increases the performance of the two models according to our experiments. Multi-scale fusion is then used to form the pixel-level stereo contrast and stereo focus maps. Last, the two maps are integrated by Bayesian integration to form the final stereo saliency map.

Pre-processing
In this paper, we convert the stereoscopic images from the RGB color space to the hue-saturation-value (HSV) color space. Compared to the RGB color space, the HSV color space is more consistent with the characteristics of human visual attention, and using it leads to a saliency value with higher accuracy [27].
As mentioned previously, we conduct multi-scale visual saliency prediction. Based on the number of superpixels, the input image (left image) is segmented into a set of non-overlapping superpixels at scale s using the SLIC algorithm, where s represents the scale of the segmentation. We chose the SLIC algorithm as the segmentation method because it is a fast and highly efficient segmentation algorithm that is sensitive to the boundary of the object [29]. Each superpixel t is described by its mean color feature {H, S, V}, the coordinates of the superpixel {x, y}, and its mean disparity value d: $x_t = \{H, S, V, x, y, d\}_t$. The entire image can be represented as the set of these descriptors over all of its superpixels.
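The per-superpixel descriptor can be illustrated with a minimal Python sketch. It assumes the segmentation is already available as a label map (for example from any SLIC implementation), and the function name `superpixel_descriptors` and the nested-list image layout are hypothetical choices made for this illustration only. Hue is averaged arithmetically here, which ignores its circular nature and is a simplification.

```python
from collections import defaultdict

def superpixel_descriptors(hsv, disparity, labels):
    """Return {label: (H, S, V, x, y, d)}, each averaged over the superpixel.

    hsv       -- rows of (h, s, v) tuples for the left image
    disparity -- rows of disparity values, same shape as hsv
    labels    -- rows of integer superpixel labels, same shape as hsv
    """
    sums = defaultdict(lambda: [0.0] * 6)
    counts = defaultdict(int)
    for y, row in enumerate(labels):
        for x, t in enumerate(row):
            h, s, v = hsv[y][x]
            # Accumulate color, position, and disparity for superpixel t.
            for k, value in enumerate((h, s, v, x, y, disparity[y][x])):
                sums[t][k] += value
            counts[t] += 1
    return {t: tuple(total / counts[t] for total in acc)
            for t, acc in sums.items()}
```

Running the function once per segmentation scale would give one descriptor set per scale.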

Stereo contrast model
We propose the stereo contrast model based on the color/depth contrast and the pop-out effect to calculate the saliency value (using a disparity map to analyze the pop-out effect). According to the human visual system, human attention is sensitive to a contrast region that includes color contrast and depth contrast [25]. The colors of the salient region are distinctive and contrast with the other regions. The depth discontinuity region may attract the viewer's attention when view positions or angles are changed. Therefore, the distinctive region may attract the viewer's attention to color/depth information. According to [30,31], humans pay more attention to those image regions that contrast strongly with their surroundings. Based on our observation, the distance between neighboring regions and the area of the region play an important role in human visual attention. To simulate the above mechanism, we define the contrast value to measure the contrast of stereoscopic information.
Let DC(i, j) be the Euclidean distance between the vectorized superpixels i and j in HSV color space and DD(i, j) be the Euclidean distance between superpixels i and j in disparity. DC and DD are normalized to the range [0, 1]. We define the contrast measure C(i, j) between superpixels i and j by combining DC(i, j) and DD(i, j), where a is a control weight that balances the color and disparity contrast. Although several approaches [17,18,32] that combine depth-saliency maps with 2D visual features have been proposed, there is still no specific, standardized way of combining saliency maps from depth with 2D visual features. The work in [17,18] treats depth with the same importance as color. The work in [32] uses an adaptive weight for color and depth. In our experiments, we adopt a straightforward approach to merge color and depth contrast, treating depth contrast with the same importance as color contrast. We set a = 0.5 empirically. Let L(i, j) be the Euclidean distance between the positions of superpixels i and j, normalized to the range [0, 1]. According to the analysis above, we define the stereo contrast measure S(i, j) between a pair of superpixels i and j based on color, disparity, and spatial information, where $\omega_j$ is the number of pixels in superpixel j and c is a control value for spatial information (c = 3 in our implementation). The saliency of a superpixel z is then defined from its stereo contrast measure over a search range R, and $SC_R(z)$ denotes the saliency value of superpixel z in that search range. Figure 4 shows the global and local search ranges. We then compute the global and local saliency maps. When we compute the stereo contrast saliency value of the current superpixel, we do not use all superpixels in the search range. We only choose the K most similar superpixels in the search range and use them to compute the stereo contrast saliency of the current superpixel.
This choice is based on our experiments and on [22]: using the K most similar superpixels to compute the stereo contrast prevents the stereo contrast saliency value of an abnormal superpixel from becoming too large. In other words, to measure a superpixel's stereo contrast, there is no need to incorporate its contrast with all other superpixels in the search range. If even the most similar superpixels are extremely different from the current superpixel, then clearly all image superpixels are extremely different from it. Therefore, we search for the K most similar superpixels k = {1, 2, ..., K}, k ∈ R, where R is the search range. The local search is related to the search range R. (In practice, all distances are normalized to [0, 1] and we set R = 0.3 empirically.) Based on our experimental observations, we set K = 15 empirically. The local-global stereo contrast saliency of superpixel z combines the saliency values computed over the local and global search ranges. According to the pop-out effect described in Section 2, a region that has the pop-out effect may attract people's attention. The pop-out effect therefore describes the importance of a superpixel in stereoscopic saliency analysis, and we treat it as a weight to enhance the stereo contrast saliency. Based on the work in [20] and our experiments, the pop-out effect of a superpixel can be represented by an exponential function of the disparity. We use d to represent the disparity, and $d_z$ is the mean disparity of superpixel z, normalized to [−1, +1]. Let o be the pop-out value for superpixel z. If $d_z < 0$, the superpixel has a pop-out effect and its saliency should increase; if $d_z > 0$, the superpixel has a deep-in effect and its saliency should decrease. The pop-out value o is expressed as this exponential function of $d_z$. We use the local-global stereo contrast and the pop-out value to simulate the pop-out effect. Figure 5 is an example of a stereo contrast map. The stereo contrast SC(z) relies on the color/depth contrast, distance contrast, superpixel area, and pop-out value.
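The contrast computation described in this section can be sketched in Python as follows. The defining equations are not reproduced in this text, so the specific forms used here are assumptions made only for illustration: the contrast is taken as the convex combination a·DC + (1 − a)·DD, the pairwise stereo contrast as ω_j·C/(1 + c·L), the saliency as the mean over the K most similar superpixels, and the pop-out value as exp(−d_z). The function names and the dictionary layout of a superpixel are likewise hypothetical.

```python
import math

def contrast(p, q, a=0.5):
    """Assumed form: C = a*DC + (1 - a)*DD, both distances scaled to [0, 1]."""
    dc = math.dist(p["hsv"], q["hsv"]) / math.sqrt(3)  # HSV components in [0, 1]
    dd = abs(p["d"] - q["d"]) / 2.0                    # disparity in [-1, 1]
    return a * dc + (1.0 - a) * dd

def stereo_contrast(sp, z, a=0.5, c=3.0, K=15, R=None):
    """Saliency of superpixel z from its K most similar superpixels.

    R=None searches globally; otherwise only superpixels whose normalized
    spatial distance to z is at most R are considered (local search).
    """
    candidates = []
    for j, q in enumerate(sp):
        if j == z:
            continue
        dist = math.dist(sp[z]["pos"], q["pos"]) / math.sqrt(2)  # pos in [0, 1]^2
        if R is not None and dist > R:
            continue
        cij = contrast(sp[z], q, a)
        # Assumed form of S(z, j): area-weighted contrast damped by distance.
        candidates.append((cij, q["area"] * cij / (1.0 + c * dist)))
    candidates.sort(key=lambda item: item[0])  # most similar = lowest contrast
    top = candidates[:K]
    return sum(s for _, s in top) / len(top) if top else 0.0

def pop_out(d_z):
    """Assumed exponential weight: above 1 for d_z < 0, below 1 for d_z > 0."""
    return math.exp(-d_z)
```

Under these assumptions, a weighted value such as `pop_out(sp[z]["d"]) * stereo_contrast(sp, z)` raises the saliency of superpixels with negative disparity and lowers it for those with positive disparity, which matches the qualitative behaviour described in the text.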

Stereo focus model
We propose a stereo focus model based on monocular focus and the comfort zone. As discussed for the comfort zone in Section 2, human visual attention can take the initiative to focus on the salient region by using monocular focus. Monocular focus can be detected from the focal blur [33], and we add the comfort zone to improve its accuracy. For monocular focus, sharp edges of an object may be spatially blurred when projected on the image plane. The degree-of-blur model [9] can measure the focus/defocus of the image edges by computing the difference-of-Gaussian (DoG) operation at different scales for the edge pixels. The monocular focus of the edge pixel p is $F_{2D}(p)$. This value is sensitive to the edge pixels and is easy to implement. However, it is a 2D focus measure and is only useful for the edge pixels of the image. For stereoscopic analysis, we expand this model to measure the stereoscopic focus of edges by combining the monocular focus and the comfort zone. Then, we expand the stereoscopic focus model from edge to region. According to our experiments, we use a comfort value to measure the comfort zone. The comfort value is a weight that indicates the object's importance by measuring the comfort zone. When multiple objects have zero or small disparity in the stereoscopic images and are located in the comfort zone, our observation is that their comfort values are similar. When they are far away from the zero disparity plane, their comfort values decrease sharply. Based on this observation, the comfort value follows a Gaussian distribution. v(p) denotes the comfort value of pixel p and is expressed as a function of $d_p$, the disparity of pixel p. $\sigma_1$ is the range of positive and negative disparity, and α controls the weight of negative disparity. For negative disparity, we cannot directly follow the comfort zone model [20] to design our comfort value.
The reason for this is that there is a conflict between the pop-out effect and the comfort zone. If we directly use the comfort zone model [20] to measure saliency, in some cases, the stereo contrast model and the stereo focus model may give quite different results for an object with negative disparity, which will reduce the performance of our proposed model. For example, if a pixel has a large negative disparity and is far from the comfort zone, its pop-out value becomes large and its comfort value is small. After the fusion of the two models, the results may not be reliable. To reduce the errors caused by such conflicts, we increase the importance of the negative disparity in the comfort zone by using α to balance the comfort value of the negative disparity. There are two benefits to this modification. First, it increases the importance of the pop-out effect for an object with negative disparity. Second, it still keeps a high importance for an object in the comfort zone in stereoscopic saliency analysis. According to our experiments, our modification of the comfort zone works in most cases and improves the performance of the proposed model. We use the comfort value as a weight, because it describes the importance of a pixel in the stereo saliency analysis. We define the stereo focus value of an edge pixel p by combining the monocular focus value $F_{2D}$ with the comfort value. It would be ideal to analyze the saliency of each object as a whole. However, it is difficult to segment an object accurately. Therefore, we compute the stereo saliency at the superpixel level instead. We filter the stereo focus values of the edge pixels with a Gaussian kernel whose σ is equal to 1° of visual angle. This processing effectively reduces noise, such as isolated points. The stereo focus value of superpixel t relies on the stereo focus degree of all its pixels.
Further, our observation is that a region with a sharper boundary usually stands out as being more salient. We set the boundary sharpness as a weight value, which can be represented by the stereo focus value of the boundary pixels. In the formulation of the stereo focus value SF(t) of superpixel t, $B_t$ represents all the edge pixels in superpixel t, m is the number of edge pixels, and n is the number of all the pixels in superpixel t. The first term on the right-hand side of Eq. 9 is the average of the stereo focus values of all the edge pixels. The second term is the average of the stereo focus values of all the pixels in superpixel t. The stereo focus model thus combines the monocular focus and the comfort value. Figure 6 shows an example of the stereo focus map.
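The behaviour of the comfort value and the superpixel-level stereo focus can be illustrated with a short Python sketch. The exact equations are not reproduced in this text, so everything below is an assumption chosen to match the description: the comfort value is a zero-mean Gaussian of the disparity in which negative disparities are scaled by α before the Gaussian is applied, the pixel-level stereo focus is the product of the monocular focus and the comfort value, and the two averages of Eq. 9 are multiplied. All function names are hypothetical.

```python
import math

def comfort_value(d, sigma1_sq=0.8, alpha=0.5):
    """Assumed Gaussian comfort weight; alpha < 1 softens the penalty for d < 0."""
    scaled = alpha * d if d < 0 else d
    return math.exp(-(scaled ** 2) / (2.0 * sigma1_sq))

def stereo_focus_pixel(f2d, d, sigma1_sq=0.8, alpha=0.5):
    """Assumed combination: monocular focus weighted by the comfort value."""
    return f2d * comfort_value(d, sigma1_sq, alpha)

def stereo_focus_superpixel(focus_values, edge_mask):
    """Assumed form of SF(t): (mean over edge pixels) * (mean over all pixels).

    focus_values -- stereo focus value of every pixel in the superpixel
    edge_mask    -- True where the corresponding pixel is an edge pixel
    """
    edge = [f for f, is_edge in zip(focus_values, edge_mask) if is_edge]
    if not edge or not focus_values:
        return 0.0
    return (sum(edge) / len(edge)) * (sum(focus_values) / len(focus_values))
```

With α = 0.5, a pixel at disparity −0.5 receives a higher comfort value than one at +0.5, which reflects the stated intent of keeping objects with negative disparity important.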

Enhancement
The stereo contrast model and the stereo focus model operate at the superpixel level. To make the salient region more distinctive and easier to separate, we propose an enhancement based on clustering for the two models. In practice, we use the k-means algorithm to cluster the N superpixels into K clusters according to their values. For simplicity, we use SV to denote either SC or SF. To enlarge the difference between neighboring clusters, the value of each superpixel t belonging to cluster k (k = 1, 2, 3, …, K) is modified by considering its own value and the values of the other superpixels in cluster k. Here, $\{k_1, k_2, \ldots, k_{N_c}\}$ denotes the $N_c$ superpixels in cluster k and t is one superpixel in cluster k. δ is the weight parameter. Sm(t) is the modified value of superpixel t belonging to cluster k. $r_{t k_i}$ is a weight value that relies on the superpixels t and $k_i$. The first term on the right-hand side of the equation is the weighted average of all the superpixels in cluster k other than superpixel t, and the other term is the weighted value of superpixel t. The weight is sensitive to the spatial relationship of the superpixel pair: $SD(k_i, t)$ is the spatial distance between the superpixels $k_i$ and t, and $\sigma_2$ is a weight that controls the range of the spatial information. After re-calculating the value of each superpixel, the values of the important superpixels in cluster k are enhanced. Figure 7 gives an example in which two maps computed by the stereo contrast and stereo focus models are processed by the enhancement.
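A minimal Python sketch of this enhancement step is given below. It assumes the cluster label of each superpixel has already been obtained (for example from any k-means implementation), and it assumes a particular form for the missing equations: a Gaussian spatial weight exp(−SD²/(2σ₂²)) and a blend in which δ weights the cluster average and 1 − δ weights the superpixel's own value. Both forms and the name `enhance` are illustrative assumptions, not the exact formulation of the model.

```python
import math

def enhance(values, positions, clusters, delta=0.5, sigma2_sq=0.6):
    """Blend each superpixel value with a spatially weighted cluster average.

    values    -- SV(t) for every superpixel (stereo contrast or stereo focus)
    positions -- (x, y) centre of every superpixel, normalized to [0, 1]
    clusters  -- cluster index of every superpixel (e.g. from k-means)
    """
    enhanced = []
    for t, (v_t, pos_t, k_t) in enumerate(zip(values, positions, clusters)):
        num = den = 0.0
        for i, (v_i, pos_i, k_i) in enumerate(zip(values, positions, clusters)):
            if i == t or k_i != k_t:
                continue
            # Assumed weight: closer superpixels of the same cluster count more.
            r = math.exp(-math.dist(pos_t, pos_i) ** 2 / (2.0 * sigma2_sq))
            num += r * v_i
            den += r
        # A superpixel that is alone in its cluster keeps its own value.
        cluster_avg = num / den if den > 0 else v_t
        enhanced.append(delta * cluster_avg + (1.0 - delta) * v_t)
    return enhanced
```

The defaults δ = 0.5 and σ₂² = 0.6 follow the parameter values reported in the experimental setup.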
Since the content of each superpixel may have more than one object or texture, a single scale segmentation scheme is not suitable for objects of different sizes. We conduct multi-scale segmentation based on controlling the number of superpixels in the SLIC algorithm. At each superpixel scale size layer, both the stereo contrast and stereo focus models are individually applied to calculate their respective saliency values. A multi-scale pixel-level fusion is introduced to fuse the results for each model. Through this fusion, the saliency value for each pixel is calculated based on multi-scale saliency and its texture information.
To combine the values from the different scales, we adopt the method of [34] to fuse the multi-scale layered values. This method uses the textural feature of the pixel and of its corresponding superpixel as the weight with which the multi-scale values are averaged. For each pixel, the saliency value therefore relies on the saliency value at each scale and its corresponding weight. The weight reflects the textural information through the difference between the current pixel value and the superpixel value.
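The weighted averaging across scales can be sketched in Python as follows. The weight used here, the reciprocal of the absolute feature difference between the pixel and its superpixel plus a small constant, is an assumption made for illustration; the exact weight of [34] is not reproduced in this text. The name `fuse_pixel`, the scalar feature, and the `eps` parameter are hypothetical.

```python
def fuse_pixel(pixel_feature, scale_entries, eps=1e-3):
    """Fuse the saliency of one pixel over several segmentation scales.

    pixel_feature -- scalar feature of the pixel (e.g. its intensity)
    scale_entries -- one (superpixel_feature, superpixel_saliency) pair per
                     scale, for the superpixel that contains the pixel
    eps           -- small constant that avoids division by zero
    """
    num = den = 0.0
    for sp_feature, saliency in scale_entries:
        # Assumed weight: scales whose superpixel resembles the pixel count more.
        weight = 1.0 / (abs(pixel_feature - sp_feature) + eps)
        num += weight * saliency
        den += weight
    return num / den
```

For example, with four scales of {600, 800, 1000, 1200} superpixels, `scale_entries` would hold four pairs per pixel.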

Bayesian integration scheme
At this stage, two saliency maps have been built based on the stereo contrast and stereo focus models. The next step is to integrate them; however, as has been discussed [35], good individual saliency maps may become worse when they are combined by using weights. Therefore, we adopt a Bayesian model to integrate the two saliency maps [36]. In the Bayesian model, each pixel's saliency is estimated by the posterior probability. The Bayesian integration approach is suitable for dealing with two saliency maps: when one saliency map is computed, the other saliency map is treated as the prior while the current saliency map is used to compute the likelihood. The specific steps are as follows. To compute the saliency map $S_2'$ with the Bayesian formula, one saliency map $S_1$ is used to compute the prior probability and the other saliency map $S_2$ is used to compute the likelihood. After this, we use the saliency maps in the formula in the opposite way: $S_2$ is used to compute the prior and $S_1$ is used to compute the likelihood. In this way, the saliency map $S_1'$ is computed. Finally, $S_1'$ and $S_2'$ are combined to obtain the final saliency map. Using this approach, it is possible to avoid reintroducing the noise in different saliency features, thereby obtaining a more accurate posterior probability. This model is very robust with regard to various types of images. After Bayesian integration, we use center bias as a post-processing step to obtain the final stereo saliency map, because many datasets place the salient object or region in the center of the image [37]. Figure 7d is an example of the saliency map after Bayesian integration and center bias. The complete visual saliency prediction algorithm thus consists of pre-processing, the stereo contrast and stereo focus models, enhancement, multi-scale fusion, and Bayesian integration followed by center bias.
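The prior/likelihood exchange described above can be sketched in Python. Several details are assumptions, since the text does not spell them out: the prior map is split into foreground and background at its mean value, the likelihoods are estimated from histograms of the other map, and the two posteriors are averaged. The helper names and the `bins` parameter are hypothetical, maps are given as flat lists of values in [0, 1], and the center bias step is not included.

```python
def _posterior(prior, other, bins=10):
    """Posterior of 'foreground' per pixel: prior from one map, likelihood from the other."""
    threshold = sum(prior) / len(prior)  # assumed foreground/background split
    fg_hist, bg_hist = [0] * bins, [0] * bins

    def bin_of(value):
        return min(int(value * bins), bins - 1)

    for p, o in zip(prior, other):
        (fg_hist if p >= threshold else bg_hist)[bin_of(o)] += 1
    n_fg, n_bg = max(sum(fg_hist), 1), max(sum(bg_hist), 1)

    result = []
    for p, o in zip(prior, other):
        like_fg = fg_hist[bin_of(o)] / n_fg  # p(other value | foreground)
        like_bg = bg_hist[bin_of(o)] / n_bg  # p(other value | background)
        den = p * like_fg + (1.0 - p) * like_bg
        result.append(p * like_fg / den if den > 0 else 0.0)
    return result

def bayesian_integration(s1, s2, bins=10):
    """Combine two saliency maps, each serving once as prior and once as likelihood."""
    post_a = _posterior(s1, s2, bins)  # S1 as prior, S2 as likelihood
    post_b = _posterior(s2, s1, bins)  # S2 as prior, S1 as likelihood
    return [(a + b) / 2.0 for a, b in zip(post_a, post_b)]
```

Averaging keeps the output in [0, 1]; a sum followed by normalization would be an equally plausible way to combine the two posteriors.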

Results and discussion
In this section, we evaluate the performance of our proposed model on two eye-tracking datasets [17,18]. One supplies high-quality stereoscopic images and the other supplies low-quality stereoscopic images generated by Kinect-1. First, we present the quantitative metrics of evaluation for the proposed method in Section 4.1. To demonstrate the effect of the different component combinations of our algorithm, a performance comparison is given in Section 4.2. Last, we give a performance evaluation by comparing the proposed method to state-of-the-art methods in Section 4.3.

Experimental setup
Our stereo saliency framework is based on the superpixel. In the experiment, we set the segmentation scale of superpixels in the SLIC algorithm. The number of superpixels was set as {600, 800, 1000, 1200}. The SLIC algorithm automatically adjusts the shape of each superpixel based on the segmentation scale and texture information of the image, which makes it sensitive to the boundary of the object. In stereo contrast, all distances are normalized to [0, 1] and we set R = 0.3 empirically. The main parameters of our proposed method are the number of clusters K and δ in Eq. (10). In the experiment, we varied K (K = 6, 8, 10, 12) and δ (δ = 0.4, 0.5, 0.6, 0.7), and observed that the saliency results were insensitive to both parameters. We set the number of clusters K = 10 and δ = 0.5. The parameters $\sigma_1$ and $\sigma_2$ are given in Eqs. (7) and (11); we varied these values over [0.01, 3] and observed the saliency results. We then set $\sigma_1^2 = 0.8$ and $\sigma_2^2 = 0.6$. In Eq. (7), α is set to 0.5, which is the same as in [22].
We used one of the databases from [17]. This database is consistent with the characteristics of the HVS and includes 18 high-quality stereoscopic images of various types (e.g., indoor scenes, outdoor scenes, and scenes containing various numbers of objects). Some images in the database were collected from the Middlebury 2005/2006 dataset [38], which has high-accuracy depth maps, while others were produced from videos recorded using a Panasonic AG-3DA1 3D camera, which supplies high-quality left/right images. To avoid 3D fatigue resulting from conflict in the depth field (for example, one object is seen by the left eye but missed by the right eye), the degree of vergence in human vision was considered within the stereoscopic 3D viewing environment in this eye-tracking experiment. The disparity of the stereoscopic images used is within the comfortable viewing zone. The conflict in different depth fields will not be detected by observers during the eye-tracking experiments. The gaze points are recorded by the eye-tracker and processed by a Gaussian kernel to generate the fixation density maps, which are used as the ground-truth maps.
Figure caption (partial): b The maps computed by the stereo contrast and stereo focus models. c The maps after clustering. d Final saliency map and ground truth
The other eye-tracking database was published in [18]. This database supplies low-quality stereoscopic images compared with [17] and has 600 stereoscopic images that include outdoor and indoor scenes. These stereoscopic images generated by Kinect-1 are diverse in terms of the number and size of objects and the degree of interaction or activity depicted. The stereoscopic images only have a resolution of 640 × 480 and may have some noise because the depth map from the Kinect-1 has some holes and needs to be smoothed. The stereoscopic image pair is produced by pre-processing, calibration, and post-processing. The eye-tracking data are captured in both 2D and 3D free-viewing experiments by the eye-tracker from 80 participants (ranging in age from 20 to 33 years old). Human fixation maps are constructed from the fixations of viewers to globally represent the spatial distribution of human fixations. Then, a Gaussian kernel is used to obtain the continuous fixation density maps as the ground-truth maps. This dataset supplies 2D and 3D fixation maps. To facilitate a comparison, we used the 3D fixation maps as the stereoscopic 3D ground-truth maps.
To quantitatively evaluate the performance of the proposed model, we applied quantitative measures similar to those used in [17]. The performance of the proposed model was measured by comparing the saliency map with the ground-truth map supplied by the database. Because there are two images (left and right) for any stereoscopic image pair, we used the saliency map of the left image for comparison [17]. The area under the receiver operating characteristic (ROC) curve (AUC) [39] and the correlation coefficient (CC) were used to evaluate the quantitative performance of the proposed stereo visual saliency prediction model. For the AUC, human fixations were considered to be the positive set, and some points sampled from the image formed the negative set.
The saliency map S was then treated as a binary classifier to separate the positive samples from the negatives. By thresholding the saliency map and plotting the true positive rate versus the false positive rate, an ROC curve was generated for each image. Then, the ROC curves were averaged over all images, and the area underneath the final ROC curve was calculated as the AUC [40]. Perfect prediction corresponds to a score of 1, while a score of 0.5 indicates chance level. To compute the AUC, each eye fixation density map and saliency map was normalized to [0, 1]. In practice, the thresholds were taken from the range [0.01, 1]. The CC measures the strength of the linear relationship between the predicted saliency map and the ground-truth saliency map. When the CC is close to +1 or −1, there is an almost perfectly linear relationship between the two variables.
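The two measures can be sketched as follows. This is a simplified, per-image illustration rather than the exact evaluation protocol: it computes the area under a single ROC curve with the trapezoidal rule, whereas the procedure described above averages the ROC curves over all images first. The function names and the flat-list inputs (saliency values at the positive and negative sample points, and flattened maps for the CC) are assumptions.

```python
import math


def roc_auc(pos_scores, neg_scores, thresholds):
    """Area under the ROC curve traced by thresholding saliency values.

    pos_scores: saliency values at human fixation points (positive set).
    neg_scores: saliency values at sampled non-fixation points (negative set).
    """
    # Anchor the curve at (0, 0) and (1, 1) so the area is well defined.
    points = [(0.0, 0.0), (1.0, 1.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    points.sort()
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoidal rule
    return area


def pearson_cc(map_a, map_b):
    """Linear (Pearson) correlation coefficient between two flattened maps."""
    n = len(map_a)
    mean_a = sum(map_a) / n
    mean_b = sum(map_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(map_a, map_b))
    var_a = sum((a - mean_a) ** 2 for a in map_a)
    var_b = sum((b - mean_b) ** 2 for b in map_b)
    return cov / math.sqrt(var_a * var_b)


# Thresholds spanning [0.01, 1], as in the text.
thresholds = [i / 100 for i in range(1, 101)]
```

With these definitions, a saliency map that ranks every fixation point above every sampled negative point yields an AUC of 1, and identical score distributions for both sets yield 0.5.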

Performance comparison with different combinations of components
Four main components were compared: stereo contrast, stereo focus, enhancement, and integration via the Bayesian scheme. The performance of different combinations of components is shown in Tables 1 and 2. SCM is the saliency map based on stereo contrast followed by multi-scale fusion. SFM is the saliency map based on stereo focus followed by multi-scale fusion. SCE is the saliency map based on stereo contrast followed by enhancement. OurWE is the proposed stereo saliency map without enhancement. Our model is the proposed stereo saliency map. Table 1 indicates that SFM performs better than SCM on the database in [17] in AUC and CC. Table 2 shows that SCM performs better than SFM on the database in [18] in AUC and CC. The two models performed differently on each database, so using either one alone to form the saliency map would not result in good performance. Tables 1 and 2 show that the enhancement slightly improves the performance of the two models in AUC and CC. Moreover, removing the enhancement from our proposed model (OurWE) affects its performance. From Tables 1 and 2, we can see that the contribution of stereo focus varies. In Table 1, stereo focus makes a more important contribution than stereo contrast because the objects in the stereoscopic images from the database in [17] lie in different focus regions, so stereo focus works more effectively. In Table 2, the contribution of stereo focus is smaller than that of stereo contrast because the content of the database in [18] is more sensitive to color/depth contrast. Thus, to deal with these different types of stereoscopic images, we designed our model based on both stereo focus and stereo contrast. Figure 8 shows examples of the proposed visual saliency prediction. We notice that the small cap is not detected as a salient region in the stereo focus model.
The stereo focus is related to the monocular focus and the comfort value. In this case, the zero disparity plane is at the big cap according to our comfort value. The monocular focus model detects the big cap as the focus region, and the small cap is outside the focus region. Therefore, in the monocular focus model, the big cap is the salient region and the small cap is not. Even if we increase the weight of the comfort value (because the small cap is near the zero disparity plane and it pops out), the small cap is not detected as the salient region by the proposed stereo focus model. In the stereo contrast model, the small cap is detected as the salient region because of the pop-out effect. Although the conflict between stereo focus and stereo contrast still exists, our proposed model obtains an acceptable result that benefits from both the stereo focus and stereo contrast models. This case shows that the stereo focus model may not work for objects with negative disparity. To improve the performance of the proposed model, it is therefore necessary to take the stereo contrast model into consideration.

Comparison of our proposed method with other methods
First, we compared the proposed model with other state-of-the-art methods [17]. We compared it with 2D saliency methods, mixed models, and stereoscopic 3D saliency models. The 2D saliency methods include IT [41], AIM [42], SR [43], and GBVS [44] (denoted as 2D model in Table 3). A mixed model combines these 2D models with the depth saliency models proposed by [14] (denoted as 2D × depth (Chamaret)) and [17] (which has two models, denoted as 2D + depth contrast and 2D + DSM). Model1, Model2, and Model3 were proposed by [17] and were computed by combining the depth saliency model with three 2D saliency models. We used a Bayesian integration [36] to process the 2D model and depth contrast saliency. For a fair comparison, we added center bias to the results of the Bayesian integration. 2D + DSM considers center-surround mechanisms. We then compared our proposed model with the stereoscopic 3D saliency model proposed by [45]. We should note that the stereo model in [45] already takes the center bias into consideration. From Table 3, we can see that using the depth information as a weighted value (2D × depth (Chamaret)) does not significantly improve the performance in AUC and CC. Directly using depth information as a weighted value for stereo saliency analysis does not achieve a good result because the method does not consider the actual characteristics of the depth information. By contrast, the performance of the 2D + DSM and 2D + depth contrast methods is better than that of 2D × depth (Chamaret), precisely because both consider the characteristics of the depth information. Bayesian integration and center bias increase the performance compared with the 2D + depth contrast methods. The performance of our proposed framework is the best of all the methods. Figure 9 gives an example of the proposed visual saliency prediction.
Second, we used the published eye-tracking dataset in [18], with 600 3D images including outdoor and indoor scenes, to evaluate performance. We used the 3D fixation maps as the ground-truth maps. Because we could not find the code of the DSM in [18], we could only compare our results with the best methods listed in the original paper. The comparative model is DSM, and the 2D saliency models are IT [41], AIM [42], FT [46], GBVS [44], ICL [47], LSK [48], and LRR [49]. To compare the results of these models, we quantitatively evaluated their performance on this database using AUC and CC [50]. The experimental results are shown in Table 4. Note that the AUC and CC values of the other existing models were taken from the original paper [18]. From this table, we see that the performance of our proposed model is the best of the 15 stereo visual saliency prediction models. Here, we notice that our proposed model does only slightly better than GBVS × DSM. The reason for this is that the pop-out effect and comfort zone sometimes fail because the salient region may be located in the background or near the background. Therefore, although the results of our proposed model are better than those of the other existing models, they are not much better than those of GBVS × DSM.
[Figure caption fragment: 9 Stereo comfort zone based on human stereo vision.]
[Table note: DSM represents the depth saliency map in [17].]
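The combined baselines in this comparison pair a 2D saliency map with a depth saliency map, using "+" for simple summation and "×" for point-wise multiplication. A minimal sketch of the two rules on flattened maps follows. The function names are hypothetical, and the final rescaling to [0, 1] is an assumption added so that the outputs are comparable; it is not stated for the combinations in [18].

```python
def normalize(values):
    """Min-max scale a flattened map to [0, 1]; a constant map becomes all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combine_sum(saliency_2d, depth_saliency):
    """'+' combination: simple point-wise summation of the two maps."""
    return normalize([a + b for a, b in zip(saliency_2d, depth_saliency)])


def combine_product(saliency_2d, depth_saliency):
    """'x' combination: point-wise multiplication of the two maps."""
    return normalize([a * b for a, b in zip(saliency_2d, depth_saliency)])
```

Multiplication keeps only locations that both maps consider salient, whereas summation keeps locations that either map considers salient, which is one way to read the different behaviour of the two families of combined models.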

Conclusions
In this paper, we exploit two characteristics of stereoscopic vision and propose a stereo visual saliency prediction model based on stereo contrast and stereo focus. Stereo contrast, a product of color and depth contrast and the pop-out effect, describes the contrast in objects. Stereo focus is based on the focus mechanism of human stereo vision and describes the region of human focus. For each value of the two models, we individually enhanced the important region to make it more distinctive. The two values were individually converted into two saliency maps using multi-scale fusion. Lastly, both saliency maps were integrated using Bayesian integration. Experimental results show that our proposed model can process stereoscopic images from different stereoscopic capture devices and achieves the best performance on two eye-tracking databases compared with existing methods. Although the proposed model performs well, it still has some limitations. The main one is that, in some cases, the pop-out effect and comfort zone may fail in stereoscopic saliency analysis. For example, if the salient region is located near the background, the performance of our model decreases, because this case violates our assumption that the salient region should be located in the comfort zone or exhibit the pop-out effect. In the future, we will exploit more mechanisms of the HVS for saliency analysis. We will investigate how to deal with the conflict between the pop-out effect and the comfort zone, and how to improve the accuracy of the salient region when the pop-out effect and comfort zone do not work well. Additionally, we will exploit more features (such as texture contrast, luminance contrast, the property of divergence, and different monocular focus approaches) to improve our proposed model in different color spaces.
[Table note: "+" means the combination by simple summation as in the study in [18]. "×" means the combination by point-wise multiplication [18]. DSM represents the depth saliency map in [18].]