A unified framework for spatiotemporal salient region detection

Abstract

This article presents a new bottom-up framework for spatiotemporal salient region detection whose saliency maps uniformly highlight the salient regions. In the proposed framework, the spatial and the temporal visual saliency are first computed separately and then fused with a dynamic scheme to generate the final spatiotemporal saliency map. In the spatial attention model, joint embedding of spatial and color cues is adopted to compute the spatial saliency map. In the temporal attention model, we propose a novel histogram of average optical flow to measure the motion contrast of different pixels; it suppresses motion noise effectively because the statistical distribution of optical flow in a patch is comparatively stable. Furthermore, we combine the spatial and the temporal saliency maps through an adaptive fusion method, in which a novel motion entropy is proposed to evaluate the motion contrast of the input video. Extensive experiments demonstrate that our method obtains higher-quality saliency maps than state-of-the-art methods.

1 Introduction

The human visual system has an excellent ability to quickly catch salient information in complex scenes. The mechanism in the brain that determines which part of the visual data is currently of the most interest is called selective attention [1]. This mechanism is critical for humans to understand scenes. In recent years, many computational models have been proposed to mimic the mechanism of selective visual attention. These models compute saliency maps from image or video inputs, where pixels with higher intensity values indicate visually more important locations. The saliency map can be used for applications such as object-of-attention segmentation [2–4], object detection [5, 6], image and video summarization [7], video surveillance [8], and image and video compression [9].

Existing saliency detection methods can roughly be categorized into local and global methods. Local contrast-based methods compute the saliency of an image region based on its local neighborhood [10–13]. Global contrast-based methods estimate the saliency of an image region by taking its contrast relations to the entire image into account [14–19]. Whereas local methods tend to highlight only the object boundary instead of the entire object, global methods can generate full-resolution saliency maps with uniformly highlighted regions. However, most existing models consider only simple features, such as flicker, when calculating motion saliency, which limits them to scenes with a static background. If the salient object and the background change simultaneously, the quality of the estimated saliency map degrades rapidly.

In this article, we propose a novel unified framework which can flexibly detect salient regions in both images and videos. We follow the global definition of visual saliency and thus assign more visual saliency to features that occur less frequently. Our approach, which extends our previous work [20, 21], computes the spatiotemporal saliency map with a global scheme in both the spatial and the temporal domain. Given a video, we first compute the spatial saliency map through joint embedding of spatial and color cues. For the temporal saliency, we compute the global motion contrast of dense optical flow. To suppress motion noise, a new histogram of average optical flow (HOAOF) is proposed to compute the motion contrast of different pixels. Finally, a novel adaptive fusion technique is proposed to combine the spatial and the temporal saliency maps.

The main contributions of the article are as follows:

  1. (1)

    We propose a powerful unified framework for spatiotemporal salient region detection, which obtains higher-quality salient region detection results than existing methods regardless of whether the camera is fixed or moving.

  2. (2)

    A new HOAOF is proposed to compute the motion contrast at the pixel level. The descriptor can suppress motion noise effectively because the statistical distribution of optical flow in a patch is comparatively stable.

  3. (3)

    We propose a novel adaptive fusion scheme to combine the spatial and the temporal saliency maps. The motion contrast of a video sequence is measured by its motion entropy. If a video has strong motion contrast, its motion entropy is small, and correspondingly the temporal attention model is assigned a high weight, and vice versa.

The remainder of the article is organized as follows. Section 2 discusses related work. Section 3 presents the proposed spatiotemporal salient region detection method. Section 4 provides experimental results and comparisons with other methods. Section 5 first discusses the connections between our approach and related methods and then analyzes the limitations of the proposed method. Finally, Section 6 concludes the article.

2 Related work

Visual attention is determined by two categories of factors: bottom-up and top-down. Bottom-up attention seeks the “visual pop-out” saliency, where the salient signals are driven solely by the visual scene. In contrast, top-down attention considers both cognitive factors and high-level stimuli (e.g., faces [22], persons and cars [23]). Such approaches assume that latent correlations exist between visual attributes and saliency values and aim to mine these correlations from training data [24]. Since data-driven stimuli are easier to control than cognitive factors, and the exact interaction between the bottom-up and top-down processes remains elusive [25], bottom-up attention mechanisms have been investigated more than top-down mechanisms.

The bottom-up models driven by low-level features can be classified into two schemes: local and global. The local contrast-based methods determine the saliency of a feature from its local neighborhood. In the well-known bottom-up attention model of Itti et al. [10], three basic low-level features (i.e., color, intensity, and orientation) are used to generate three conspicuity maps by computing center-surround contrast, and the conspicuity maps are then combined into a single saliency map. Building on [10], Itti et al. [12] extend the saliency detection model from static scenes to dynamic video clips by introducing two simple features: flicker and motion. The flicker is computed from the absolute difference between the luminance of consecutive frames, and the motion is computed from spatially shifted differences between Gabor pyramids of consecutive frames. Kim et al. [13] propose a spatiotemporal salient region detection method that combines the spatial and the temporal saliency with fixed weights. For the temporal saliency, the authors simply compute the sum of absolute differences between the temporal gradients of the center and the surrounding regions.

A common limitation of the local scheme is that, when only one scale is considered, the generated saliency map produces high saliency values at the object boundary instead of over the entire salient object. In general, a multi-scale fusion scheme is used to alleviate this boundary-emphasizing effect. These methods can obtain high-quality motion saliency maps when the camera is static, but once the camera moves, the quality of the produced motion saliency map degrades rapidly. To overcome this problem, Le Meur et al. [26] apply motion contrast to compute the temporal saliency, and You et al. [27] estimate the global motion to compensate for the camera’s motion and determine the video attention regions.

The global contrast-based methods integrate feature information over the entire visual field. The approaches in [14, 17] calculate the saliency map from the Fourier frequency spectrum. In [17], the difference between the original signal and its smoothed version in the log amplitude spectrum is calculated, and the saliency map is obtained by transforming this difference back to the spatial domain. Guo and Zhang [14] use the image’s phase spectrum of the Fourier transform instead of the amplitude spectrum to calculate the saliency map. Furthermore, the phase spectrum of the quaternion Fourier transform (PQFT) is applied to detect spatiotemporal saliency in dynamic scenes. In Guo and Zhang’s model, intensity, color, and motion features are combined into a quaternion image, each as an individual channel, before taking the phase spectrum, so these features contribute equally to the final saliency map. However, psychological studies reveal that motion contrast usually attracts more human attention than other external signals [28].

3 The proposed method

In this section, we give a detailed description of our spatiotemporal salient region detection approach. We introduce the spatial attention model in Section 3.1. Then, we show how to calculate temporal saliency in Section 3.2. In Section 3.3, we show how to fuse spatial saliency map and temporal saliency map adaptively.

3.1 Spatial attention model

The spatial saliency map is computed from joint embedding of spatial and color cues. Three factors are considered to compute individual saliency maps, and the final spatial saliency map is generated by combining these maps in a two-layer saliency structure. Please refer to [20, 21] for more details about the spatial attention model.

3.1.1 Spatial constraint

The first factor is spatial constraint (SC). It is based on the observation that a pixel is salient when the adjacent pixels have strong contrast with respect to it, while a pixel is less salient when the strong contrast pixels are far away from it. Moreover, according to the “center-surround” inhibition mechanism, the surrounding pixels should make a greater contribution when calculating global contrast-based saliency. The SC-based saliency of a pixel can be formulated as

$$\mathrm{Sal}_{\mathrm{SC}}(p)=\sum_{q\in I}\alpha_{p,q}\,\bigl\|I_p-I_q\bigr\|,$$
(1)

where I_p is the CIELAB color value of pixel p and ‖I_p − I_q‖ is the Euclidean distance between I_p and I_q. The SC factor α_{p,q} is defined as

$$\alpha_{p,q}=\frac{1}{Z}\exp\!\left(-\frac{\|p-q\|^{2}}{\Pi_1^{2}}\right)\exp\!\left(\frac{\Sigma(q)}{\Pi_2^{2}}\right),$$
(2)

where Z denotes the normalization factor, ‖p − q‖ is the spatial distance between pixels p and q, and Σ(q) is the sum of the distances from q to all other pixels. The surround pixels have larger Σ(q) than the center ones. The parameters Π_1² and Π_2² are set to 300 and 0.06, respectively, in our experiments. We use fixed parameters for all datasets in order to perform a fair comparison; the same principle is employed for all of the parameters discussed in the following sections. The saliency map obtained for Figure 1a using the SC saliency is shown in Figure 1b.
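For concreteness, the following Python sketch shows one way to evaluate Eqs. (1) and (2) on a small, heavily downsampled image. This is not the authors' implementation: the CIELAB conversion via scikit-image, the normalization of Σ(q) to [0, 1], and the exact signs inside the exponentials are our assumptions based on the description above.

```python
import numpy as np
from scipy.spatial.distance import cdist
from skimage import color  # assumed dependency for the RGB -> CIELAB conversion


def sc_saliency(rgb, pi1_sq=300.0, pi2_sq=0.06):
    """Sketch of Eqs. (1)-(2): spatial-constraint saliency of every pixel.

    `rgb` is an H x W x 3 image; keep it small (e.g. 64 x 64), since the
    computation is O(N^2) in the number of pixels N.
    """
    lab = color.rgb2lab(rgb)                     # CIELAB values I_p
    h, w, _ = lab.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pos = np.column_stack([ys.ravel(), xs.ravel()]).astype(float)
    feat = lab.reshape(-1, 3)

    d_pos = cdist(pos, pos)                      # ||p - q||
    d_col = cdist(feat, feat)                    # ||I_p - I_q|| in CIELAB space

    sigma = d_pos.sum(axis=1)                    # Sigma(q): sum of distances to all pixels
    sigma = sigma / sigma.max()                  # assumed normalisation to [0, 1]

    # Eq. (2): spatial Gaussian times the surround term; signs follow our reading of the text.
    alpha = np.exp(-d_pos ** 2 / pi1_sq) * np.exp(sigma[None, :] / pi2_sq)
    alpha /= alpha.sum(axis=1, keepdims=True)    # 1/Z normalisation per pixel p

    sal = (alpha * d_col).sum(axis=1)            # Eq. (1)
    return ((sal - sal.min()) / (sal.max() - sal.min() + 1e-12)).reshape(h, w)
```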

Figure 1

Example of image saliency detection. (a) Original image. (b) SC saliency map. (c) CD saliency map. (d) SD saliency map. (e) The final saliency map obtained through the pooling mechanism of [10]. (f) The final saliency map synthesized through the two-layer structure.

3.1.2 Color double-opponent

The second factor is color double-opponent (CD), which corresponds to the color channel representation in the cortex. Physiological studies show that the red–green (RG) and blue–yellow (BY) contrasts have a major impact on human attention [10]. We use G_RG(p) and G_BY(p) to represent the global contrasts of RG and BY, e.g., G_RG(p) = (1/N) Σ_{q∈I} |RG(p) − RG(q)|, supposing the image has N pixels. The CD-based saliency of a pixel p is then expressed as

$$\mathrm{Sal}_{\mathrm{CD}}(p)=\frac{G_{\mathrm{RG}}(p)+G_{\mathrm{BY}}(p)}{\beta(p)},$$
(3)

where the normalization factor β(p) = max_{q∈I} {|RG(p) − RG(q)|, |BY(p) − BY(q)|}. The saliency map obtained for Figure 1a using the CD saliency is shown in Figure 1c.
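A corresponding sketch for Eq. (3) is given below. The paper does not spell out how the RG and BY opponent channels are built, so the Itti-style channel definitions used here are an assumption.

```python
import numpy as np


def cd_saliency(rgb):
    """Sketch of Eq. (3): color double-opponent saliency (O(N^2), use a small image)."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    rg = (r - g).ravel()                 # red-green opponent channel (assumed definition)
    by = (b - (r + g) / 2.0).ravel()     # blue-yellow opponent channel (assumed definition)

    d_rg = np.abs(rg[:, None] - rg[None, :])     # |RG(p) - RG(q)| for all pixel pairs
    d_by = np.abs(by[:, None] - by[None, :])     # |BY(p) - BY(q)| for all pixel pairs

    g_rg = d_rg.mean(axis=1)                     # G_RG(p) = (1/N) sum_q |RG(p) - RG(q)|
    g_by = d_by.mean(axis=1)                     # G_BY(p)
    beta = np.maximum(d_rg.max(axis=1), d_by.max(axis=1)) + 1e-12   # beta(p)

    sal = (g_rg + g_by) / beta
    return ((sal - sal.min()) / (sal.max() - sal.min() + 1e-12)).reshape(rgb.shape[:2])
```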

3.1.3 Similarity distribution

The third factor is similarity distribution (SD). In general, the background is distributed over the entire image and exhibits a high spatial variance, whereas foreground objects are more compact. Based on this observation, the SD-based saliency for a pixel p is defined as

$$\mathrm{Sal}_{\mathrm{SD}}(p)=\exp\!\left(-\frac{\pi(p)}{\Pi_3^{2}}\right),$$
(4)

where the parameter Π_3² is set to 0.2 and π(p) is the SD, defined as

$$\pi(p)=\frac{1}{N}\sum_{q\in I}\frac{1}{Z}\,\gamma_{p,q}\,\|p-q\|^{2},$$
(5)

where Z denotes the normalization factor and γ_{p,q} ∈ [0, 1] measures the similarity between the two pixels [20].

For a pixel p inside an object, π(p) can be approximated by the sum of distances to the other pixels of the same object, which is small. Consequently, pixels belonging to the same compact salient object are assigned large SD saliency values, and vice versa. The saliency map obtained for Figure 1a using the SD saliency is shown in Figure 1d.
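The SD component of Eqs. (4) and (5) can be sketched in the same style. Since γ_{p,q} is only defined in [20], a Gaussian colour similarity is assumed here, and the spatial distances are rescaled to [0, 1] so that Π_3² = 0.2 stays on a comparable scale; both choices are ours, not the authors'.

```python
import numpy as np
from scipy.spatial.distance import cdist


def sd_saliency(lab, pi3_sq=0.2, sigma_c=20.0):
    """Sketch of Eqs. (4)-(5): similarity-distribution saliency on a CIELAB image."""
    h, w, _ = lab.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pos = np.column_stack([ys.ravel(), xs.ravel()]).astype(float)
    feat = lab.reshape(-1, 3)

    d_pos_sq = cdist(pos, pos, 'sqeuclidean')            # ||p - q||^2
    d_pos_sq /= d_pos_sq.max()                           # assumed rescaling to [0, 1]
    d_col_sq = cdist(feat, feat, 'sqeuclidean')

    gamma = np.exp(-d_col_sq / (2.0 * sigma_c ** 2))     # assumed similarity gamma in [0, 1]
    gamma /= gamma.sum(axis=1, keepdims=True)            # 1/Z normalisation

    pi_p = (gamma * d_pos_sq).sum(axis=1)                # Eq. (5), up to a constant factor
    sal = np.exp(-pi_p / pi3_sq)                         # Eq. (4)
    return sal.reshape(h, w)
```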

3.1.4 Two-layer fusion scheme

After the three saliency components are computed, the final saliency can be constructed from two layers [29], i.e., a basic layer and an enhancement layer, which are defined as follows:

  1. (1)

    The SC saliency is employed as the basic layer.

  2. (2)

    The enhancement layer is designed based on the CD and SD saliency.

According to the two-layer fusion scheme [29], we can obtain the final saliency map

$$S_{S}(p)=\mathrm{Sal}_{\mathrm{SC}}(p)\bigl(1+w_{1}\,\mathrm{Sal}_{\mathrm{CD}}(p)+w_{2}\,\mathrm{Sal}_{\mathrm{SD}}(p)\bigr),$$
(6)

where the weight factors w_1 and w_2 regulate the relative importance of the CD and the SD saliency. In our experiments, we set w_1 = w_2 = 1.

In the two-layer saliency fusion scheme, the basic layer (SC) always contributes, whether the CD or the SD saliency is high or low, while the enhancement layer (CD and SD) strengthens the saliency where the CD contrast is strong or the SD is compact. As shown in Figure 1e,f, the saliency map constructed from the two-layer structure highlights the two salient objects (i.e., the pastry and the plate) more uniformly than the pooling mechanism used in [10].
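Putting the three components together, Eq. (6) reduces to a single elementwise expression. The snippet below is only a usage sketch and assumes the hypothetical sc_saliency, cd_saliency, and sd_saliency helpers sketched above.

```python
import numpy as np
from skimage import color

rgb_small = np.random.rand(48, 64, 3)            # stand-in for a downsampled frame
sal_sc = sc_saliency(rgb_small)
sal_cd = cd_saliency(rgb_small)
sal_sd = sd_saliency(color.rgb2lab(rgb_small))

w1 = w2 = 1.0                                    # weights used in the paper
sal_spatial = sal_sc * (1.0 + w1 * sal_cd + w2 * sal_sd)   # Eq. (6)
sal_spatial /= sal_spatial.max() + 1e-12         # rescale to [0, 1] for the later fusion
```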

3.2 Temporal attention model

In temporal attention models, temporal saliency maps are often calculated from the temporal gradient, i.e., the intensity difference between successive frames. Such models work well when the camera is static, but once the camera moves, the estimated saliency maps contain considerable noise. In this study, we find that an object exhibiting high motion saliency in a video sequence usually has the following properties: (a) there are clear motion patterns in the scene; (b) the motion of the object differs from the global motion of the scene; and (c) compared with the size of the scene, the object is relatively small.

Based on these observations, we define the saliency of a pixel as its motion contrast to all the other pixels in the frame. In this study, we use the dense optical flow^a of [30], which is a modified version of [31, 32], to compute the motion field because of its low computational complexity. The pixel's saliency can be formulated as

$$S_{T}(p)=\sum_{q\in I}\bigl|D(V_{p},V_{q})\bigr|,$$
(7)

where V_p and V_q are the optical flow vectors of pixels p and q in frame I, respectively, D(V_p, V_q) is the vector difference between them, and |·| denotes the vector magnitude.

Due to changing illumination conditions or fixed camera noise, there is considerable noise in the estimated optical flow. Moreover, if multiple motion layers exist in the scene, inaccurate flow estimates may occur at the edge pixels between different motion regions. If we use formula (7) to compute the saliency map directly, much noise is generated; some examples are presented in the third column of Figure 2. In contrast, the statistical distribution of optical flow in a patch is comparatively stable. To suppress the background noise, we propose a novel HOAOF to measure the motion contrast of the different pixels. Specifically, the optical flow is first computed for every frame of the video sequence, and a smoothing procedure is applied to it. Then, the histogram of the optical flow belonging to the local patch centered at the p-th pixel is generated. Flow orientations are quantized into N levels according to their primary angle from the horizontal axis and weighted by their magnitude. The HOAOF can be defined as follows:

$$H_{p}=(h_{p,1},h_{p,2},\ldots,h_{p,N})\quad\text{with}\quad h_{p,n}=\sum_{(x,y)\in w_{p}}\bigl[\theta(x,y)=n\bigr]\,m(x,y),$$
(8)
Figure 2

Examples of motion saliency detection. (a) A pedestrian under a static camera attracts attention. (b) The camera is tracking a walking person, who attracts more attention. First column: sample frames of the two videos. Second column: the corresponding optical flow. Third column: motion saliency maps calculated with optical flow directly. Fourth column: motion saliency maps calculated with HOAOF.

where m(x,y) and θ(x,y) denote the flow magnitude and the quantized orientation at pixel position (x,y) of a frame, respectively, and w_p is a local patch centered at the p-th pixel. The number of bins N is set to 4 and w_p is set to 7×7 pixels in our experiments. Figure 3 illustrates the procedure.
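A sketch of the HOAOF construction of Eq. (8) is shown below. OpenCV's Farneback flow and a Gaussian blur are used as stand-ins for the dense flow of [30] and the smoothing step mentioned above (both assumptions), and an unnormalised box filter accumulates each orientation bin over the 7 × 7 patch.

```python
import cv2
import numpy as np


def hoaof(prev_gray, curr_gray, n_bins=4, patch=7):
    """Sketch of Eq. (8): per-pixel histogram of average optical flow.

    Farneback flow is a stand-in for the dense flow of [30]; the histogram of the
    patch w_p around every pixel is obtained with an unnormalised box filter.
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    flow = cv2.GaussianBlur(flow, (5, 5), 0)                 # smoothing of the flow field
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])   # magnitude and angle (radians)

    bins = (np.floor(ang / (2 * np.pi) * n_bins).astype(int)) % n_bins  # quantised orientation
    h, w = mag.shape
    hist = np.zeros((h, w, n_bins), np.float32)
    for n in range(n_bins):
        channel = np.where(bins == n, mag, 0.0).astype(np.float32)
        # h_{p,n}: sum of magnitudes of orientation bin n inside the patch centred at p
        hist[..., n] = cv2.boxFilter(channel, -1, (patch, patch), normalize=False)
    return hist
```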

Figure 3

Histogram formation with four bins (N = 4). The optical flow is quantized into one of four cardinal directions (up, down, left, and right).

Thus, formula (7) can be rewritten as

$$S_{T}(p)=\sum_{q\in I}D(H_{p},H_{q}),$$
(9)

where H_p and H_q represent the HOAOFs of the local patches centered at pixels p and q in frame I, respectively, and D(H_p, H_q) is the χ² distance between the two histograms:

$$D(H_{p},H_{q})=\frac{1}{2}\sum_{k}\frac{\bigl(H_{p}(k)-H_{q}(k)\bigr)^{2}}{H_{p}(k)+H_{q}(k)}.$$
(10)

Finally, the temporal saliency map is normalized to a fixed range [0,1].
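Given the per-pixel histograms, Eqs. (9) and (10) reduce to a χ² comparison of every pixel against the rest of the frame. In the sketch below the sum over q is approximated by a random subsample of pixels to keep memory bounded; this subsampling is our simplification, not part of the paper.

```python
import numpy as np


def temporal_saliency(hist, n_samples=500, chunk=4096, seed=0):
    """Sketch of Eqs. (9)-(10): motion contrast from HOAOF histograms.

    `hist` is the H x W x n_bins output of hoaof(); the contrast of each pixel is
    accumulated against a random subset of pixels (assumption) instead of all q.
    """
    rng = np.random.default_rng(seed)
    h, w, n_bins = hist.shape
    hp = hist.reshape(-1, n_bins).astype(np.float64)
    idx = rng.choice(hp.shape[0], size=min(n_samples, hp.shape[0]), replace=False)
    hq = hp[idx]                                             # sampled H_q

    sal = np.zeros(hp.shape[0])
    for start in range(0, hp.shape[0], chunk):               # chunk over p to bound memory
        block = hp[start:start + chunk]                      # (B, n_bins)
        num = (block[:, None, :] - hq[None, :, :]) ** 2      # (H_p(k) - H_q(k))^2
        den = block[:, None, :] + hq[None, :, :] + 1e-12
        sal[start:start + chunk] = 0.5 * (num / den).sum(axis=(1, 2))   # Eqs. (9)-(10)

    sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)            # normalise to [0, 1]
    return sal.reshape(h, w)
```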

An example is shown in Figure 2. It is clear that the temporal attention model can suppress the background noise effectively. In Figure 2a, the camera is fixed and the global motion is nearly static; compared with the background, the moving pedestrian produces a highly salient region in the frame. In Figure 2b, the camera tracks a pedestrian, so the person has small optical flow while the background has large motion; in this case, the direction of the global motion is opposite to that of the person. A clear saliency map can still be obtained by our model. If a dynamic scene has strong motion contrast, the main motion is gathered in a few directions and the moving object pops out clearly.

3.3 Adaptive fusion

We have obtained the spatial and the temporal saliency maps separately, and the two maps need to be fused in a meaningful way to generate the final spatiotemporal saliency map. It is shown in [28] that the human visual system is more sensitive to motion information than to static signals. Consider a dynamic scene in which the camera tracks a pedestrian, so that the background moves in the direction opposite to the camera's movement; in general, people are more interested in the tracked person than in the surrounding regions. Similarly, in surveillance video the camera is fixed, and the moving objects attract more human attention than the static background. In these examples, motion contrast is the prominent feature for saliency detection compared with other features such as intensity, texture, and color. In contrast, if the motion in the video is cluttered or the motion contrast is insignificant, human attention is attracted more by the contrasts caused by static visual stimuli. Thus, a simple linear combination of the spatial and temporal saliency maps with fixed weights may lead to unsatisfactory results. Instead, we adopt an adaptive fusion scheme consistent with these considerations: it gives a higher weight to the temporal saliency map when strong motion contrast is present in the dynamic scene, and a higher weight to the spatial saliency map when the motion contrast is weak.

In this article, motion entropy is proposed to evaluate how strong the motion contrast of the video sequence is. First, the HOAOF of the whole frame is calculated. Second, from this HOAOF, the motion entropy is computed as

$$E=-\sum_{i=0}^{L}h_{i}\log h_{i},$$
(11)

where L is the number of bins and h_i is the value of the i-th bin of the HOAOF. The more cluttered the distribution of motion directions in a video frame is, the larger the entropy, and vice versa.

It is important to note the differences from the aforementioned HOAOF. First, we use one additional bin with i = 0 that incorporates all pixels whose flow magnitude is lower than a preset threshold. For instance, in surveillance video there is considerable motion noise in the static background; to weaken the effect of this flow noise on the motion entropy computation, we collect such flow into an individual bin. Second, the number of bins L is set to 16, since a relatively fine quantization helps the estimated entropy reflect the motion distribution correctly. Two examples of spatiotemporal saliency detection are shown in Figures 4 and 5. The first columns show sample frames of two different videos. In the first video, the moving car and person have strong relative motion with respect to the static background, so a higher weight is assigned to the temporal attention model. On the contrary, the second video has cluttered background motion because the grass and branches move irregularly; in this case, attention is attracted more by static visual stimuli (e.g., color) than by motion, and our algorithm allocates a high weight to the spatial saliency map. Thanks to the adaptive fusion scheme, the fused spatiotemporal saliency maps successfully detect the pedestrian, the car, and the bird as salient regions.
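The frame-level motion entropy of Eq. (11) can be sketched as follows. The magnitude threshold for the extra zero bin is not given in the paper, so the value used here is an assumption, and the histogram is normalized to a probability distribution before the entropy is taken.

```python
import numpy as np


def motion_entropy(mag, ang, n_bins=16, mag_thresh=0.5):
    """Sketch of Eq. (11): entropy of the frame-level HOAOF.

    Bin 0 absorbs near-static pixels (magnitude below mag_thresh, an assumed value);
    the remaining 16 bins quantise the flow orientation, weighted by magnitude.
    """
    moving = mag >= mag_thresh
    ori = (np.floor(ang / (2 * np.pi) * n_bins).astype(int)) % n_bins

    hist = np.zeros(n_bins + 1)
    hist[0] = np.count_nonzero(~moving)          # noise bin: one plausible reading of the text
    for n in range(n_bins):
        hist[n + 1] = mag[moving & (ori == n)].sum()

    p = hist / (hist.sum() + 1e-12)              # treat the bins as a distribution
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())         # E = -sum_i h_i log h_i
```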

Figure 4

Example saliency detection in Video Set 1. Column (a) shows two example frames of Video Set 1 (PETS2001); yellow boxes represent the moving objects. Column (b) shows the spatial saliency maps, Column (c) the temporal saliency maps, and Column (d) the fused spatiotemporal saliency maps.

Figure 5

Example saliency detection in Video Set 2. Column (a) shows two example frames of the bird sequence. Column (b) shows the spatial saliency maps, Column (c) the temporal saliency maps, and Column (d) the fused spatiotemporal saliency maps.

4 Experimental results

In this section, we first introduce the datasets used for performance evaluation. Then, we compare the proposed method with three state-of-the-art methods [11, 13, 14] and provide both qualitative and quantitative results.

4.1 Video sequences datasets

The performance of the proposed algorithm is evaluated extensively on two types of videos, referred to as Video Set 1 and Video Set 2. Video Set 1 contains surveillance videos collected from PETS2001,^b with 6,000+ images in total. In this dataset, the camera is fixed and the background is still, so people's attention is mainly attracted by the moving objects [28], such as the pedestrians and the moving cars. Example frames are shown in Figure 4. Since the moving objects are small, we use their bounding boxes as the ground truth. Video Set 2 consists of 60 video clips collected from the Internet and the video segmentation datasets [33]. Each clip contains about 60–200 frames with the same salient objects, again 6,000+ images in total. Different from Video Set 1, the camera in this dataset is moving, or the background exhibits cluttered motion even when the camera is still; that is, both the objects and the background of the scenes are moving. Since the salient objects in Video Set 2 are large, the annotated ground-truth masks are object-contour based.

4.2 Performance evaluation

In Figure 4a, we show representative frames of Video Set 1, together with the saliency detection results of the proposed method at different stages. Figure 4b shows the computed spatial saliency map, Figure 4c the temporal saliency map, and Figure 4d the fused spatiotemporal saliency map. It can be seen that the spatial saliency map does not highlight the salient objects successfully, mainly because the scene has a highly textured background and the static features of the small foreground objects are not distinctive. However, compared with the still background, the moving foreground objects have strong motion contrast, which leads to a temporal saliency map that detects the moving salient objects clearly. In our adaptive fusion scheme, this strong motion contrast makes the temporal attention model dominate the final spatiotemporal saliency map; as shown in Figure 4d, the contribution of the spatial saliency map is negligible. Another example, from Video Set 2, is presented in Figure 5. The video records a bird with a fixed camera in the wild, where the branches in the background exhibit cluttered motion. The motion contrast in the scene is weak, so the spatial saliency dominates the temporal saliency in the adaptive fusion scheme. The spatial, temporal, and spatiotemporal saliency maps are shown in Figure 5b–d, respectively.

Recently, two spatiotemporal saliency models were presented in [13, 14]. To justify the effectiveness of the proposed model, we compare it with the PQFT model [14] and Kim et al.'s model [13] in Figures 6 and 7. The pedestrian and the moving car in Figure 6 are captured as salient regions by all models. Compared with PQFT [14], Kim et al.'s method assigns much higher saliency values to the moving objects; nevertheless, the highly textured backgrounds are not suppressed by either model. In contrast, our method not only detects the pedestrian and the car as the most salient regions but also suppresses the background effectively. This is attributed to the adaptive fusion scheme of our framework: the strong motion contrast leads to a dominant contribution of the temporal attention model in the final spatiotemporal saliency map. Figure 7 shows example images from different video segments of Video Set 2 and the saliency maps computed by the different models. The methods in [13, 14] cannot detect the salient region successfully due to the changing background, while our model detects it clearly.

Figure 6

Spatiotemporal saliency in video sequences and salient object extraction. Column (a) shows the representative frames of Video Set 1; yellow boxes represent the ground truth. Columns (b), (d), and (f) show the spatiotemporal saliency maps of PQFT [14], Kim et al. [13], and the proposed method, respectively; Columns (c), (e), and (g) show the objects extracted by the corresponding models; (h) is the binary mask generated by [11].

Figure 7

Spatiotemporal saliency in video sequences and salient object extraction. Column (a) shows the representative frames of Video Set 2; Column (b) shows the pixel-wise ground truth annotation. Columns (c), (e), and (g) show the spatiotemporal saliency maps of PQFT, Kim et al., and the proposed method; Columns (d), (f), and (h) show the objects extracted by the corresponding models; Column (i) shows the binary mask generated by [11].

Finally, the spatiotemporal saliency map of the frame in the video sequence is formulated as follows:

$$S_{ST}(p)=(1-w_{t})\,S_{S}(p)+w_{t}\,S_{T}(p)$$
(12)

with w_t = e^(−αE), where α is a constant factor that adjusts the weight. In our experiments, we set α = 0.15. Our model can also handle static images simply by setting w_t = 0.
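Eq. (12) together with w_t = e^(−αE) amounts to a two-line fusion step. The sketch below assumes the spatial map, temporal map, and motion entropy produced by the sketches in Section 3.

```python
import numpy as np


def fuse(sal_spatial, sal_temporal, entropy, alpha=0.15):
    """Sketch of Eq. (12): entropy-controlled fusion of the two saliency maps."""
    w_t = float(np.exp(-alpha * entropy))   # small entropy (strong motion contrast) -> large w_t
    return (1.0 - w_t) * sal_spatial + w_t * sal_temporal
```

As noted above, passing w_t = 0 (or, equivalently, a very large entropy) recovers the purely spatial model used for static images.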

Furthermore, the proposed model can be employed to extract the salient objects from the video sequences by thresholding the spatiotemporal saliency map with a moderate threshold. To this end, a non-parametric significance test is adopted [34]. We compute the empirical PDF of all the saliency values and set a threshold that achieves a 95% confidence level in deciding whether a given value lies in the extreme right tail of the estimated distribution. In addition to the comparison with the methods in [13, 14], we also compare the proposed method with Liu et al.'s model [11], which is a salient object detection method. In [11], a group of static and dynamic saliency features is computed and the optimal linear weights are learned through CRF learning. Given an image pair, the model outputs a binary label map, which is further transformed into a bounding rectangle representing the salient object. To facilitate comparison, we take the binary label map of [11] as the detection result. In Video Set 1 the background is static, unlike in Video Set 2, and training on the two video sets together can degrade the overall performance of salient object detection, so we train CRFs on the two datasets separately. In Video Set 1, the surveillance video is divided into 60 video segments; we randomly select 40 segments with 2,000+ image pairs to construct a training set and use the others for testing. In Video Set 2, we likewise randomly select 40 video segments with 2,000+ image pairs for training and use the others for testing.
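The extraction step described above boils down to thresholding the saliency map at the 95% point of its empirical distribution. The sketch below uses a simple quantile as a stand-in for the full significance test of [34]; the function name and interface are hypothetical.

```python
import numpy as np


def extract_salient_mask(sal_map, confidence=0.95):
    """Sketch of the extraction step: keep pixels in the extreme right tail
    of the empirical saliency distribution (here, above the 95th percentile)."""
    threshold = np.quantile(sal_map.ravel(), confidence)
    return sal_map >= threshold
```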

The subjective results are shown in Figures 6 and 7. For the quantitative comparison, precision, recall, and F-beta, as defined in [18], are computed by comparing the segmented region with the ground truth. To perform the comparison under the same settings, we follow [13] and calculate the performance indexes from 15 frames taken randomly from every test video segment, and then average them over the test set of each Video Set. Because the scene and motion patterns change little within each short video segment, the variance of the computed performance indexes within each segment is small. The results are shown in Tables 1 and 2. Kim et al.'s model [13] outperforms PQFT [14] in Video Set 1, but it is inferior to Liu et al.'s model [11] and to our method. Compared with Kim et al.'s model, our method yields gains of 3, 13, and 6% in recall, precision, and F-beta, respectively. The salient object detection of [11] is superior to our method in this set, which is mainly attributed to the static background and the single motion pattern, both of which benefit CRF learning. In Video Set 2, our method achieves the best performance. The methods in [13, 14] fail to extract the salient objects mainly because they cannot deal with scenes in which the background moves. Due to the diverse scene and motion patterns, learning optimal linear weights of various saliency features that satisfy all situations is difficult, and the salient object detection performance of [11] in Video Set 2 is not as good as in Video Set 1. Compared with [11], our method yields gains of 8, 20, and 7% in recall, precision, and F-beta, respectively. To further verify whether the differences between these methods are statistically significant, we use approximate randomization [35] for statistical significance testing on F-beta. The test results (Tables 1 and 2) show that our model outperforms [13, 14] in all evaluations with strong statistical significance. In Video Set 2 our model significantly outperforms [11], while [11] significantly outperforms our model in Video Set 1; the main reason is that the background in Video Set 1 is still, which benefits CRF learning.

Table 1 Performance evaluation for salient object extraction in Video Set 1
Table 2 Performance evaluation for salient object extraction in Video Set 2

5 Discussion

In this section, we first discuss the difference between our salient region detection approach and saliency detection models similar to Itti et al.'s model [10]. We then discuss the limitations of the proposed method and analyze its failure cases.

5.1 Salient region versus visual saliency

Salient region detection is different from the visual saliency computation in [10, 36] or other approaches based on biologically plausible computational models of attention. Itti et al.'s model [10] and those similar to it usually focus on mimicking the properties of vision and predicting eye fixations. The resulting saliency maps often overemphasize small, purely local features and fail to detect the internal parts of the target, which makes them less useful for applications such as segmentation and detection. This kind of model is usually evaluated by comparing the saliency map with the real human attention density map. Salient region detection, in contrast, is a computational approach inspired by biological theory but closely tied to typical computer vision applications, such as adaptive content delivery, adaptive region-of-interest-based image compression, salient object segmentation [37], and object recognition. The resulting saliency map can uniformly highlight the entire salient regions in a scene. This kind of model is usually evaluated by comparing the resulting saliency map with a manually labeled binary ground-truth mask, as in [18, 38].

5.2 Limitations

Since the proposed salient region detection method is based on a global scheme, its computational cost is high: for an image with N pixels, the computational complexity is O(N²). In Table 3, we give the average running time of our approach and the other methods on the benchmark videos.

Table 3 Comparison of running times

For the proposed method, the motion saliency is computed based on the assumptions given in Section 3.2. If a video with strong motion contrast does not comply with these assumptions, the computed saliency map will be incorrect. For example, when the salient moving object accounts for a large proportion of the scene, the resulting saliency map will highlight the background instead of the salient object; an example is shown in Figure 8a. In addition, according to the fusion scheme, the final saliency detection result is mainly determined by the static saliency if the scene's motion contrast is low. The static saliency detection method itself cannot produce good results when the scene has a highly textured background; an example is shown in Figure 8b.

Figure 8

Failure cases. Original images are shown in row 1 and the corresponding saliency maps in row 2. One failure case is shown in (a) and another in (b).

6 Conclusion

In this article, we propose a novel spatiotemporal salient region detection framework based on a global scheme. Saliency maps are calculated separately from the static and the motion information of the video. In the spatial attention model, we adopt joint embedding of spatial and color cues, and the pixel-level saliency map is computed from three components: SC, CD, and SD. In the temporal attention model, dense optical flow is used to calculate the global motion contrast of objects in dynamic scenes, and a novel HOAOF is proposed to measure the motion contrast while suppressing the noise produced when estimating the optical flow. To obtain the final spatiotemporal saliency map, an adaptive fusion scheme combines the spatial and the temporal saliency, with the dynamic weights of the two components controlled by the motion entropy of the video frames. Extensive experiments show that the proposed method obtains higher-quality salient region detection results than existing methods regardless of whether the camera is fixed or moving.

Endnotes

a. Code is available from http://people.csail.mit.edu/celiu/OpticalFlow/.

b. http://ftp.pets.rdg.ac.uk/pub/PETS2001.

References

  1. Frintrop S, Rome E, Christensen H: Computational visual attention systems and their cognitive foundations: a survey. ACM Trans. Appl. Perception (TAP) 2010, 7(1):1-39.

  2. Meng F, Li H, Liu G, Ngan K: Object co-segmentation based on shortest path algorithm and saliency model. IEEE Trans. Multimed 2012, 14(5):1429-1441.

  3. Jung C, Kim C: A unified spectral-domain approach for saliency detection and its application to automatic object segmentation. IEEE Trans. Image Process 2012, 21(3):1272-1283.

  4. Li H, Ngan K: Saliency model-based face segmentation and tracking in head-and-shoulder video sequences. J. Visual Commun. Image Represent 2008, 19(5):320-333. 10.1016/j.jvcir.2008.04.001

  5. Gao D, Vasconcelos N: Integrated learning of saliency, complex features, and object detectors from cluttered scenes. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition (CVPR), Vol. 2. Los Alamitos, CA, USA; 2005:282-287.

  6. Li H, Ngan K: A co-saliency model of image pairs. IEEE Trans. Image Process 2011, 20(12):3365-3375.

  7. Cheng W, Wang C, Wu J: Video adaptation for small display based on content recomposition. IEEE Trans. Circuits Syst. Video Technol 2007, 17(1):43-58.

  8. Mahadevan V, Vasconcelos N: Background subtraction in highly dynamic scenes. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition (CVPR). Piscataway, NJ, USA; 2008:1-6.

  9. Liu K: Prediction error preprocessing for perceptual color image compression. EURASIP J. Image Video Process 2012, 2012(1):1-14. 10.1186/1687-5281-2012-1

  10. Itti L, Koch C, Niebur E: A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell 1998, 20(11):1254-1259. 10.1109/34.730558

  11. Liu T, Yuan Z, Sun J, Wang J, Zheng N, Tang X, Shum H: Learning to detect a salient object. IEEE Trans. Pattern Anal. Mach. Intell 2011, 33(2):353-367.

  12. Itti L, Dhavale N, Pighin F: Realistic avatar eye and head animation using a neurobiological model of visual attention. In SPIE. San Diego, CA, USA; 2003:64-78.

  13. Kim W, Jung C, Kim C: Spatiotemporal saliency detection and its applications in static and dynamic scenes. IEEE Trans. Circuits Syst. Video Technol 2011, 21(4):446-456.

  14. Guo C, Zhang L: A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression. IEEE Trans. Image Process 2010, 19(1):185-198.

  15. Luo W, Li H, Liu G, Ngi Ngan K: Global salient information maximization for saliency detection. Signal Process.: Image Commun 2011, 27(3):238-248.

  16. Zhai Y, Shah M: Visual attention detection in video sequences using spatiotemporal cues. In Proceedings of the 14th annual ACM international conference on Multimedia. New York, NY, USA; 2006:815-824.

  17. Hou X, Zhang L: Saliency detection: a spectral residual approach. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition (CVPR). Piscataway, NJ, USA; 2007:1-8.

  18. Achanta R, Hemami S, Estrada F, Susstrunk S: Frequency-tuned salient region detection. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition (CVPR). Piscataway, NJ, USA; 2009:1597-1604.

  19. Li H, Ngan K: Unsupervised video segmentation with low depth of field. IEEE Trans. Circuits Syst. Video Technol 2007, 17(12):1742-1751.

  20. Xu L, Li H, Wang Z: Saliency detection from joint embedding of spatial and color cues. In 2012 IEEE International Symposium on Circuits and Systems (ISCAS). Piscataway, NJ, USA; 2012:2673-2676.

  21. Xu L, Li H, Zeng L, Ngan KN: Saliency detection using joint spatial-color constraint and multi-scale segmentation. J. Visual Commun. Image Represent 2013, 24(4):465-476. 10.1016/j.jvcir.2013.02.007

  22. Cerf M, Harel J, Einhäuser W, Koch C: Predicting human gaze using low-level saliency combined with face detection. In Advances in Neural Information Processing Systems. New York, NY, USA; 2008:241-248.

  23. Judd T, Ehinger K, Durand F, Torralba A: Learning to predict where humans look. In Proceedings of the International Conference on Computer Vision (ICCV). Piscataway, NJ, USA; 2009:2106-2113.

  24. Li J, Xu D, Gao W: Removing label ambiguity in learning-based visual saliency estimation. IEEE Trans. Image Process 2012, 21(4):1513-1525.

  25. Kountchev R, Nakamatsu K: Advances in Reasoning-Based Image Processing Intelligent Systems: Conventional and Intelligent Paradigms. New York: Springer; 2012.

  26. Le Meur O, Thoreau D, Le Callet P, Barba D: A spatio-temporal model of the selective human visual attention. In Proceedings of the International Conference on Image Processing (ICIP). Piscataway, NJ, USA; 2005:1-4.

  27. You J, Liu G, Li H: A novel attention model and its application in video analysis. Appl. Math. Comput 2007, 185(2):963-975. 10.1016/j.amc.2006.07.023

  28. Bur A, Wurtz P, Miiri R, Hugli H: Dynamic visual attention: competitive versus motion priority scheme. In Proceedings of the International Conference on Computer Vision Systems. Bielefeld, Germany; 2007:1-10.

  29. Li H, Xu L, Liu G: Two-layer average-to-peak ratio based saliency detection. Signal Process.: Image Commun 2013, 28(1):55-68. 10.1016/j.image.2012.10.004

  30. Liu C: Beyond pixels: exploring new representations and applications for motion analysis. Ph.D. thesis, Massachusetts Institute of Technology; 2009.

  31. Brox T, Bruhn A, Papenberg N, Weickert J: High accuracy optical flow estimation based on a theory for warping. In Proceedings of the European Conference on Computer Vision (ECCV). Prague, Czech Republic; 2004:25-36.

  32. Bruhn A, Weickert J, Schnörr C: Lucas/Kanade meets Horn/Schunck: combining local and global optical flow methods. Int. J. Comput. Vision 2005, 61(3):211-231.

  33. Fukuchi K, Miyazato K, Kimura A, Takagi S, Yamato J: Saliency-based video segmentation with graph cuts and sequentially updated priors. In Proceedings of the IEEE Conference on Multimedia and Expo (ICME). Piscataway, NJ, USA; 2009:638-641.

  34. Seo H, Milanfar P: Static and space-time visual saliency detection by self-resemblance. J. Vis 2009, 9(12):1-27. 10.1167/9.12.1

  35. Yeh A: More accurate tests for the statistical significance of result differences. In Proceedings of the 18th Conference on Computational Linguistics. Saarbrücken, Germany; 2000:947-953.

  36. Courboulay V, Silva MPD: Real-time computational attention model for dynamic scenes analysis: from implementation to evaluation. In Optics, Photonics, and Digital Technologies for Multimedia Applications. Brussels, France; 2012:1-15.

  37. Li H, Ngan KN, Liu Q: Faceseg: automatic face segmentation for real-time video. IEEE Trans. Multimed 2009, 11(1):77-88.

  38. Cheng M, Zhang G, Mitra N, Huang X, Hu S: Global contrast based salient region detection. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition (CVPR). Piscataway, NJ, USA; 2011:409-416.

Acknowledgements

This study was supported by the National Natural Science Foundation of China (No. 61179060) and by the grants from the Fundamental Research Funds for the Central Universities (No. ZYGX2012J019).

Author information

Corresponding author

Correspondence to Bo Wu.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
