A unified framework for spatiotemporal salient region detection
© Wu et al.; licensee Springer. 2013
Received: 1 June 2012
Accepted: 12 March 2013
Published: 15 April 2013
This article presents a new bottom-up framework for spatiotemporal salient region detection. The generated saliency map can uniformly highlight the salient regions. In the proposed framework, the spatial visual saliency and the temporal visual saliency are first computed separately and then fused with a dynamic scheme to generate the final spatiotemporal saliency map. In the spatial attention model, joint embedding of spatial and color cues is adopted to compute the spatial saliency map. In the temporal attention model, we propose a novel histogram of average optical flow to measure the motion contrast of different pixels. The method can suppress the motion noise efficiently because the statistical distribution of optical flow in a patch is comparatively stable. Furthermore, we combine the spatial and the temporal saliency maps through an adaptive fusion method, in which a novel motion entropy is proposed to evaluate the motion contrast of the input video. Extensive experiments demonstrate that our method obtains higher-quality saliency maps than state-of-the-art methods.
The human visual system has an excellent ability to quickly catch salient information from complex scenes. The mechanism in the brain that determines which part of the visual data is currently of the most interest is called selective attention. This mechanism is critical for humans to understand scenes. In recent years, many computational models have been proposed to mimic the mechanism of selective visual attention. These models compute saliency maps from image or video inputs. Pixels with higher intensity values in the saliency map indicate that the corresponding image pixels are visually important. The saliency map can be used for applications such as object-of-attention segmentation [2–4], object detection [5, 6], image and video summarization, video surveillance, and image and video compression.
The existing methods of saliency detection can roughly be categorized into local and global methods. The local contrast-based methods compute the saliency of a specific image region based on its local neighborhoods [10–13]. The global contrast-based methods estimate the saliency of an image region by taking the contrast relations to the entire image into account [14–19]. Unlike the local methods, which usually highlight the object boundary instead of the entire object, the global methods can generate saliency maps with full resolution and uniformly highlighted regions. However, only simple features, such as flicker, are considered in calculating the motion saliency, which limits the models to static backgrounds. If the salient object and the background change simultaneously, the quality of the estimated saliency map degrades rapidly.
In this article, we propose a novel unified framework which can flexibly detect salient regions in both images and videos. We adopt the definition of visual saliency used in the global approach and therefore assign more visual saliency to the features which are less frequent. Our approach, which extends our previous work [20, 21], computes the spatiotemporal saliency map based on a global scheme in the spatial and temporal domains. Given a video, we first compute the spatial saliency map, which adopts joint embedding of spatial and color cues. As for the temporal saliency, we compute the global motion contrast of dense optical flow. To suppress the motion noise, a new histogram of average optical flow (HOAOF) is proposed to compute the motion contrast of different pixels. Finally, a novel adaptive fusion technique is proposed to combine the spatial and the temporal saliency maps.
We propose a powerful unified framework for spatiotemporal salient region detection, which can obtain higher quality salient region detection results than existing methods no matter whether the camera is fixed or not.
A new HOAOF is proposed to compute the motion contrast at the pixel level. The descriptor can suppress the motion noise effectively because the statistical distribution of optical flow in a patch is comparatively stable.
We propose a novel adaptive fusion scheme to combine the spatial and the temporal saliency maps. The motion contrast of a video sequence is measured by motion entropy. If a video has strong motion contrast, then its motion entropy will be small and the temporal attention model is assigned a high weight. Conversely, a video with weak motion contrast has large motion entropy and the temporal attention model is assigned a low weight.
The remainder of the article is organized as follows. In Section 2, the related work is discussed. The spatiotemporal salient region detection method is proposed and elaborated in Section 3. The experimental results and comparisons with other methods are provided in Section 4. In Section 5, we first discuss the connections between our approach and the related methods. Then, the limitation of the proposed method is also analyzed. Finally, the conclusion is presented in Section 6.
2 Related work
Visual attention can be determined by two categories of factors: bottom-up factors and top-down factors. The idea of bottom-up attention is to seek the “visual pop-out” saliency. The salient signals are driven solely by the visual scene. In contrast, both cognitive factors and high-level stimuli (e.g., face, person, and car) are considered in top-down attention. The top-down approach expects latent correlations between visual attributes and saliency values and aims to mine such correlations from the training data. Since the data-driven stimuli are easier to control than the cognitive factors, and the exact interaction between the bottom-up process and the top-down process still remains elusive, bottom-up attention mechanisms have been investigated more than top-down mechanisms.
The bottom-up models driven by low-level features can be classified into two schemes: local and global. The local contrast-based methods determine whether a feature is salient from its neighborhood. In the well-known bottom-up attention model, three basic low-level features (i.e., color, intensity, and orientation) are used to generate three conspicuity maps by computing the center-surround contrast. Then, the conspicuity maps are combined into a single saliency map. Building on this model, Itti et al. extend the saliency detection model from static scenes to dynamic video clips by introducing two simple features: flicker and motion. The flicker is computed from the absolute difference between the luminance of consecutive frames. The motion is computed from spatially shifted differences between Gabor pyramids from the consecutive frames. Kim et al. propose a novel method for spatiotemporal salient region detection. The approach combines the spatial saliency and the temporal saliency with fixed weights. To calculate the temporal saliency, the authors simply compute the sum of absolute differences between the temporal gradients of the center and the surrounding regions.
A common limitation of the local scheme is that the generated saliency map usually has high saliency values at the object boundary instead of over the entire salient object when only one scale is considered. In general, a multi-scale fusion scheme is used to alleviate this boundary-emphasis effect. These methods can obtain high-quality motion saliency maps when the camera is static. Once the camera is moving, the quality of the produced motion saliency map may degrade rapidly. To overcome the problem, Le Meur et al. apply motion contrast to compute the temporal saliency. You et al. estimate the global motion to compensate for the camera’s motion and determine the video attention regions.
The global contrast-based methods integrate feature information from the entire visual field. The approaches in [14, 17] calculate the saliency map based on the Fourier frequency spectrum. In the spectral residual approach, the difference between the original signal and the smoothed one in the log amplitude spectrum is calculated, and then the saliency map is obtained by transforming the difference back to the spatial domain. Guo and Zhang use the phase spectrum of the image’s Fourier transform instead of the amplitude spectrum to calculate the saliency map. Furthermore, the phase spectrum of quaternion Fourier transform (PQFT) is applied to detect the spatiotemporal saliency in dynamic scenes. In Guo and Zhang’s model, the intensity, color, and motion features are combined into a quaternion image, each as an individual channel, before taking the phase spectrum. These features contribute equally to the final saliency map. However, psychological studies reveal that motion contrast usually attracts more human attention than other external signals.
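As a rough illustration of the Fourier-based global methods described above, the following NumPy sketch computes a spectral-residual style saliency map for a grayscale image. The function names, the 3×3 box filter used for smoothing, and the omission of any final spatial smoothing are assumptions made for brevity; they are not details taken from the cited works.

```python
import numpy as np

def box_filter3(a):
    """3x3 mean filter with edge replication (illustrative smoothing choice)."""
    p = np.pad(a, 1, mode="edge")
    h, w = a.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def spectral_residual_saliency(gray):
    """Saliency from the residual of the log amplitude spectrum."""
    f = np.fft.fft2(gray.astype(np.float64))
    log_amp = np.log(np.abs(f) + 1e-8)          # log amplitude spectrum
    phase = np.angle(f)                         # phase spectrum is kept unchanged
    residual = log_amp - box_filter3(log_amp)   # original minus smoothed spectrum
    # Transform back to the spatial domain with the residual amplitude
    # and the original phase.
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    rng = sal.max() - sal.min()
    # Normalize to [0, 1]; a constant map carries no saliency.
    return (sal - sal.min()) / rng if rng > 0 else np.zeros_like(sal)
```

Replacing `residual` with zeros in the inverse transform would keep only the phase spectrum, which is the idea behind the phase-based variant mentioned in the same paragraph.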
3 The proposed method
In this section, we give a detailed description of our spatiotemporal salient region detection approach. We introduce the spatial attention model in Section 3.1. Then, we show how to calculate temporal saliency in Section 3.2. In Section 3.3, we show how to fuse spatial saliency map and temporal saliency map adaptively.
3.1 Spatial attention model
The spatial saliency map is computed from joint embedding of spatial and color cues. Three factors are considered, and an individual saliency map is computed for each of them. The final spatial saliency map is generated by combining these maps in a two-layer saliency structure. Please refer to [20, 21] for more details about the spatial attention model.
3.1.1 Spatial constraint
3.1.2 Color double-opponent
3.1.3 Similarity distribution
where $Z'$ denotes the normalization factor and $\gamma_{p,q} \in [0,1]$ measures the similarity between two pixels.
For a pixel p inside an object, $\pi(p)$ can be approximated as the sum of distances to the other pixels in the same object, which is small. Pixels that belong to the same salient object are therefore more likely to be assigned large SD saliency values, and vice versa. The saliency map obtained for Figure 1a using the SD saliency is shown in Figure 1d.
3.1.4 Two-layer fusion scheme
The SC saliency is employed as the basic layer.
The enhancement layer is designed based on the CD and SD saliency.
where the weight factors $w_1$ and $w_2$ regulate the importance of the CD and the SD saliency, respectively. In our experiments, we set $w_1 = w_2 = 1$.
In the two-layer saliency fusion scheme, the basic layer (SC) always contributes, regardless of whether the CD or the SD is high or low. The enhancement layer (CD and SD) aims to attract more human attention when the CD contrast is strong or the SD is compact. As shown in Figure 1e,f, the saliency map constructed from the two-layer structure highlights the two salient objects (i.e., pastry and plate) more uniformly than the pooling mechanism used in .
3.2 Temporal attention model
In temporal attention models, temporal saliency maps are often calculated from the temporal gradient, which is computed using the intensity difference between successive frames. Such models work well when the camera is static. Once the camera moves, the estimated saliency maps incorporate much noise. In this study, we find that an object exhibiting high motion saliency in a video sequence usually has the following properties: (a) there are clear motion patterns in the scene; (b) the motion of the object differs from the global motion of the scene; (c) compared with the size of the scene, the object is relatively small.
where $V_p$ and $V_q$ are the optical flow of pixels p and q in frame I, respectively, $D(V_p, V_q)$ is the vector difference between the optical flow of pixels p and q, and $|\cdot|$ represents the magnitude of a vector.
Finally, the temporal saliency map is normalized to a fixed range [0,1].
An example is shown in Figure 2. It is clear that the temporal attention model can suppress the background noise efficiently. In Figure 2a, the camera is fixed and the global motion is nearly static. Compared with the background, the moving pedestrian produces a highly salient region in the frame. In Figure 2b, the camera tracks a pedestrian such that the person has small optical flow, while the background has large motion. In this case, the direction of the global motion is opposite to that of the person. A clear saliency map can still be obtained by using our model. If a dynamic scene has strong motion contrast, the main motion will be concentrated in a few directions and the moving object will pop out explicitly.
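The global motion contrast described in this section can be sketched as follows. This is a minimal NumPy illustration rather than the HOAOF descriptor itself: it assumes a dense flow field is already available as an (H, W, 2) array, it quantizes raw flow vectors instead of building patch-level histograms, and the names `temporal_saliency` and `n_levels` are hypothetical.

```python
import numpy as np

def temporal_saliency(flow, n_levels=16):
    """Global motion contrast of a dense flow field of shape (H, W, 2).

    Each pixel's saliency is the sum of |V_p - V_q| over all pixels q,
    evaluated on quantized flow vectors so that the cost depends on the
    number of distinct vectors rather than on the number of pixel pairs.
    """
    h, w, _ = flow.shape
    v = flow.reshape(-1, 2).astype(np.float64)
    # Quantize each flow component to n_levels values (illustrative choice).
    lo, hi = v.min(axis=0), v.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    q = np.round((v - lo) / span * (n_levels - 1))
    uniq, inverse, counts = np.unique(
        q, axis=0, return_inverse=True, return_counts=True)
    inverse = np.ravel(inverse)
    centers = uniq / (n_levels - 1) * span + lo   # representative flow vectors
    # Magnitude of the vector difference between every pair of representatives.
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    contrast = dist @ counts                      # weighted sum over all pixels q
    sal = contrast[inverse].reshape(h, w)
    rng = sal.max() - sal.min()
    # Normalize to the fixed range [0, 1].
    return (sal - sal.min()) / rng if rng > 0 else np.zeros_like(sal)
```

Because the contrast is measured against all other pixels, a small object whose motion differs from the dominant motion receives a high value whether the object moves over a still background or stays nearly still while the background moves, which mirrors the two cases discussed for Figure 2.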
3.3 Adaptive fusion
We have obtained the spatial and the temporal saliency maps separately. The two maps need to be fused in a meaningful way to generate the final spatiotemporal saliency map. It has been shown that the human visual system is more sensitive to motion information than to static signals. Consider a dynamic scene in which the camera is tracking a pedestrian, while the background moves in the direction opposite to the camera’s movement. In general, people are more interested in the followed person than in the surrounding regions. In surveillance video, the camera is fixed and the moving objects attract more human attention than the static background. In these examples, motion contrast is the prominent feature for saliency detection compared with other features, such as intensity, texture, and color. In contrast, if the motion of the video is cluttered or the motion contrast is insignificant, human attention is attracted more to the contrasts caused by the static visual stimuli. Thus, a simple linear combination of the spatial saliency map and the temporal saliency map with fixed weights may lead to unsatisfactory results. Instead, we adopt an adaptive fusion scheme, which is consistent with the above considerations. The adaptive fusion scheme gives a higher weight to the temporal saliency map when strong motion contrast is present in the dynamic scene. Conversely, a higher weight is assigned to the spatial saliency map when the motion contrast is weak.
where L is the number of bins and $h_i$ is the value of the $i$th bin in the HOAOF. The more cluttered the distribution of motion directions in a video frame is, the larger the entropy is, and vice versa.
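To make the behavior of the motion entropy concrete, the sketch below assumes the standard Shannon form $E = -\sum_{i=1}^{L} h_i \log h_i$ over a normalized histogram. A plain histogram of flow directions stands in for the HOAOF, and the function name, the bin count, the magnitude cutoff, and the handling of motionless frames are all assumptions of this example.

```python
import numpy as np

def motion_entropy(flow, n_bins=8, min_mag=1e-3):
    """Entropy of the flow-direction histogram of one frame.

    flow has shape (H, W, 2); n_bins plays the role of L.
    """
    vx, vy = flow[..., 0].ravel(), flow[..., 1].ravel()
    moving = np.hypot(vx, vy) > min_mag    # ignore near-zero vectors (assumption)
    if not moving.any():
        # No measurable motion: treat motion as uninformative by returning
        # the maximum entropy (an assumption of this sketch).
        return float(np.log(n_bins))
    angles = np.arctan2(vy[moving], vx[moving])        # directions in [-pi, pi]
    hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    h = hist / hist.sum()                  # normalized bin values h_i
    h = h[h > 0]                           # 0 * log 0 is treated as 0
    return float(-(h * np.log(h)).sum())
```

When all motion falls into a single direction bin the entropy is zero, and when the directions are spread evenly over all bins it reaches its maximum of log L, which matches the qualitative statement above.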
4 Experimental results
In this section, we first introduce the datasets used for performance evaluation. Then, we compare the proposed method with three state-of-the-art methods [11, 13, 14] and provide the qualitative and quantitative results, respectively.
4.1 Video sequences datasets
The performance of the proposed algorithm is evaluated extensively on two types of videos, named Video Set 1 and Video Set 2. Video Set 1 contains surveillance videos collected from PETS2001. There are 6,000+ images in total. In this dataset, the camera is fixed and the background is still. People’s attention is mainly attracted to the moving objects, such as the pedestrian and the moving car. Example frames are shown in Figure 4. Since the moving objects are small, we use their bounding boxes as the ground truth. We collect Video Set 2, which has 60 video clips, from the Internet and video segmentation datasets. Each video clip contains about 60–200 frames with the same salient objects. There are 6,000+ images in total. Different from Video Set 1, in this dataset the camera is moving, or the background presents cluttered motion when the camera is still. This means that both the objects and the background of the scenes are moving. Since the salient objects in Video Set 2 are large, the annotated ground truth masks are object-contour based.
4.2 Performance evaluation
In Figure 4a, we show representative frames of Video Set 1, together with the saliency detection results of the proposed method at different stages. Figure 4b is the computed spatial saliency map. Figure 4c is the temporal saliency map. The fused spatiotemporal saliency map is presented in Figure 4d. The figure shows that the spatial saliency map does not highlight the salient object successfully. The main reason is that the scene has a highly textured background and the static features of the small foreground objects are not significantly distinctive. However, compared with the still background, the moving foreground objects have strong motion contrast. This leads to a temporal saliency map which detects the moving salient objects clearly. In our adaptive fusion scheme, the strong motion contrast makes the temporal attention model the dominant contribution to the final spatiotemporal saliency map. As shown in Figure 4d, the effect of the spatial saliency map is negligible. Another example, from Video Set 2, is presented in Figure 5. The video records a bird with a fixed camera in the wild, where the branches in the background present cluttered motion. The motion contrast in the scene is weak, so the spatial saliency is dominant over the temporal saliency in the adaptive fusion scheme. The spatial, the temporal, and the spatiotemporal saliency maps are shown in Figure 5b–d, respectively.
with $w_t = e^{-\alpha E}$, where $\alpha$ is a constant factor that adjusts the weight. In our experiments, we set $\alpha = 0.15$. Our model can also deal with static images easily by setting $w_t = 0$.
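The entropy-driven weighting can be sketched in a few lines. The weight follows the expression given above; the fusion equation itself is not reproduced in this text, so the convex combination of the two maps used here is an assumption, as is the function name `fuse_saliency`.

```python
import numpy as np

def fuse_saliency(spatial, temporal, entropy, alpha=0.15):
    """Fuse two saliency maps with an entropy-dependent temporal weight.

    spatial, temporal: arrays of equal shape with values in [0, 1].
    entropy: motion entropy E of the current frame.
    Returns the fused map and the temporal weight w_t.
    """
    w_t = np.exp(-alpha * entropy)      # w_t = exp(-alpha * E), in (0, 1]
    # Assumed combination: temporal map weighted by w_t, spatial by 1 - w_t.
    fused = w_t * temporal + (1.0 - w_t) * spatial
    return fused, w_t
```

With this form, a frame whose motion is concentrated in few directions (small E) is dominated by the temporal map, while cluttered motion (large E) shifts the weight toward the spatial map.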
Furthermore, the proposed model can be employed to extract the salient objects from video sequences by thresholding the spatiotemporal saliency map with a moderate threshold. To this end, non-parametric significance testing is adopted. We compute the empirical PDF from all the saliency values and set a threshold at the 95% confidence level to decide whether a given value lies in the extreme right tail of the estimated distribution.
In addition to the comparison with the methods in [13, 14], we also compare the proposed method with Liu et al.’s model, which is a salient object detection method. In Liu et al.’s model, a group of static and dynamic saliency features are computed and the optimal linear weights are learned through a CRF learning method. Given an image pair, the model outputs a binary label map, which is further transformed to a bounding rectangle representing the salient object. To facilitate comparison, we take the binary label map of Liu et al.’s model as the detection result.
In Video Set 1, the background is static, which is different from Video Set 2. Training on the two video sets together can degrade the overall performance of salient object detection, so we train the CRFs on the two datasets separately. In Video Set 1, the surveillance video is divided into 60 video segments. We randomly select 40 video segments with 2,000+ image pairs to construct a training set, and use the others for testing. In Video Set 2, we likewise randomly select 40 video segments with 2,000+ image pairs to construct a training set, and use the others for testing.
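The thresholding step described above can be read as keeping only the values in the top 5% of the empirical distribution. The sketch below takes that reading for a single saliency map; the function name, and the choice to estimate the distribution per map rather than over a whole sequence, are assumptions of this example.

```python
import numpy as np

def extract_salient_mask(saliency, confidence=0.95):
    """Binary mask of pixels in the right tail of the saliency distribution.

    The threshold is the empirical `confidence` quantile of all values
    in `saliency`; pixels strictly above it are labeled salient.
    """
    threshold = np.quantile(saliency, confidence)
    return saliency > threshold, threshold
```

For example, on a 10×10 map whose 100 values are all distinct, the default setting marks the 5 largest values as salient.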
Performance evaluation for salient object extraction
In this section, we first discuss the difference between our approach for salient region detection and other saliency detection models similar to Itti et al.’s model. Furthermore, we discuss the limitations of the proposed method and present a failure analysis.
Salient region versus visual saliency
Salient region detection is different from the visual saliency computation in [10, 36] or others based on biologically plausible computational models of attention. Itti et al.’s model and those similar to it usually focus on mimicking the properties of vision and predicting eye fixations. The resulting saliency maps often overemphasize small, purely local features and fail to detect the internal part of the target, which makes the approach less useful for applications such as segmentation and detection. This kind of model is usually evaluated by comparing the saliency map with the real human attention density map. Salient region detection is a computational approach that is inspired by the biological theory but is closely related to typical applications in computer vision, such as adaptive content delivery, adaptive region-of-interest-based image compression, salient object segmentation, and object recognition. The resulting saliency map can uniformly highlight the entire salient regions in scenes. This kind of model is usually evaluated by comparing the resulting saliency map with a manually labeled binary ground-truth mask, as in [18, 38].
In this article, we propose a novel spatiotemporal salient region detection framework based on global scheme. The saliency maps are calculated separately by using the static and motion information of the videos. In the spatial attention model, we adopt joint embedding of spatial and color cues. The pixel-level saliency map is computed by using three components which are SC, CD, and SD. In the temporal attention model, the dense optical flow is used to calculate the global motion contrast of object in dynamic scene. To suppress the produced noise while estimating optical flow, a novel HOAOF is proposed to measure the motion contrast. To achieve the final spatiotemporal saliency map, an adaptive fusion scheme is adopted to combine the spatial and the temporal saliency. The dynamic weights of the two individual components are controlled by the motion entropy of the video frames. Extensive experiments show that the proposed method can obtain higher quality salient region detection results than existing methods no matter whether the camera is fixed or not.
This study was supported by the National Natural Science Foundation of China (No. 61179060) and by the grants from the Fundamental Research Funds for the Central Universities (No. ZYGX2012J019).
- Frintrop S, Rome E, Christensen H: Computational visual attention systems and their cognitive foundations: a survey. ACM Trans. Appl. Perception (TAP) 2010, 7(1):1-39.
- Meng F, Li H, Liu G, Ngan K: Object co-segmentation based on shortest path algorithm and saliency model. IEEE Trans. Multimed 2012, 14(5):1429-1441.
- Jung C, Kim C: A unified spectral-domain approach for saliency detection and its application to automatic object segmentation. IEEE Trans. Image Process 2012, 21(3):1272-1283.
- Li H, Ngan K: Saliency model-based face segmentation and tracking in head-and-shoulder video sequences. J. Visual Commun. Image Represent 2008, 19(5):320-333. 10.1016/j.jvcir.2008.04.001
- Gao D, Vasconcelos N: Integrated learning of saliency, complex features, and object detectors from cluttered scenes. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition (CVPR), Vol. 2. Los Alamitos, CA, USA; 2005:282-287.
- Li H, Ngan K: A co-saliency model of image pairs. IEEE Trans. Image Process 2011, 20(12):3365-3375.
- Cheng W, Wang C, Wu J: Video adaptation for small display based on content recomposition. IEEE Trans. Circuits Syst. Video Technol 2007, 17(1):43-58.
- Mahadevan V, Vasconcelos N: Background subtraction in highly dynamic scenes. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition (CVPR). Piscataway, NJ, USA; 2008:1-6.
- Liu K: Prediction error preprocessing for perceptual color image compression. EURASIP J. Image Video Process 2012, 2012(1):1-14. 10.1186/1687-5281-2012-1
- Itti L, Koch C, Niebur E: A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell 1998, 20(11):1254-1259. 10.1109/34.730558
- Liu T, Yuan Z, Sun J, Wang J, Zheng N, Tang X, Shum H: Learning to detect a salient object. IEEE Trans. Pattern Anal. Mach. Intell 2011, 33(2):353-367.
- Itti L, Dhavale N, Pighin F: Realistic avatar eye and head animation using a neurobiological model of visual attention. In SPIE. San Diego, CA, USA; 2003:64-78.
- Kim W, Jung C, Kim C: Spatiotemporal saliency detection and its applications in static and dynamic scenes. IEEE Trans. Circuits Syst. Video Technol 2011, 21(4):446-456.
- Guo C, Zhang L: A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression. IEEE Trans. Image Process 2010, 19(1):185-198.
- Luo W, Li H, Liu G, Ngi Ngan K: Global salient information maximization for saliency detection. Signal Process.: Image Commun 2011, 27(3):238-248.
- Zhai Y, Shah M: Visual attention detection in video sequences using spatiotemporal cues. In Proceedings of the 14th annual ACM international conference on Multimedia. New York, NY, USA; 2006:815-824.
- Hou X, Zhang L: Saliency detection: a spectral residual approach. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition (CVPR). Piscataway, NJ, USA; 2007:1-8.
- Achanta R, Hemami S, Estrada F, Susstrunk S: Frequency-tuned salient region detection. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition (CVPR). Piscataway, NJ, USA; 2009:1597-1604.
- Li H, Ngan K: Unsupervised video segmentation with low depth of field. IEEE Trans. Circuits Syst. Video Technol 2007, 17(12):1742-1751.
- Xu L, Li H, Wang Z: Saliency detection from joint embedding of spatial and color cues. In 2012 IEEE International Symposium on Circuits and Systems (ISCAS). Piscataway, NJ, USA; 2012:2673-2676.
- Xu L, Li H, Zeng L, Ngan KN: Saliency detection using joint spatial-color constraint and multi-scale segmentation. J. Visual Commun. Image Represent 2013, 24(4):465-476. 10.1016/j.jvcir.2013.02.007
- Cerf M, Harel J, Einhäuser W, Koch C: Predicting human gaze using low-level saliency combined with face detection. In Advances in Neural Information Processing Systems. New York, NY, USA; 2008:241-248.
- Judd T, Ehinger K, Durand F, Torralba A: Learning to predict where humans look. In Proceedings of the International Conference on Computer Vision (ICCV). Piscataway, NJ, USA; 2009:2106-2113.
- Li J, Xu D, Gao W: Removing label ambiguity in learning-based visual saliency estimation. IEEE Trans. Image Process 2012, 21(4):1513-1525.
- Kountchev R, Nakamatsu K: Advances in Reasoning-Based Image Processing Intelligent Systems: Conventional and Intelligent Paradigms. New York: Springer; 2012.
- Le Meur O, Thoreau D, Le Callet P, Barba D: A spatio-temporal model of the selective human visual attention. In Proceedings of the International Conference on Image Processing (ICIP). Piscataway, NJ, USA; 2005:1-4.
- You J, Liu G, Li H: A novel attention model and its application in video analysis. Appl. Math. Comput 2007, 185(2):963-975. 10.1016/j.amc.2006.07.023
- Bur A, Wurtz P, Miiri R, Hugli H: Dynamic visual attention: competitive versus motion priority scheme. In Proceedings of the International Conference on Computer Vision Systems. Bielefeld, Germany; 2007:1-10.
- Li H, Xu L, Liu G: Two-layer average-to-peak ratio based saliency detection. Signal Process.: Image Commun 2013, 28(1):55-68. 10.1016/j.image.2012.10.004
- Freeman W, Adelson E, Liu C, et al.: Beyond pixels: exploring new representations and applications for motion analysis. Ph.D. thesis, Massachusetts Institute of Technology (2009)
- Brox T, Bruhn A, Papenberg N, Weickert J: High accuracy optical flow estimation based on a theory for warping. In Proceedings of the European Conference on Computer Vision (ECCV). Prague, Czech Republic; 2004:25-36.
- Bruhn A, Weickert J, Schnörr C: Lucas/kanade meets horn/schunk: combining local and global optical flow methods. Int. J. Comput. Vision 2005, 61(3):211-231.
- Fukuchi K, Miyazato K, Kimura A, Takagi S, Yamato J: Saliency-based video segmentation with graph cuts and sequentially updated priors. In Proceedings of the IEEE Conference on Multimedia and Expo (ICME). Piscataway, NJ, USA; 2009:638-641.
- Seo H, Milanfar P: Static and space-time visual saliency detection by self-resemblance. J. Vis 2009, 9(12):1-27. 10.1167/9.12.1
- Yeh A: More accurate tests for the statistical significance of result differences. In Proceedings of the 18th Conference on Computational Linguistics. Saarbrücken, Germany; 2000:947-953.
- Courboulay V, Silva MPD: Real-time computational attention model for dynamic scenes analysis: from implementation to evaluation. In Optics, Photonics, and Digital Technologies for Multimedia Applications. Brussels, France; 2012:1-15.
- Li H, Ngan KN, Liu Q: Faceseg: automatic face segmentation for real-time video. IEEE Trans. Multimed 2009, 11(1):77-88.
- Cheng M, Zhang G, Mitra N, Huang X, Hu S: Global contrast based salient region detection. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition (CVPR). Piscataway, NJ, USA; 2011:409-416.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.