A hierarchical graph model for object cosegmentation

EURASIP Journal on Image and Video Processing 2013, 2013:11

DOI: 10.1186/1687-5281-2013-11

Received: 9 June 2012

Accepted: 2 February 2013

Published: 26 February 2013


Given a set of images containing similar objects, cosegmentation is the task of jointly segmenting the objects from the set of images, and it has received increasing interest recently. To solve this problem, we present a novel method based on a hierarchical graph. The vertices of the hierarchical graph comprise pixels, superpixels and heat sources, and cosegmentation is performed as iterative object refinement across the three levels. With the inter-image connections in the heat source level and the intra-image connections in the superpixel level, we progressively update the object likelihoods by transferring messages across images via belief propagation, diffusing heat energy within each individual image via random walks, and refining the foreground objects in the pixel level via guided filtering. In addition, a histogram based saliency detection scheme is employed for initialization. We compare our method experimentally with state-of-the-art methods over several public datasets. The results verify that our method achieves better segmentation quality as well as higher efficiency.


Keywords: Cosegmentation; Hierarchical graph; Heat source; Saliency detection; Belief propagation; Random walks; Guided filtering

1 Introduction

The term “cosegmentation” was first introduced by Rother et al. [1] in 2006, referring to the problem of simultaneously segmenting “similar” foreground objects in a set of images. Here “similar” commonly means that the distribution of some appearance cues, such as color and texture, has to be similar across the images. Cosegmentation has many potential applications. It can be used for summarizing personal photo albums, guiding the editing of multiple images, boosting unsupervised object recognition, improving content based image retrieval and so on.

Since the introduction of the problem, various methods have been presented. Some methods handle multi-class cosegmentation, while others focus on binary cosegmentation. In this article, we are interested in binary cosegmentation and observe that for most of its applications several criteria should be met: (1) automation, i.e., it is executed without user interaction; (2) scalability, i.e., it can be applied to hundreds of images instead of two images or small image sets; (3) focusing on “object” instead of “stuff”, where “object” refers to “foreground things” such as a person or a bird, while “stuff” refers to “background regions” such as road or sky; (4) high segmentation accuracy; (5) low running time. Measured against these criteria, existing methods have some limitations. For example, the iCoseg system presented by Batra et al. [2] can obtain highly accurate results, but requires user input. The methods reviewed by Vicente et al. [3] all focus on cosegmenting two images. The recently presented CoSand [4] only extracts similar large regions, and thus it often omits the small foreground objects in the images. Methods based on topic discovery like [5–7] all take superpixels as computation nodes, and hence they suffer from detail loss because superpixels tend to merge foreground regions with the background. Some unsupervised object segmentation methods [8–11] extract objects from multiple images by iteratively learning class models and segmenting objects in the pixel level, but they are time-consuming because the employed optimization schemes like graphcut [12] and belief propagation [13] are inefficient with a large number of pixel nodes.

In this article, we try to meet these criteria by extracting the foreground objects with a three-level hierarchical graph model. As shown in Figure 1, the graph model is composed of the pixel, superpixel and heat source levels, in which superpixels are grouping units of pixels obtained by an over-segmentation method [14] and heat sources are the representative superpixels obtained by a bottom-up agglomerative clustering scheme. The term “heat source” is introduced in random walks [15], representing heat energy convergence points. Here, we adopt it to describe message transferring among images and heat energy diffusion within each individual image. The iterative object refinement operates at the three levels with different optimization schemes. The heat source level utilizes belief propagation [13] for message transferring. In the superpixel level, random walks [15] are employed for heat energy diffusion. In the pixel level, we refine the foreground objects within each image via guided filtering [16]. By doing so, the foreground objects are gradually extracted. In addition, we employ a histogram based saliency detection method [17] for initializing the object likelihoods.

Figure 1. An illustration of the hierarchical graph model for cosegmentation. The graph model is composed of the pixel, superpixel and heat source levels. The cosegmentation method is performed by message transferring among images in the heat source level, heat energy diffusion in the superpixel level and local refinement in the pixel level.

Our method is automatic and has the following advantages. (1) It is scalable. Since the superpixel and pixel levels both treat each image separately, and the integration in the heat source level only operates on a limited number of heat sources, the method is highly parallelizable and can easily be applied to large scale image collections. (2) It focuses on “object” instead of “stuff”. This is because our method is initialized by saliency detection, which can filter out background stuff. (3) It is computationally more efficient. Compared with methods [8, 9, 18] which perform message transferring among images using a large number of superpixels or pixels, our method uses a small number of heat sources and thus significantly reduces computation time. (4) It can preserve object boundaries. The method finally refines the object segmentation in the pixel level, and hence avoids the detail loss found in other superpixel based methods.

The remainder of this article is organized as follows. After summarizing the related study in Section 2, we present the hierarchical graph model in Section 3. The stages of object refinement along the model, including foreground initialization, local object refinement, message transferring and heat energy diffusion are described in Section 4. Experimental results are demonstrated in Section 5, and we conclude the article in the last section.

2 Related work

The solutions to cosegmentation can be roughly classified into two categories: clustering based methods [5–7, 19] and labeling based methods [3, 8–11, 18]. The former try to partition nodes (pixels or superpixels) in the images into distinct, semantically coherent clusters, while the latter aim at assigning each node a unique label.

2.1 Clustering based methods

Under the assumption that similar objects often recur in multiple images, clustering based methods employ clustering models to discover such frequent regions. The well-known clustering models include topic discovery models like probabilistic latent semantic analysis (PLSA) [20], and geometry based models like normalized cuts (NCut) [21]. Motivated by the success of topic discovery in text analysis, Russell et al. [5] first adopt PLSA to address the cosegmentation problem. Later, Cao et al. [6] and Zhao et al. [7] both present spatially coherent topic models to encode the spatial relationship of image patches which is ignored by the traditional topic models. Combining NCut and supervised classification technique, Joulin et al. [19] utilize a discriminative clustering scheme to tackle the cosegmentation problem. For speeding up, all clustering based methods take superpixels as computation nodes. The major limitation of these methods is the lower segmentation accuracy caused by the over-segmentation methods.

2.2 Labeling based methods

Considering the Markov property in the images, labeling based methods formulate cosegmentation as a Markov random field (MRF) energy minimization problem. Over the past decade, methods that use graphcut [12] to minimize MRF energy have become the standard for figure-ground separation.

One technique is to minimize an energy function that is a combination of a pairwise MRF energy and a histogram matching term. Histogram matching terms such as the L1-norm model [1], the L2-norm model [22] and the “reward” model [23] force the foreground histograms of a pair of images to be similar. Vicente et al. [3] review and compare these models. Yet these methods are limited to two images. Another technique, also called unsupervised object segmentation and including LOCUS [8], ClassCut [9], Arora et al. [10] and Chen et al. [11], performs object cosegmentation by iteratively learning the object geometric models and segmenting the foreground objects. The initialization stages of these methods play an important role in energy minimization. For example, LOCUS [8] takes the pre-trained mask and edge probability maps as the initial object models, while ClassCut [9] uses a general object detector [24] to locate objects. However, these methods are limited to segmenting objects with similar geometric shape. In contrast, the recently proposed cosegmentation method BiCos [18] is more general and is applicable to any non-rigid objects. BiCos [18] operates at two levels: the bottom level treats each image separately and uses graphcut [12] to refine foreground objects in the pixel level, whereas the top level takes superpixels as computation units and employs a discriminative classification to propagate information among images.

Our method falls into the last category. The main idea is to combine multiple schemes along a three-level hierarchical graph to refine foreground objects successively. In contrast to other labeling based methods [3, 8–11, 18], this method has the following characteristics: (1) it uses heat sources for message propagation among images, which can significantly reduce computation time; (2) it uses a saliency detection based initialization, which can remove the impact of background stuff; (3) instead of using graphcut [12] to refine objects in the pixel level, it introduces guided filtering [16] for local refinement. In experiments, we compare our method quantitatively and qualitatively with other state-of-the-art methods over several public datasets. As a result, our method achieves better segmentation quality as well as lower computation time.

3 The hierarchical graph model

3.1 Problem formulation

Given a set of images containing objects of the same class, $I = \{I_k, k = 1, \ldots, K\}$, the goal of cosegmentation is to simultaneously extract the foreground objects. We formulate this problem as a binary labeling $\mathcal{L} = \{L_k, k = 1, \ldots, K\}$, which assigns each pixel $x$ in the image $I_k$ a label $L_k(x)$. $L_k(x) = 0$ indicates that $x$ belongs to the background, whereas $L_k(x) = 1$ indicates that it belongs to the foreground. The best labeling follows maximum a posteriori estimation, i.e., $\mathcal{L}^* = \arg\max_{\mathcal{L}} p(\mathcal{L} \mid I)$. From the Bayesian perspective, $p(\mathcal{L} \mid I) \propto p(\mathcal{L})\, p(I \mid \mathcal{L})$, where $p(\mathcal{L})$ is the labeling prior and $p(I \mid \mathcal{L})$ is the observation likelihood. Under the assumption that the prior follows a uniform distribution and the observation likelihood is pairwise dependent among images, the posterior can be rewritten as:
$$p(\mathcal{L} \mid I) \propto \prod_{k} p(I_k \mid L_k) \prod_{(k_1, k_2)} p(I_{k_1}, I_{k_2} \mid L_{k_1}, L_{k_2}) \tag{1}$$
The corresponding energy function (i.e., $E(x) = -\log p(x)$) is:
$$E(\mathcal{L} \mid I) = \sum_{k} E_d(I_k \mid L_k) + \sum_{(k_1, k_2)} E_s(I_{k_1}, I_{k_2} \mid L_{k_1}, L_{k_2}) \tag{2}$$
The energy function combines the unary terms $E_d(\cdot)$ and the pairwise terms $E_s(\cdot,\cdot)$. In our study, the unary term is composed of two parts:
$$E_d(I_k \mid L_k) = E_{d1}(I_k \mid L_k) + E_{d2}(I_k \mid L_k, \theta_k) \tag{3}$$

where $E_{d1}(I_k \mid L_k)$ is derived from saliency detection, and $E_{d2}(I_k \mid L_k, \theta_k)$ is inferred under the guidance of an inherent object model. $\theta_k$ is the latent parameter set for the object model of $I_k$.

The pairwise term can be considered as a smoothness term, which penalizes inconsistent labeling among images. Ideally, this term should be formulated in the pixel level. For computational efficiency, we define it in the heat source level using appearance information (see Equation (8)). Minimizing the above energy with respect to all discrete labels is intractable. Instead, we first relax the labels, i.e., let $L_k(x) \in [0, 1]$ be the object likelihood, then iteratively update the likelihoods along a hierarchical graph model, and finally obtain the segmentation results by rounding.

3.2 The hierarchical graph and our method

As shown in Figure 1, the graph model is composed of three types of nodes: pixels, superpixels and heat sources. For each image, superpixels are the clustering units of coherent pixels, and heat sources are the representative superpixels located in the centers of the clustering regions formed by coherent superpixels. In our implementation, the superpixels are extracted by an over-segmentation method—Turbopixels [14]. The generation of heat sources will be described in detail in Section 4.2.

Based on the graph model, our method successively updates the object likelihoods by the following iteration: (1) estimating the latent parameters and refining the object segmentation, (2) transferring messages among images and diffusing heat energy within each individual image. Specifically, we first obtain the object likelihoods in each image with saliency detection [17], and then estimate the latent parameters to update the object likelihoods. The likelihoods of the heat sources are further updated among images via message transferring, which is carried out by belief propagation [13], and are then diffused to the other superpixels using random walks [15] within each individual image. The resulting likelihoods serve as input for the next iteration. In the following sections, we denote the updated object likelihoods at different stages by $L_{k,t}$, $t = 0, \ldots, 3$, $k = 1, \ldots, K$. To summarize the cosegmentation method presented in this article, we provide a high level overview of the method pipeline as follows.

  • Input: a set of images containing objects of the same class, $I = \{I_k, k = 1, \ldots, K\}$
  • Output: the cosegmentation results in the form of a binary labeling $\mathcal{L} = \{L_k, k = 1, \ldots, K\}$

Step 1. Initialization (Section 4.1)
  a) partition each image $I_k$ into a set of superpixels $S_k$ and extract heat sources $Z_k$.
  b) obtain the initial object likelihoods $L_{k,0}$ via saliency detection [17].
  c) estimate the latent parameter set $\theta_k$.
  d) acquire the updated object likelihoods $L_{k,1}$ via guided filtering [16].

Step 2. Global message transferring (Section 4.2)
  • Optimize the energy function defined in Equation (6) via belief propagation [13] to provide the updated object likelihoods $L_{\cdot,2}(Z)$ for all heat sources.

Step 3. Local heat energy diffusion (Section 4.3)
  • For each image $I_k$, the object likelihoods of the heat sources $L_{k,2}(Z_k)$ are diffused to the other superpixels $U_k = S_k - Z_k$ via random walks [15], obtaining $L_{k,2}(U_k)$.

Step 4. Local object refinement (Section 4.1)
  a) let $L_{k,3} = (L_{k,0} + L_{k,1} + L_{k,2}) / 3$.
  b) re-estimate the latent parameter set $\theta_k$.
  c) acquire the updated object likelihoods $L_{k,1}$ via guided filtering [16].

Step 5. Repeat Steps 2, 3, and 4 until convergence. The final labeling $L_k$ is obtained by binarizing $L_{k,3}$.

4 Hierarchical graph based object cosegmentation

4.1 Initialization and local refinement

One major visual characteristic of objects is that they often stand out as salient [24]. Based on this characteristic, we apply saliency detection to initially detect foreground regions in each image. Among the various saliency detection methods, we choose a recently proposed histogram based method [17] for its efficiency and effectiveness. Figure 2b shows the saliency detection result for Figure 2a. We define the initial object likelihoods $L_{k,0}$ as the saliency likelihoods.

Figure 2. Saliency detection based model initialization. (a) The input image, (b) the saliency detection result, (c) the segmentation result built on GMM, and (d) the segmentation result obtained after guided filtering.

The segmentation results obtained by thresholding saliency likelihoods often contain holes and ambiguous boundaries. Motivated by interactive segmentation methods, e.g., GrabCut [25], we utilize the inherent color Gaussian mixture model (GMM) in the image to update the object likelihoods. Two GMMs, one for the foreground and another for the background, are estimated in RGB color space. Each GMM is taken to be a full-covariance Gaussian mixture with $M$ components. The GMM parameters are defined as $\theta_k = \{\theta_k^J \mid J \in \{B, F\}\}$, in which $\theta_k^J = \{\theta_{m,k}^J \mid m = 1, \ldots, M\}$ and $\theta_{m,k}^J = (\mu_{m,k}^J, \Sigma_{m,k}^J, \omega_{m,k}^J)$. Here $(\mu_{m,k}^F, \Sigma_{m,k}^F, \omega_{m,k}^F)$ are the mean, covariance and weighting values for the foreground components, and $(\mu_{m,k}^B, \Sigma_{m,k}^B, \omega_{m,k}^B)$ are those for the background components. The GMM parameters are estimated from the initial likelihoods as follows: (1) given two thresholds $T_1$ and $T_2$ satisfying $0 < T_1 < T_2 < 1$, we label the pixels with $L_{k,0}(x) > T_1$ as foreground and those with $L_{k,0}(x) < T_2$ as background; (2) the colors of the foreground and background regions are each clustered into $M$ components using K-Means [26]; (3) for each component, we compute its parameters $\theta_{m,k}^J$ from the statistics of its pixels. The object likelihoods built on the GMMs are given by:
$$p(I_k(x) \mid \theta_k^J) = \max_{m} \, p(I_k(x) \mid \theta_{m,k}^J) \tag{4}$$
$$p(I_k(x) \mid \theta_{m,k}^J) = \omega_{m,k}^J \exp\left(-\|I_k(x) - \mu_{m,k}^J\| / \Sigma_{m,k}^J\right) / |\Sigma_{m,k}^J| \tag{5}$$

Segmenting objects by directly thresholding the updated object likelihoods results in noise, as shown in Figure 2c. We use guided filtering [16] to remove the noise. The main idea of guided filtering [16] is that, given the filter input $p$, the filter output $q$ is locally linear in the guidance map $I$: $q_i = a_x I_i + b_x$, $\forall i \in w_x$, where $w_x$ is a window with radius $r$ centered at the pixel $x$. By minimizing the difference between the filter input $p$ and the filter output $q$, i.e., $\mathrm{Err}(a_x, b_x) = \sum_{i \in w_x} \left( (p_i - q_i)^2 + \epsilon a_x^2 \right)$, we can obtain $a_x$, $b_x$ and the filter output $q$.
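For reference, the minimization above is a ridge regression within each window and has the well-known closed-form solution from the guided filtering paper [16]. The symbols $\mu_x$, $\sigma_x^2$ (mean and variance of the guidance map $I$ in $w_x$), $\bar{p}_x$ (mean of $p$ in $w_x$) and $|w|$ (number of pixels in a window) are notation introduced here for illustration and do not appear in the text above:

$$a_x = \frac{\frac{1}{|w|} \sum_{i \in w_x} I_i p_i - \mu_x \bar{p}_x}{\sigma_x^2 + \epsilon}, \qquad b_x = \bar{p}_x - a_x \mu_x$$

$$q_i = \bar{a}_i I_i + \bar{b}_i, \qquad \bar{a}_i = \frac{1}{|w|} \sum_{x:\, i \in w_x} a_x, \quad \bar{b}_i = \frac{1}{|w|} \sum_{x:\, i \in w_x} b_x$$

Because each pixel is covered by several windows, the output averages the coefficients of all windows that contain it. A larger $\epsilon$ pushes $a_x$ toward zero, so that $q$ approaches a plain local mean of $p$.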

Based on guided filtering [16], we perform local refinement in three steps: (1) obtaining the foreground likelihood map $L_{k,F} = \{p(I_k(x) \mid \theta_k^F)\}$ and the background likelihood map $L_{k,B} = \{p(I_k(x) \mid \theta_k^B)\}$; (2) filtering each of the two likelihood maps with the grayscale image of $I_k$ as the guidance map (denoting the filter outputs as $\hat{L}_{k,F}$ and $\hat{L}_{k,B}$); (3) defining the refined object likelihoods as $L_{k,1} = \hat{L}_{k,F} / (\hat{L}_{k,F} + \hat{L}_{k,B})$. Figure 2d shows the refinement result for Figure 2c. As can be seen, the guided filtering based scheme can significantly improve segmentation quality.
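The three refinement steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the function names `box_mean`, `guided_filter` and `refine_likelihood` are hypothetical, the guidance image is assumed to be grayscale with values in [0, 1], windows are clipped at the image border, and the clipping of negative filter outputs and the small constant `tiny` are safeguards added here.

```python
import numpy as np


def box_mean(img, r):
    """Mean over a (2r+1) x (2r+1) window, clipped at the image border."""
    h, w = img.shape
    # Integral image with a leading row and column of zeros.
    integral = np.zeros((h + 1, w + 1), dtype=np.float64)
    integral[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    y0 = np.clip(np.arange(h) - r, 0, h)
    y1 = np.clip(np.arange(h) + r + 1, 0, h)
    x0 = np.clip(np.arange(w) - r, 0, w)
    x1 = np.clip(np.arange(w) + r + 1, 0, w)
    total = (integral[y1][:, x1] - integral[y0][:, x1]
             - integral[y1][:, x0] + integral[y0][:, x0])
    area = (y1 - y0)[:, None] * (x1 - x0)[None, :]
    return total / area


def guided_filter(guide, p, r=7, eps=0.04):
    """Gray-guide guided filter: q = mean(a) * guide + mean(b)."""
    mean_i = box_mean(guide, r)
    mean_p = box_mean(p, r)
    cov_ip = box_mean(guide * p, r) - mean_i * mean_p
    var_i = box_mean(guide * guide, r) - mean_i * mean_i
    a = cov_ip / (var_i + eps)
    b = mean_p - a * mean_i
    return box_mean(a, r) * guide + box_mean(b, r)


def refine_likelihood(gray, fg_map, bg_map, r=7, eps=0.04, tiny=1e-12):
    """L_{k,1} = F_hat / (F_hat + B_hat) from the two filtered likelihood maps."""
    # Clipping is an added safeguard: filter outputs can slightly undershoot zero.
    fg = np.clip(guided_filter(gray, fg_map, r, eps), 0.0, None)
    bg = np.clip(guided_filter(gray, bg_map, r, eps), 0.0, None)
    return fg / (fg + bg + tiny)
```

The default values `r=7` and `eps=0.04` are the settings reported in the implementation details. Using box means computed from an integral image keeps the cost independent of the window radius.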

4.2 Global message transferring

Due to the diversity of realistic scenes, saliency based object segmentation sometimes fails to extract objects of the same class (see Figure 3c). The segmentation quality can be further boosted by sharing appearance similarity among images. Unlike other cosegmentation methods [8, 9, 18] which propagate the distributions of visual appearance in the pixel or superpixel level, we perform message propagation in the heat source level to reduce computation time.
Figure 3. The segmentation results obtained before and after message transferring. (a) The input images, (b) the saliency detection results, (c) the segmentation results obtained in the initial stage, and (d) the segmentation results obtained after message transferring.

As stated in Section 3, heat sources are the representative superpixels located in the centers of the clustering regions formed by coherent superpixels. The regions are formed by a bottom-up agglomerative clustering scheme. Specifically, given an image $I$, we first partition it into a collection of superpixels via Turbopixels [14] (see Figure 4b, in which superpixels are outlined in red). Then we build an intra-image graph $G^S = \langle S, Y^S \rangle$, where $S = \{s_i\}$ is the superpixel set and $Y^S = \{(s_i, s_j)\}$ is the edge set connecting all pairs of adjacent superpixels. The edge weight is defined by the Gaussian similarity between the normalized mean RGB colors of the nodes, i.e., $w(s_i, s_j) = \exp(-\|I(s_i) - I(s_j)\|^2) / \sigma_s$, where $\sigma_s$ is a variance constant. Based on the graph $G^S$, we use a greedy scheme to merge nodes one by one. Each time, we select the edge with the maximum weight and merge its two nodes. This step is repeated until all nodes are merged into $N$ regions. The central superpixel of each region is chosen as a heat source. Figure 4c shows the clustering regions overlaid with the heat sources, in which the regions are outlined in green and the heat sources are colored in blue.

Figure 4. An example of extracting superpixels and heat sources from an input image. (a) The input image, (b) the superpixels extracted by Turbopixels [14], outlined in red, and (c) the regions extracted by an agglomerative clustering scheme, outlined in green, with the extracted heat sources colored in blue.
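The greedy merging described above can be sketched as follows. This is a simplified illustration, not the authors' code: the function name `extract_heat_sources` and its arguments are hypothetical, the edge weights are computed once and not updated after a merge (the text does not say how merged nodes are re-weighted), and the "central" superpixel of a region is assumed to be the member whose centroid lies closest to the mean centroid of the region.

```python
import numpy as np


def extract_heat_sources(colors, positions, edges, n_regions, sigma_s=0.004):
    """Greedy bottom-up merging of adjacent superpixels.

    colors:    (n, 3) normalized mean RGB color per superpixel
    positions: (n, 2) centroid per superpixel
    edges:     list of (i, j) index pairs of adjacent superpixels
    Returns (labels, sources): a region label per superpixel and the index
    of one heat source per region.
    """
    n = len(colors)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Edge weights as written in the text, sorted from largest to smallest.
    # Assumption: weights are fixed up front and not recomputed after merges.
    weighted = sorted(
        ((np.exp(-np.sum((colors[i] - colors[j]) ** 2)) / sigma_s, i, j)
         for i, j in edges),
        key=lambda e: -e[0],
    )
    regions = n
    for _, i, j in weighted:
        if regions <= n_regions:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            regions -= 1

    labels = np.array([find(i) for i in range(n)])
    sources = []
    for root in np.unique(labels):
        members = np.flatnonzero(labels == root)
        center = positions[members].mean(axis=0)
        dist = np.sum((positions[members] - center) ** 2, axis=1)
        # Assumption: the heat source is the member nearest the region center.
        sources.append(int(members[np.argmin(dist)]))
    return labels, sources
```

With fixed weights the procedure is equivalent to single-linkage clustering restricted to adjacent superpixels, which is one plausible reading of the scheme; a variant that recomputes the mean color of merged regions would follow the same structure with a priority queue.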

For message transferring among images, we construct an inter-image graph $G^Z = \langle Z, Y^Z \rangle$. $G^Z$ is an undirected complete graph, where $Z = \{z_i \mid z_i \in Z_k, k = 1, \ldots, K\}$ includes all heat sources from the input images and $Y^Z = \{(z_i, z_j)\}$ connects all pairs of heat sources. We update the object likelihoods of the heat sources by minimizing a standard MRF energy function that is the sum of unary terms $E_1(\cdot)$ and pairwise terms $E_2(\cdot,\cdot)$:
$$E(L(Z)) = \sum_{z_i \in Z} E_1(z_i) + \lambda \sum_{(z_i, z_j) \in Y^Z} E_2(z_i, z_j) \tag{6}$$

where $\lambda$ is the weighting value balancing the trade-off between the unary terms and the pairwise terms.

The unary term $E_1(\cdot)$ imposes individual penalties for assigning any likelihood $L(z_i)$ to the heat source $z_i$. We rely on the object likelihoods $L_{\cdot,1}$ acquired in the previous stage to define this term:
$$E_1(z_i) = \left\| L(z_i) - \sum_{x \in z_i} L_{\cdot,1}(x) / |z_i| \right\| \tag{7}$$
The pairwise term $E_2(\cdot,\cdot)$ defines to what extent adjacent heat sources should agree. It often depends on local observation. In our study, the pairwise potential takes the form:
$$E_2(z_i, z_j) = w(z_i, z_j) \, |L(z_i) - L(z_j)| \tag{8}$$

where $w(z_i, z_j)$ is the edge weight, defined as $w(z_i, z_j) = \exp(-\|f(z_i) - f(z_j)\|^2) / \sigma_z$, and $\sigma_z$ is a variance constant. $f(z)$ is a nine-dimensional descriptor for the heat source $z$, including a three-dimensional mean Lab color feature, a four-dimensional mean texture feature and a two-dimensional mean position feature. This definition implies that the larger the weight of an edge, the more similar the labels of its two nodes.

We utilize belief propagation [13] to optimize the energy function over several rounds. The main idea of belief propagation is to iteratively update a set of message maps between neighboring nodes. The message maps, denoted by $\{m^t_{z_i \to z_j}(L(z_j)), t = 1, \ldots, T\}$, represent the message transferred from one node to another at each iteration. In our study, the message maps are initially set to zero and updated as follows:
$$m^t_{z_i \to z_j}(L(z_j)) = \min_{L(z_i)} \left( E_1(z_i) + \lambda E_2(z_i, z_j) + \sum_{z_k \in Z \setminus z_j} m^{t-1}_{z_k \to z_i}(L(z_i)) \right) \tag{9}$$

Finally, a belief vector is computed for each node, $b_{z_i}(L(z_i)) = E_1(z_i) + \sum_{z_j \in Z} m^T_{z_j \to z_i}(L(z_i))$, and the updated object likelihoods are expressed as $L_{\cdot,2}(z_i) = b_{z_i}(0) / (b_{z_i}(0) + b_{z_i}(1))$.
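The message update and the final belief ratio can be sketched as a small min-sum belief propagation routine. This is an illustrative sketch under assumptions, not the authors' implementation: the function name `bp_heat_sources` is hypothetical, each heat source is restricted to the two labels 0 and 1 (as suggested by the use of $b(0)$ and $b(1)$), the unary cost is the absolute difference between a label and the mean likelihood inside the heat source, and messages are not normalized between iterations.

```python
import numpy as np


def bp_heat_sources(prior, weights, lam=0.5, n_iters=5):
    """Min-sum belief propagation over binary labels on a complete graph.

    prior:   (n,) mean likelihood from the previous stage inside each heat source
    weights: (n, n) symmetric edge weights w(z_i, z_j); the diagonal is ignored
    Returns the updated likelihood of every heat source.
    """
    prior = np.asarray(prior, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n = prior.size
    labels = np.array([0.0, 1.0])
    # unary[i, l] = |l - prior_i|
    unary = np.abs(labels[None, :] - prior[:, None])
    # label_gap[a, b] = |a - b|
    label_gap = np.abs(labels[:, None] - labels[None, :])
    # msg[i, j, l]: message from node i to node j about label l of j (starts at zero).
    msg = np.zeros((n, n, 2))
    for _ in range(n_iters):
        incoming = msg.sum(axis=0)  # incoming[i, l] = sum over k of msg[k, i, l]
        new_msg = np.zeros_like(msg)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                # Cost of each label of i, excluding what j itself sent to i.
                base = unary[i] + incoming[i] - msg[j, i]
                cost = base[:, None] + lam * weights[i, j] * label_gap
                # Minimize over the label of i, for each label of j.
                new_msg[i, j] = cost.min(axis=0)
        msg = new_msg
    belief = unary + msg.sum(axis=0)
    # Beliefs are costs, so a high cost for label 0 means a high foreground likelihood.
    return belief[:, 0] / (belief[:, 0] + belief[:, 1])
```

Because the graph is complete, one iteration costs on the order of the squared number of heat sources, which stays small when only a few dozen heat sources are kept per image.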

4.3 Local heat energy diffusion

After global message transferring, the object likelihoods for heat sources preserve appearance similarity among images. We further diffuse them to other superpixels. As illustrated in the middle level of Figure 1, this is performed by heat energy diffusion within individual image. The heat energy diffusion can be imagined in the following situation: putting some heat sources in a metal plate, the heat energy will diffuse to other points as time goes by, finally each point will have a stable temperature. How to calculate such steady-state temperatures? This is a well-known Dirichlet energy minimization problem:
$$u^* = \arg\min_{u} E(u) = \arg\min_{u} \frac{1}{2} \int_{\Omega} |\nabla u|^2 \tag{10}$$
Grady [15] states the analogous problem in discrete space under the term “random walks”. Based on a graph $G^X = \langle X, Y^X \rangle$, where $X = \{x_i\}$ is the node set and $Y^X = \{(x_i, x_j)\}$ is the set of node pairs, the Dirichlet energy function takes the form:
$$E(u(X)) = \frac{1}{2} \sum_{(x_i, x_j) \in Y^X} w(x_i, x_j) \, (u(x_i) - u(x_j))^2 \tag{11}$$

where $w(x_i, x_j)$ is the edge weight for the adjacent node pair $(x_i, x_j)$.

In our study, the random walks operate on the graph $G_k^S = \langle S_k, Y_k^S \rangle$ for the image $I_k$, where $S_k = \{s_i\}$ is the superpixel set and $Y_k^S = \{(s_i, s_j)\}$ is the edge set connecting all pairs of adjacent superpixels. The corresponding energy function is:
$$E(L(S_k)) = \frac{1}{2} \sum_{(s_i, s_j) \in Y_k^S} w(s_i, s_j) \, (L(s_i) - L(s_j))^2 = \frac{1}{2} L(S_k)^T Q \, L(S_k) \tag{12}$$

where $Q = D - A$ is the Laplacian matrix, in which $A = \{w(s_i, s_j)\}$ is the edge weight matrix and $D$ is a diagonal matrix with the entries $D(s_i, s_i) = \sum_j w(s_i, s_j)$.

We divide the node set $S_k$ into two parts: the heat sources $Z_k$ and the remaining superpixels $U_k = S_k - Z_k$. The energy function can be rewritten as:
$$E(L(S_k)) = \frac{1}{2} \begin{bmatrix} L(Z_k)^T & L(U_k)^T \end{bmatrix} \begin{bmatrix} Q_{Z_k} & B \\ B^T & Q_{U_k} \end{bmatrix} \begin{bmatrix} L(Z_k) \\ L(U_k) \end{bmatrix} \tag{13}$$

where $Q_{Z_k}$ and $Q_{U_k}$ are the blocks of the Laplacian matrix corresponding to the node sets $Z_k$ and $U_k$, respectively, and $B$ is the block coupling the two sets.

Minimizing $E(L(S_k))$ amounts to differentiating it with respect to $L(U_k)$ and setting the derivative to zero, which yields $L(U_k) = -Q_{U_k}^{-1} B^T L(Z_k)$. $L(Z_k)$ are the object likelihoods acquired in the previous stage, i.e., $L(Z_k) = L_{\cdot,2}(Z_k)$. The diffused object likelihoods for $U_k$ are therefore obtained by $L_{\cdot,2}(U_k) = -Q_{U_k}^{-1} B^T L_{\cdot,2}(Z_k)$. The nonsingularity of $Q_{U_k}$ guarantees that the solution exists and is unique.
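The diffusion step reduces to a single linear solve, which can be sketched as follows. The function name `diffuse_heat` and its arguments are hypothetical, and a dense solver is used for brevity; with around a thousand superpixels per image a sparse solver would be the more natural choice.

```python
import numpy as np


def diffuse_heat(weights, source_idx, source_vals):
    """Diffuse heat source likelihoods to the remaining superpixels.

    weights:     (n, n) symmetric edge weight matrix A (zero for non-adjacent pairs)
    source_idx:  indices of the heat sources Z
    source_vals: likelihoods of the heat sources, in the same order
    Returns an (n,) array holding the likelihood of every superpixel.
    """
    weights = np.asarray(weights, dtype=float)
    source_idx = np.asarray(source_idx)
    source_vals = np.asarray(source_vals, dtype=float)
    n = weights.shape[0]
    # Laplacian Q = D - A.
    laplacian = np.diag(weights.sum(axis=1)) - weights
    rest_idx = np.setdiff1d(np.arange(n), source_idx)
    q_u = laplacian[np.ix_(rest_idx, rest_idx)]    # block for the unknown nodes U
    b_t = laplacian[np.ix_(rest_idx, source_idx)]  # coupling block B^T
    likelihood = np.empty(n)
    likelihood[source_idx] = source_vals
    # Solve Q_U L(U) = -B^T L(Z) instead of forming the inverse explicitly.
    likelihood[rest_idx] = -np.linalg.solve(q_u, b_t @ source_vals)
    return likelihood
```

For a chain of three superpixels with unit weights whose two ends are heat sources with likelihoods 0 and 1, the middle superpixel receives 0.5, the average of its neighbors, as expected for a harmonic solution.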

For each pixel $x$, its object likelihood $L_{\cdot,2}(x)$ is assigned as the object likelihood of the superpixel it belongs to. Taking $L_{k,3}(x) = (L_{k,0}(x) + L_{k,1}(x) + L_{k,2}(x)) / 3$ as input, we further invoke local refinement (see Section 4.1) to optimize the object segmentation. Figure 3 shows the segmentation results obtained before and after heat energy diffusion. As can be seen, although the saliency based initialization stage sometimes fails to extract the foreground objects, the stages of message transferring and heat energy diffusion can boost segmentation quality by sharing the visual similarity of objects among images.

5 Experimental results

We apply our hierarchical graph based cosegmentation method to five public datasets with varying scenarios and difficulty: Weizmann horses, Caltech-4, Oxford flowers, UCSD birds, and CMU iCoseg. All images in these datasets have ground truth masks, which allows us to evaluate segmentation performance quantitatively.

5.1 Datasets and implementation details

5.1.1 Weizmann horses

The Weizmann horses dataset has 324 images, each of which depicts a different instance of the horse class. All horses are shown in side view and face the same direction. Generally speaking, the horses follow a fixed geometric model and occupy most of each image.

5.1.2 Caltech-4

The Caltech-4 dataset includes four categories: airplane, car, face, and motorbike. We omit the grayscale car category and use the other three categories for evaluation. This is a large-scale dataset, in which the airplane and motorbike categories each contain 800 images, and the face category contains 435 images. Similar to the Weizmann horses dataset, each image of Caltech-4 depicts only one object, and the object occupies most of the image.

5.1.3 Oxford flowers

The Oxford flowers dataset has 17 different flower species with 80 images per category. Each image contains a finite number of repeating subjects. Some flowers like sunflower occupy most parts of the images, while others like lily of the valley scatter in the images.

5.1.4 UCSD birds

The UCSD birds dataset consists of 200 bird categories and 6033 images in total. This is a challenging dataset, where the birds appear in their natural habitat, change considerably in terms of viewpoint and illumination, and even in some cases only a part of the bird is visible.

5.1.5 CMU iCoseg

The CMU iCoseg dataset was introduced in [2]. It contains 643 images divided into 38 groups, which were collected in various real situations such as soccer players in a field, airshows in the sky, or a brown bear by a river. Apart from the background stuff, each group contains one or several foreground objects of the same class.

With these datasets, we are interested in two evaluations: (1) unsupervised object segmentation over the Weizmann horses and Caltech-4 datasets, where each image captures only one object and the objects typically have a fixed orientation and a well-defined geometric shape; (2) object cosegmentation on the Oxford flowers, UCSD birds and CMU iCoseg datasets, where each image contains one or several objects that appear in their natural habitat. The first evaluation is performed to quantitatively compare our method with several traditional unsupervised object segmentation methods [8–10] which are only applicable in this setting. The second evaluation tests how well our method works with real world data.

5.1.6 Implementation details

In the initialization stage, we partition each image into 1000 or fewer superpixels, and extract about $N = 50$ heat sources from these superpixels. The other parameters are set as follows: the GMM component number $M = 5$, the thresholds $T_1 = 0.38$ and $T_2 = 0.52$, the guided filtering parameters $r = 7$ and $\epsilon = 0.04$, the variances $\sigma_s = 0.004$ and $\sigma_z = 0.08$, and the weighting value $\lambda = 0.5$. All experiments are performed on a computer with a 2.9 GHz CPU and 2 GB RAM.

5.2 Evaluation on Weizmann horses and Caltech-4

Here we compare our method over the Weizmann horses and Caltech-4 datasets with four related methods, including LOCUS [8], ClassCut [9], Arora et al. [10] and BiCos [18]. LOCUS [8], ClassCut [9], and Arora et al. [10] all take advantage of the objects’ inherent geometric models to jointly extract the foreground objects. In contrast, our method and BiCos [18] make no assumption about the foreground objects’ geometric shape. Given a ground truth mask, the segmentation accuracy is measured by the ratio of correctly labeled pixels with respect to the total number of pixels. According to the performance reported in their articles, Table 1 summarizes the segmentation accuracies over the four classes.
Table 1. The average segmentation accuracies obtained with LOCUS [8], ClassCut [9], Arora et al. [10], BiCos [18] and our method over the Weizmann horses and Caltech-4 datasets

Columns: Weizmann horses, Caltech airplane, Caltech face, Caltech motorbike
Rows: LOCUS [8], ClassCut [9], Arora et al. [10], BiCos [18], Our method

The values in bold indicate the best results.

As can be seen, LOCUS [8], ClassCut [9] and Arora et al. [10] achieve better performance on the horse, motorbike and face categories, respectively. The reason is that the geometric models employed in those methods can strongly separate the foreground and background regions. Yet BiCos [18] and our method can still achieve competitive performance even without geometric models. Our method outperforms BiCos [18] on the airplane, face and motorbike categories, while BiCos [18] performs better on the horse category.

5.3 Evaluations on Oxford flowers, UCSD birds and CMU iCoseg

As baselines, three state-of-the-art methods (Joulin et al. [19], CoSand [4], and ClassCut [9]) are evaluated using their implementations with the default parameter settings. Joulin et al. [19] is a clustering based method, which takes superpixels as basic units and utilizes discriminative clustering to find common objects. CoSand [4] takes the large coherent, appearance similar regions among images as the foreground objects. ClassCut [9] is an energy iteration based method, which first obtains object bounding boxes by [24], and then builds a common class model with color, shape and position cues, finally extracts foreground objects via iteratively optimizing an MRF energy function and updating the class model.

The segmentation accuracy is defined as the proportion of pixels correctly classified as foreground or background by comparing the segmentation results with the ground truth. We take the form: F_Measure = 2 · pre · rec / (pre + rec), where pre is the ratio of true positive pixels (i.e., pixels labeled as foreground that actually belong to the foreground) to all pixels labeled as foreground, and rec is the ratio of true positive pixels to all ground truth foreground pixels. The average segmentation accuracies across all images are shown in Table 2. Several examples from the Oxford flowers, UCSD birds and CMU iCoseg datasets can be seen in Figure 5.
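The F-measure defined above can be computed from a predicted mask and a ground truth mask as follows. This is a minimal sketch under stated assumptions: masks are nested lists with 1 for foreground and 0 for background, the name `f_measure` is illustrative, and returning 0.0 when a denominator is zero is a convention chosen here, not one stated in the article.

```python
from typing import Sequence, Tuple

# Assumed representation: a mask is a list of rows, each pixel 0 or 1.
Mask = Sequence[Sequence[int]]


def f_measure(predicted: Mask, ground_truth: Mask) -> Tuple[float, float, float]:
    """Return (pre, rec, F_Measure) for two binary masks of equal shape."""
    tp = fp = fn = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1  # labeled foreground, truly foreground
            elif p and not g:
                fp += 1  # labeled foreground, truly background
            elif g and not p:
                fn += 1  # labeled background, truly foreground
    # Assumed convention: an undefined ratio is reported as 0.0.
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, f


# Toy 2x3 masks: 2 true positives, 1 false positive, 1 false negative.
predicted = [[1, 1, 0], [0, 1, 0]]
ground_truth = [[1, 0, 0], [0, 1, 1]]
pre, rec, f = f_measure(predicted, ground_truth)
print(pre, rec, f)
```

Unlike plain pixel accuracy, this score ignores true negative pixels, so it is not inflated by a large background.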
Table 2

The segmentation performance of CoSand [4], ClassCut [9], Joulin et al. [19] and our method over the Oxford flowers, UCSD birds and CMU iCoseg datasets


Oxford flowers

UCSD birds

CMU iCoseg








CoSand [4]







ClassCut [9]







Joulin et al. [19]







Our method (initial)







Our method (final)







The values in bold indicate the best results.

Figure 5

Segmentation comparison with ClassCut [9], Joulin et al. [19] and CoSand [4] on the Oxford flowers, UCSD birds and CMU iCoseg datasets. The regions in white indicate the foreground objects, while the regions in black stand for the backgrounds. (a) The input images, (b) ClassCut [9]’s results, (c) Joulin et al. [19]’s results, (d) CoSand [4]’s results, and (e) our method’s results.

5.3.1 Overall performance

As illustrated in Table 2 and Figure 5, our method outperforms the three methods in terms of both segmentation accuracy and computation time. The method of Joulin et al. [19] takes superpixels as basic units, so the objects’ boundaries are not clearly delineated, as some superpixels merge foreground and background regions together. CoSand [4] focuses only on extracting the large coherent regions and therefore performs poorly on the figure-ground separation task. For example, it extracts only the black regions in the panda image set, failing to detect the white regions as foreground objects. ClassCut [9] can extract most of the foreground regions, but it tends to omit some fragile regions such as the petals in the Oxford flowers dataset. This is because the over-segmentation method it adopts merges the boundaries with the backgrounds. In contrast, our method can extract the whole foreground object accurately, regardless of whether it is composed of one or several appearance distributions. We attribute this to the initialization scheme and the appearance sharing among images.

The benefit of segmenting all images together has been qualitatively shown in Figure 3. In Table 2, we quantitatively compare the segmentation accuracies obtained before and after sharing appearance similarity among images: the accuracies improve from 0.67, 0.52 and 0.64 to 0.84, 0.68 and 0.74 for the Oxford flowers, UCSD birds and CMU iCoseg datasets, respectively. Figure 6 compares some segmentation results obtained in the initialization and last stages. We can observe that most errors introduced in the initialization stage are eventually rectified.
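The size of this improvement can be worked out directly from the six accuracies quoted above. The snippet below is only an arithmetic aid; the helper name `gains` and the rounding precision are choices made for the example.

```python
# Accuracies quoted in the text, before and after sharing appearance
# similarity among images.
datasets = ["Oxford flowers", "UCSD birds", "CMU iCoseg"]
initial = [0.67, 0.52, 0.64]
final = [0.84, 0.68, 0.74]


def gains(before, after):
    """Return (absolute gain, relative gain in percent) for each pair."""
    result = []
    for b, a in zip(before, after):
        absolute = round(a - b, 2)
        relative = round(100 * (a - b) / b, 1)
        result.append((absolute, relative))
    return result


for name, (absolute, relative) in zip(datasets, gains(initial, final)):
    print(f"{name}: +{absolute:.2f} absolute, +{relative:.1f}% relative")
```

The absolute gains are 0.17, 0.16 and 0.10, so the datasets with the weakest initialization benefit at least as much as the one with the strongest.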
Figure 6

Segmentation results obtained before and after sharing appearance similarity. The white regions denote the foreground objects, while the black regions stand for the backgrounds. (a) The input images, (b) the segmentation results obtained in the initial stage, and (c) the segmentation results obtained in the final stage.

5.3.2 Initialization performance

One contribution of our method is applying saliency detection with guided filtering to obtain the initial foreground regions. To verify this stage’s effectiveness, we compare it with other initialization schemes, including GrabCut [25] used in BiCos [18], the large coherent regions presented in CoSand [4] and the initialization stage of ClassCut [9]. Since the initialization stages are all performed on individual images, we randomly select 100 images from the three datasets for comparison.

In BiCos [18], GrabCut [25] estimates the foreground regions by optimizing an MRF energy function with the foreground and background color models. The foreground model is estimated with a bounding box in the center (50% of the image size) and the background model is estimated from the rest. In CoSand [4], the foreground region comes from K-way segmentation. As suggested in the article, the number of segments K ranges from two to eight and the highest accuracies are reported. In ClassCut [9], a class model with shape, location and color cues is initialized by an object detector [24], and the foreground regions are estimated by optimizing an MRF energy function with the class model.
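The centered bounding box used to seed GrabCut can be sketched as below. The phrase "50% of the image size" is read here as half the width and half the height, which is an assumption: it could also mean half the image area. The function name `centered_box` and its `fraction` parameter are hypothetical.

```python
from typing import Tuple


def centered_box(width: int, height: int,
                 fraction: float = 0.5) -> Tuple[int, int, int, int]:
    """Return (x, y, w, h) of a centered box.

    Assumption: `fraction` scales each image side, so the default box
    covers half the width and half the height of the image.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    box_w = max(1, round(width * fraction))
    box_h = max(1, round(height * fraction))
    x = (width - box_w) // 2  # left edge of the box
    y = (height - box_h) // 2  # top edge of the box
    return x, y, box_w, box_h


print(centered_box(640, 480))  # (160, 120, 320, 240)
```

The `(x, y, w, h)` layout matches the rectangle argument that OpenCV's `cv2.grabCut` accepts in `cv2.GC_INIT_WITH_RECT` mode, where pixels outside the rectangle are treated as background. Whether BiCos [18] used that implementation is not stated in the text.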

Table 3 shows the average segmentation accuracies as well as the computation time for the different initialization schemes. As can be seen, our initialization scheme achieves the best performance on the UCSD birds and CMU iCoseg datasets, while GrabCut [25] reports higher accuracy than ours on the Oxford flowers dataset. We believe that this is due to the characteristics of that dataset, where the objects tend to be centered in the image and have good contrast with the backgrounds. Under such constrained conditions, the class models can be accurately estimated by GrabCut [25]. In contrast, the UCSD birds and CMU iCoseg datasets are more general, which suggests that our method is more flexible in real situations. Besides, our initialization scheme is significantly faster than its competitors.
Table 3

The segmentation performance obtained by the initial stages of BiCos [18], CoSand [4], ClassCut [9] and our method over the Oxford flowers, UCSD birds and CMU iCoseg datasets


Oxford flowers

UCSD birds

CMU iCoseg








BiCos [18]







CoSand [4]







ClassCut [9]







Our method







The values in bold indicate the best results.

5.3.3 Running time

One advantage of our method is its efficiency. Table 2 compares the running time of our method with that of the others. To examine how the time is spent over the whole process, we analyze each step’s performance on the Oxford flowers, UCSD birds and CMU iCoseg datasets. As shown in Table 4, most of the time is spent on extracting superpixels, while the main stages of the method, including saliency detection, local refinement, global message transferring and heat energy diffusion, cost only 8.01 s in total for the Oxford flowers dataset, 4.92 s for the UCSD birds dataset and 4.32 s for the CMU iCoseg dataset.
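A per-stage breakdown of this kind can be collected with a small timer. The sketch below is a generic illustration and not the instrumentation used for Table 4: the class `StageTimer` is hypothetical, the stage names are taken from the Table 4 column headers, and the workloads are placeholders standing in for the real stages.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named stage."""

    def __init__(self):
        self.seconds = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed

    def total(self):
        return sum(self.seconds.values())

    def shares(self):
        """Fraction of the total time spent in each stage."""
        total = self.total()
        if total == 0:
            return {name: 0.0 for name in self.seconds}
        return {name: s / total for name, s in self.seconds.items()}


timer = StageTimer()
stage_names = [
    "Superpixel extraction",
    "Heat source extraction",
    "Saliency detection",
    "Local refinement",
    "Heat energy transfer and diffusion",
]
for name in stage_names:
    with timer.stage(name):
        sum(range(100_000))  # placeholder workload, not a real stage

for name, share in timer.shares().items():
    print(f"{name}: {timer.seconds[name]:.4f} s ({share:.0%})")
print(f"Total time (s): {timer.total():.4f}")
```

Reporting shares next to absolute times makes it easy to see when one stage, such as superpixel extraction, dominates the total.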
Table 4

The running time cost by each stage of our method over the Oxford flowers, UCSD birds and CMU iCoseg datasets


Superpixel extraction

Heat source extraction

Saliency detection

Local refinement

Heat energy transfer and diffusion

Total time (s)

Oxford flowers







UCSD birds







CMU iCoseg







5.4 Failure cases

Our method works under the assumption that the objects of interest are salient. Yet this assumption may not hold in some cases. Figure 7 illustrates some failure cases of our method for images from the UCSD birds, Oxford flowers and CMU iCoseg datasets. As illustrated, although the bird, flower, and panda regions recur in the image sets, they are not distinct enough from the other regions to be detected as salient. Our method fails to separate them from the backgrounds in such cases.
Figure 7

Failure cases. (a) The input images, (b) the segmentation results, and (c) the ground truth.

6 Conclusion

In this article, we present an iterative energy minimization method along a hierarchical graph for object cosegmentation. Starting from initialization by saliency detection, the method alternates between updating the latent parameters, refining the object segmentation and propagating the appearance distribution among images. Experiments demonstrate its superiority over state-of-the-art methods in terms of accuracy and computation time. We attribute this to the combination of saliency detection, guided filtering and heat sources.

Several issues remain to be explored. Currently, our method works under the assumption that all input images contain the common foreground objects. It is worth exploring the more general case in which the input image set is composed of several groups, each containing its own common foreground objects. In addition, given that our method lends itself to parallelization, the system could be redesigned for implementation on parallel graphics hardware.




This work is supported by the National 863 Program of China under Grant No.2012AA011803, the Specialized Research Fund for the Doctoral Program of Higher Education of China under Grant No.20121102130004 and the Natural Science Foundation of China under Grant No.61170188.

Authors’ Affiliations

State Key Laboratory of Virtual Reality Technology & Systems, Beihang University


  1. Rother C, Kolmogorov V, Minka T, Blake A: Cosegmentation of image pairs by histogram matching. In IEEE Conference on Computer Vision and Pattern Recognition. Washington; 2006:993-1000.
  2. Batra D, Kowdle A, Parikh D: iCoseg: interactive co-segmentation with intelligent scribble guidance. In IEEE Conference on Computer Vision and Pattern Recognition. San Francisco; 2010:3169-3176.
  3. Vicente S, Kolmogorov V, Rother C: Cosegmentation revisited: models and optimization. In European Conference on Computer Vision. Heraklion; 2010:465-479.
  4. Kim G, Xing EP, Fei-Fei L, Kanade T: Distributed cosegmentation via submodular optimization on anisotropic diffusion. In IEEE International Conference on Computer Vision. Barcelona; 2011:169-176.
  5. Russell B, Efros A, Sivic J, Freeman W, Zisserman A: Using multiple segmentations to discover objects and their extent in image collections. In IEEE Conference on Computer Vision and Pattern Recognition. New York; 2006:1605-1614.
  6. Cao L, Fei-Fei L: Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes. In IEEE International Conference on Computer Vision. Rio de Janeiro; 2007:1-8.
  7. Zhao B, Fei-Fei L, Xing EP: Image segmentation with topic random field. In European Conference on Computer Vision. Heraklion; 2010:785-798.
  8. Winn J, Jojic N: LOCUS: learning object classes with unsupervised segmentation. In IEEE International Conference on Computer Vision. Beijing; 2005:756-763.
  9. Alexe B, Deselaers T, Ferrari V: ClassCut for unsupervised class segmentation. In European Conference on Computer Vision. Heraklion; 2010:380-393.
  10. Arora H, Loeff N, Forsyth DA, Ahuja N: Unsupervised segmentation of objects using efficient learning. In IEEE Conference on Computer Vision and Pattern Recognition. Minneapolis; 2007:1-7.
  11. Chen Y, Zhu L, Yuille A, Zhang H: Unsupervised learning of probabilistic object models (POMs) for object classification, segmentation and recognition. In IEEE Conference on Computer Vision and Pattern Recognition. Anchorage; 2008:1-8.
  12. Kolmogorov V, Zabih R: What energy functions can be minimized via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell 2004, 26(2):147-159.
  13. Felzenszwalb P: Efficient belief propagation for early vision. Int J. Comput. Vis 2006, 70: 41-54. 10.1007/s11263-006-7899-4
  14. Levinshtein A, Stere A, Kutulakos KN, Fleet DJ, Dickinson SJ, Siddiqi K: TurboPixels: fast superpixels using geometric flows. IEEE Trans. Pattern Anal. Mach. Intell 2009, 31: 2290-2297.
  15. Grady L: Random walks for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell 2006, 28: 1768-1783.
  16. He K, Sun J, Tang X: Guided image filtering. In European Conference on Computer Vision. Heraklion; 2010:1-14.
  17. Cheng M, Zhang G, Mitra NJ, Huang X, Hu S: Global contrast based salient region detection. In IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs; 2011:409-416.
  18. Chai Y, Lempitsky V, Zisserman A: BiCoS: a bi-level co-segmentation method for image classification. In IEEE International Conference on Computer Vision. Barcelona; 2011:2579-2586.
  19. Joulin A, Bach F, Ponce J: Discriminative clustering for image co-segmentation. In IEEE Conference on Computer Vision and Pattern Recognition. San Francisco; 2010:1943-1950.
  20. Hofmann T: Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn 2001, 43: 177-196.
  21. Shi J, Malik J: Normalized cuts and image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition. San Juan; 1997:731-737.
  22. Mukherjee L, Singh V, Dyer C: Half-integrality based algorithms for cosegmentation of images. In IEEE Conference on Computer Vision and Pattern Recognition. Miami; 2009:2028-2035.
  23. Hochbaum D, Singh V: An efficient algorithm for co-segmentation. In IEEE International Conference on Computer Vision. Kyoto; 2009:269-276.
  24. Alexe B, Deselaers T, Ferrari V: What is an object. In IEEE Conference on Computer Vision and Pattern Recognition. San Francisco; 2010:73-80.
  25. Rother C, Kolmogorov V, Blake A: GrabCut: interactive foreground extraction using iterated graph cuts. ACM Trans Graph 2004, 23(3):309-314. 10.1145/1015706.1015720
  26. Duda R, Hart P, Stork D: Pattern classification. New York: Wiley Press; 2000.


© Li et al.; licensee Springer. 2013

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.