### 4.1 Initialization and local refinement

One major visual characteristic of objects is that they often stand out as salient regions [24]. Based on this characteristic, we apply saliency detection to initially detect foreground regions in each image. Among the various saliency detection methods, we choose a recently proposed histogram-based method [17] for its efficiency and effectiveness. Figure 2b demonstrates the saliency detection result of Figure 2a. We define the initial object likelihoods {L}_{k}^{\ast,0} as the saliency likelihoods.

The segmentation results obtained by thresholding saliency likelihoods often contain holes and ambiguous boundaries. Motivated by interactive segmentation methods, e.g., GrabCut [25], we utilize the inherent color Gaussian mixture model (GMM) of the image to update the object likelihoods. Two GMMs, one for the foreground and one for the background, are estimated in RGB color space. Each GMM is taken to be a full-covariance Gaussian mixture with *M* components. The GMM parameters are defined as {\theta}_{k} = \{{\theta}_{k}^{J} \mid J \in \{B,F\}\}, in which {\theta}_{k}^{J} = \{{\theta}_{m,k}^{J} \mid m = 1,\dots,M\} and {\theta}_{m,k}^{J} = ({\mu}_{m,k}^{J}, {\Sigma}_{m,k}^{J}, {\omega}_{m,k}^{J}). ({\mu}_{m,k}^{F}, {\Sigma}_{m,k}^{F}, {\omega}_{m,k}^{F}) are the mean, covariance, and weight of the foreground components, and ({\mu}_{m,k}^{B}, {\Sigma}_{m,k}^{B}, {\omega}_{m,k}^{B}) those of the background components. The GMM parameters are estimated from the initial likelihoods as follows: (1) given two thresholds {T}_{1} and {T}_{2} satisfying 0 < {T}_{1} < {T}_{2} < 1, we label pixels with {L}_{k}^{\ast,0}(x) > {T}_{2} as foreground and pixels with {L}_{k}^{\ast,0}(x) < {T}_{1} as background; (2) the colors of the foreground and background regions are each clustered into *M* components using *K*-Means [26]; (3) for each component, we statistically acquire its parameters {\theta}_{m,k}^{J}. The object likelihoods built on the GMMs are given by:

p({I}_{k}(x) \mid {\theta}_{k}^{J}) = \underset{m}{\max}\, p({I}_{k}(x) \mid {\theta}_{m,k}^{J})

(4)

p({I}_{k}(x) \mid {\theta}_{m,k}^{J}) = {\omega}_{m,k}^{J} \exp(-\parallel {I}_{k}(x) - {\mu}_{m,k}^{J} \parallel / {\Sigma}_{m,k}^{J}) / \sqrt{|{\Sigma}_{m,k}^{J}|}

(5)
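As a concrete illustration, the estimation steps (1)-(3) and the likelihood evaluation of Equations (4) and (5) can be sketched as follows. This is a minimal sketch rather than the authors' implementation: the function names are hypothetical, a tiny farthest-point-initialized *K*-Means stands in for [26], and Equation (5)'s division of the color distance by the covariance is read here in the Mahalanobis form.

```python
import numpy as np

def estimate_gmm_params(img, sal, t1=0.3, t2=0.7, M=3, iters=10):
    """Steps (1)-(3): threshold the saliency likelihoods, cluster the colors
    of each region with a tiny K-Means, then collect (mu, Sigma, omega)."""
    colors = img.reshape(-1, 3).astype(float)
    s = sal.ravel()
    params = {}
    # foreground: saliency above the larger threshold; background: below the smaller
    for label, mask in (("F", s > t2), ("B", s < t1)):
        X = colors[mask]
        centers = X[[0]]                            # farthest-point initialization
        while len(centers) < M:
            d = ((X[:, None] - centers[None]) ** 2).sum(-1).min(1)
            centers = np.vstack([centers, X[d.argmax()]])
        for _ in range(iters):                      # tiny K-Means
            assign = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
            for m in range(M):
                if (assign == m).any():
                    centers[m] = X[assign == m].mean(0)
        params[label] = [
            (X[assign == m].mean(0),                        # mu
             np.cov(X[assign == m].T) + 1e-6 * np.eye(3),   # full covariance, regularized
             (assign == m).mean())                          # omega (component weight)
            for m in range(M)
        ]
    return params

def gmm_likelihood(c, components):
    """Eqs. (4)-(5): max over components of the weighted Gaussian score."""
    def score(mu, Sigma, omega):
        d = c - mu
        maha = d @ np.linalg.solve(Sigma, d)        # (c - mu)^T Sigma^{-1} (c - mu)
        return omega * np.exp(-maha) / np.sqrt(np.linalg.det(Sigma))
    return max(score(*p) for p in components)
```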

Segmenting objects by directly thresholding the updated object likelihoods results in noise, as shown in Figure 2c. We use guided filtering [16] to remove this noise. The main idea of guided filtering is that, given a filter input *p*, the filter output *q* is locally linear in the guidance map *I*: {q}_{i} = {a}_{x}{I}_{i} + {b}_{x}, \forall i \in {w}_{x}, where {w}_{x} is a window of radius *r* centered at pixel *x*. By minimizing the difference between the filter input *p* and the filter output *q*, i.e., \text{Err}({a}_{x},{b}_{x}) = \sum_{i \in {w}_{x}} ({({p}_{i} - {q}_{i})}^{2} + \epsilon {a}_{x}^{2}), we obtain {a}_{x}, {b}_{x}, and the filter output *q*.
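This minimization has the well-known closed-form solution a_x = cov_w(I,p)/(var_w(I)+ϵ) and b_x = mean_w(p) − a_x mean_w(I). A minimal single-channel sketch, assuming a grayscale guide and using a cumulative-sum box filter for the window means (function names are hypothetical):

```python
import numpy as np

def box_mean(a, r):
    """Mean over a (2r+1) x (2r+1) window, with edge padding."""
    k = 2 * r + 1
    p = np.pad(a, r, mode="edge").astype(float)
    c = np.cumsum(np.cumsum(p, axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))
    s = c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]
    return s / (k * k)

def guided_filter(I, p, r=4, eps=1e-3):
    """q_i = a_x I_i + b_x, with (a_x, b_x) minimizing Err(a_x, b_x):
    a_x = cov_w(I, p) / (var_w(I) + eps), b_x = mean_w(p) - a_x mean_w(I)."""
    mu_I, mu_p = box_mean(I, r), box_mean(p, r)
    cov_Ip = box_mean(I * p, r) - mu_I * mu_p
    var_I = box_mean(I * I, r) - mu_I * mu_I
    a = cov_Ip / (var_I + eps)
    b = mu_p - a * mu_I
    # average the (a, b) of all windows covering each pixel, then apply
    return box_mean(a, r) * I + box_mean(b, r)
```

Note that the output follows the edges of the guide: where the guide is flat, var_w(I) is small and q reduces to a local average of p.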

Based on guided filtering [16], we perform local refinement in three steps: (1) obtain the foreground likelihood map {L}_{k,F} = \{p({I}_{k}(x) \mid {\theta}_{k}^{F})\} and the background likelihood map {L}_{k,B} = \{p({I}_{k}(x) \mid {\theta}_{k}^{B})\}; (2) taking the grayscale image of {I}_{k} as the guidance map, filter the two likelihood maps respectively, denoting the filter outputs as {\widehat{L}}_{k,F} and {\widehat{L}}_{k,B}; (3) define the refined object likelihoods as {L}_{k}^{\ast,1} = {\widehat{L}}_{k,F} / ({\widehat{L}}_{k,F} + {\widehat{L}}_{k,B}). Figure 2d shows the refinement result of Figure 2c. As can be seen, the guided filtering based scheme significantly improves segmentation quality.
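In code, the three refinement steps reduce to filtering each likelihood map and normalizing them against each other. A minimal sketch with hypothetical names, in which a plain 3×3 box mean stands in for the guided filter:

```python
import numpy as np

def mean3(a):
    """Stand-in smoother: 3x3 box mean with edge padding (not a guided filter)."""
    p = np.pad(a, 1, mode="edge")
    h, w = a.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def refine_likelihoods(L_F, L_B, filt):
    """Step (2): filter both likelihood maps; step (3): normalize them into
    the refined object likelihoods L* = L_F / (L_F + L_B)."""
    hat_F, hat_B = filt(L_F), filt(L_B)
    return hat_F / (hat_F + hat_B + 1e-12)   # small constant guards division by zero
```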

### 4.2 Global message transferring

Due to the diversity of realistic scenes, saliency based object segmentation sometimes fails to extract objects of the same class (see Figure 3c). The segmentation quality can be further boosted by sharing appearance similarity among images. Unlike other cosegmentation methods [8, 9, 18], which propagate the distributions of visual appearance at the pixel or superpixel level, we perform message propagation at the heat source level to reduce computation time.

As stated in Section 3, heat sources are the representative superpixels located at the centers of the clustering regions formed by coherent superpixels. The regions are formed by a bottom-up agglomerative clustering scheme. Specifically, given an image *I*, we first partition it into a collection of superpixels via Turbopixels [14] (see Figure 4b, in which superpixels are encircled with red boundaries). Then we build an intra-image graph {G}_{S} = \langle S, {Y}_{S} \rangle, where S = \{{s}_{i}\} is the superpixel set and {Y}_{S} = \{({s}_{i},{s}_{j})\} is the edge set connecting all pairs of adjacent superpixels. The edge weight is defined by the Gaussian similarity between the normalized mean RGB colors of the nodes, i.e., w({s}_{i},{s}_{j}) = \exp(-{\parallel I({s}_{i}) - I({s}_{j}) \parallel}^{2} / {\sigma}_{s}), where {\sigma}_{s} is a variance constant. Based on the graph {G}_{S}, we use a greedy scheme to merge nodes one by one: each time, we select the edge with the maximum weight and merge its two nodes. This step is repeated until all nodes are merged into *N* regions. The central superpixel of each region is chosen as a heat source. Figure 4c demonstrates the clustering regions overlaid with the heat sources, in which the regions are encircled with green boundaries and the heat sources are colored blue.
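The greedy merging scheme above can be sketched with a union-find structure: repeatedly take the strongest remaining edge and fuse its two regions until *N* remain. A sketch under the stated scheme (hypothetical names; the edge weights are assumed precomputed):

```python
def agglomerate(n_nodes, edges, N):
    """Greedy agglomerative clustering: edges is a list of (weight, i, j);
    merge the highest-weight edges until only N regions remain."""
    parent = list(range(n_nodes))

    def find(i):                      # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    regions = n_nodes
    for w, i, j in sorted(edges, reverse=True):   # strongest edge first
        if regions == N:
            break
        ri, rj = find(i), find(j)
        if ri != rj:                  # merge the edge's two regions
            parent[ri] = rj
            regions -= 1
    return [find(i) for i in range(n_nodes)]      # region label per node
```

Selecting edges in decreasing weight order makes this equivalent to cutting a maximum spanning forest into *N* components.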

For message transferring among images, we construct an inter-image graph {G}_{Z} = \langle Z, {Y}_{Z} \rangle. {G}_{Z} is an undirected complete graph, where Z = \{{z}_{i} \mid {z}_{i} \in {Z}_{k}, k = 1,\dots,K\} includes all heat sources from the input images and {Y}_{Z} = \{({z}_{i},{z}_{j})\} connects all pairs of heat sources. We update the object likelihoods of the heat sources by minimizing a standard MRF energy function that is the sum of unary terms {E}_{1}(\cdot) and pairwise terms {E}_{2}(\cdot,\cdot):

E(L(Z)) = \sum_{{z}_{i} \in Z} {E}_{1}({z}_{i}) + \lambda \sum_{({z}_{i},{z}_{j}) \in {Y}_{Z}} {E}_{2}({z}_{i},{z}_{j})

(6)

where \lambda is a weighting value balancing the trade-off between the unary terms and the pairwise terms.

The unary term {E}_{1}(\cdot) imposes individual penalties for assigning any likelihood L({z}_{i}) to the heat source {z}_{i}. We rely on the object likelihoods {L}^{\ast,1} acquired in the previous stage to define this term:

{E}_{1}({z}_{i}) = \left| L({z}_{i}) - \sum_{x \in {z}_{i}} {L}^{\ast,1}(x) / |{z}_{i}| \right|

(7)

The pairwise term {E}_{2}(\cdot,\cdot) defines to what extent adjacent heat sources should agree, and it typically depends on local observations. In our study, the pairwise potential takes the form:

{E}_{2}({z}_{i},{z}_{j}) = w({z}_{i},{z}_{j}) \left| L({z}_{i}) - L({z}_{j}) \right|

(8)

where w({z}_{i},{z}_{j}) is the edge weight, defined as w({z}_{i},{z}_{j}) = \exp(-{\parallel f({z}_{i}) - f({z}_{j}) \parallel}^{2} / {\sigma}_{z}), in which {\sigma}_{z} is a variance constant. f(z) is a nine-dimensional descriptor for the heat source *z*, comprising a three-dimensional mean Lab color feature, a four-dimensional mean texture feature^{a}, and a two-dimensional mean position feature. This definition suggests that the larger the weight of an edge, the more similar the labels of its two nodes should be.

We utilize belief propagation [13] to optimize the energy function in several rounds. The main idea of belief propagation is to iteratively update a set of message maps between neighboring nodes. The message maps, denoted by \{{m}_{{z}_{i} \to {z}_{j}}^{t}(L({z}_{j})) \mid t = 1,\dots,T\}, represent the message transferred from one node to another at each iteration. In our study, the message maps are initially set to zero and updated as follows:

{m}_{{z}_{i} \to {z}_{j}}^{t}(L({z}_{j})) = \underset{L({z}_{i})}{\min} \left( {E}_{1}({z}_{i}) + \lambda {E}_{2}({z}_{i},{z}_{j}) + \sum_{{z}_{k} \in Z \setminus {z}_{j}} {m}_{{z}_{k} \to {z}_{i}}^{t-1}(L({z}_{i})) \right)

(9)

Finally, a belief vector is computed for each node, {b}_{{z}_{i}}(L({z}_{i})) = {E}_{1}({z}_{i}) + \sum_{{z}_{j} \in Z} {m}_{{z}_{j} \to {z}_{i}}^{T}(L({z}_{i})), and the updated object likelihoods are expressed as {L}^{\ast,2}({z}_{i}) = {b}_{{z}_{i}}(0) / ({b}_{{z}_{i}}(0) + {b}_{{z}_{i}}(1)).
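The message update and belief computation can be sketched as min-sum belief propagation over binary likelihoods L ∈ {0, 1} on the complete heat-source graph. This is an illustrative sketch, not the authors' code: the names are hypothetical, the unary target per node is the mean refined likelihood of Eq. (7), and messages are left unnormalized.

```python
import numpy as np

def belief_propagation(unary_target, W, lam=1.0, T=10):
    """Min-sum BP for Eq. (9) with binary labels.
    unary_target[i]: mean L^{*,1} over heat source z_i (target of Eq. 7).
    W: symmetric edge-weight matrix of Eq. (8)."""
    n = len(unary_target)
    # E1[i, L] = |L - target_i| for L in {0, 1}  (Eq. 7)
    E1 = np.abs(np.arange(2)[None, :] - unary_target[:, None])
    m = np.zeros((n, n, 2))                 # m[i, j, L_j]: message from z_i to z_j
    for _ in range(T):
        new = np.zeros_like(m)
        incoming = m.sum(axis=0)            # incoming[i, L_i] = sum_k m[k, i, L_i]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                for Lj in (0, 1):
                    # min over L_i of unary + pairwise + messages from all z_k != z_j
                    new[i, j, Lj] = min(
                        E1[i, Li] + lam * W[i, j] * abs(Li - Lj)
                        + incoming[i, Li] - m[j, i, Li]
                        for Li in (0, 1)
                    )
        m = new
    b = E1 + m.sum(axis=0)                  # belief vector per node
    return b[:, 0] / (b[:, 0] + b[:, 1] + 1e-12)   # L^{*,2}
```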

### 4.3 Local heat energy diffusion

After global message transferring, the object likelihoods of the heat sources preserve appearance similarity among images. We further diffuse them to the other superpixels. As illustrated in the middle level of Figure 1, this is performed by heat energy diffusion within each individual image. The heat energy diffusion can be pictured as follows: place some heat sources on a metal plate, and the heat energy will diffuse to the other points as time goes by, until each point finally reaches a stable temperature. Computing such steady-state temperatures is a well-known Dirichlet energy minimization problem:

{u}^{\ast} = \arg \underset{u}{\min}\, E(u) = \arg \underset{u}{\min} \frac{1}{2} \int_{\Omega} {|\nabla u|}^{2} \, \mathrm{d}\Omega

(10)

Grady [15] states a similar problem in discrete space under the term "random walks". Based on a graph {G}_{X} = \langle X, {Y}_{X} \rangle, where X = \{{x}_{i}\} is the node set and {Y}_{X} = \{({x}_{i},{x}_{j})\} is the set of node pairs, the Dirichlet energy function takes the form:

E(u(X)) = \frac{1}{2} \sum_{({x}_{i},{x}_{j}) \in {Y}_{X}} w({x}_{i},{x}_{j}) {(u({x}_{i}) - u({x}_{j}))}^{2}

(11)

where w({x}_{i},{x}_{j}) is the edge weight for the adjacent node pair ({x}_{i},{x}_{j}).

In our study, the random walks scheme operates on the graph {G}_{k}^{S} = \langle {S}_{k}, {Y}_{k}^{S} \rangle for the image {I}_{k}, where {S}_{k} = \{{s}_{i}\} is the superpixel set and {Y}_{k}^{S} = \{({s}_{i},{s}_{j})\} is the edge set connecting all pairs of adjacent superpixels. The corresponding energy function is:

E(L({S}_{k})) = \frac{1}{2} \sum_{({s}_{i},{s}_{j}) \in {Y}_{k}^{S}} w({s}_{i},{s}_{j}) {(L({s}_{i}) - L({s}_{j}))}^{2} = \frac{1}{2} L{({S}_{k})}^{T} Q L({S}_{k})

(12)

where Q = D - A is the Laplacian matrix, in which A = \{w({s}_{i},{s}_{j})\} is the edge weight matrix and *D* is a diagonal matrix with entries D({s}_{i},{s}_{i}) = \sum_{j} w({s}_{i},{s}_{j}).

We divide the node set {S}_{k} into two parts: the heat sources {Z}_{k} and the remaining superpixels {U}_{k} = {S}_{k} \setminus {Z}_{k}. The energy function can be rewritten as:

E(L({S}_{k})) = \left[ L{({Z}_{k})}^{T}, L{({U}_{k})}^{T} \right] \left[ \begin{array}{cc} {Q}_{{Z}_{k}} & B \\ {B}^{T} & {Q}_{{U}_{k}} \end{array} \right] \left[ \begin{array}{c} L({Z}_{k}) \\ L({U}_{k}) \end{array} \right]

(13)

where {Q}_{{Z}_{k}} and {Q}_{{U}_{k}} are the Laplacian sub-matrices corresponding to the node sets {Z}_{k} and {U}_{k}, respectively.

Minimizing E(L({S}_{k})) amounts to differentiating E(L({S}_{k})) with respect to L({U}_{k}) and setting the derivative to zero, which yields L({U}_{k}) = -{Q}_{{U}_{k}}^{-1} {B}^{T} L({Z}_{k}). L({Z}_{k}) are the object likelihoods acquired in the previous stage, i.e., L({Z}_{k}) = {L}^{\ast,2}({Z}_{k}). The diffused object likelihoods for {U}_{k} are thus obtained by {L}^{\ast,2}({U}_{k}) = -{Q}_{{U}_{k}}^{-1} {B}^{T} {L}^{\ast,2}({Z}_{k}). The nonsingularity of {Q}_{{U}_{k}} guarantees that the solution exists and is unique.
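This diffusion step reduces to one sparse linear solve per image. A minimal dense sketch with hypothetical names, given a superpixel adjacency-weight matrix and the heat-source likelihoods:

```python
import numpy as np

def diffuse(W, source_idx, source_vals):
    """Solve the combinatorial Dirichlet problem: clamp the heat sources Z to
    their likelihoods and recover L(U) = -Q_U^{-1} B^T L(Z) for the rest."""
    n = W.shape[0]
    Q = np.diag(W.sum(1)) - W                 # graph Laplacian Q = D - A
    Z = np.asarray(source_idx)
    U = np.setdiff1d(np.arange(n), Z)
    B = Q[np.ix_(Z, U)]                       # off-diagonal block between Z and U
    Q_U = Q[np.ix_(U, U)]
    L = np.empty(n)
    L[Z] = source_vals
    L[U] = np.linalg.solve(Q_U, -B.T @ np.asarray(source_vals))
    return L
```

On a three-node path with end nodes clamped to 0 and 1, the middle node receives the harmonic value 0.5, matching the steady-state temperature intuition.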

For each pixel *x*, its object likelihood {L}^{\ast,2}(x) is assigned the object likelihood of the superpixel it belongs to. Taking {L}_{k}^{\ast,3}(x) = ({L}_{k}^{\ast,0}(x) + {L}_{k}^{\ast,1}(x) + {L}_{k}^{\ast,2}(x)) / 3 as input, we further invoke local refinement (see Section 4.1) to optimize the object segmentation. Figure 3 demonstrates the segmentation results obtained before and after heat energy diffusion. As can be seen, although the saliency based initialization stage sometimes fails to extract the foreground objects, the message transferring and heat energy diffusion stages can boost segmentation quality by sharing the visual similarity of objects among images.