Contextual Classification of Image Patches with Latent Aspect Models
EURASIP Journal on Image and Video Processing volume 2009, Article number: 602920 (2009)
We present a novel approach for contextual classification of image patches in complex visual scenes, based on the use of histograms of quantized features and probabilistic aspect models. Our approach uses context in two ways: (1) by using the fact that specific learned aspects correlate with the semantic classes, which resolves some cases of visual polysemy often present in patch-based representations, and (2) by formalizing the notion that scene context is image-specific—what an individual patch represents depends on what the rest of the patches in the same image are. We demonstrate the validity of our approach on a man-made versus natural patch classification problem. Experiments on an image collection of complex scenes show that the proposed approach improves region discrimination, producing satisfactory results and outperforming two noncontextual methods. Furthermore, we also show that co-occurrence and traditional (Markov random field) spatial contextual information can be conveniently integrated for further improved patch classification.
Associating semantic class labels to image regions is a fundamental task in computer vision, useful in itself for image and video indexing and retrieval, and as an intermediate step for higher-level scene analysis [1–3]. While many image area classification approaches segment an image using all pixels  or by predefining a block-based image grid [1, 3], in this work we consider local image patches characterized by viewpoint invariant descriptors . This image representation, based on patches, robust with respect to partial occlusion, clutter, and changes in viewpoint and illumination, has shown its applicability in a number of vision tasks [2, 6–9]. Local invariant regions do not cover the complete image, but they often occupy a considerable part of the scene and divide most of the scene into patches of salient content (Figure 1).
In general, the constituent parts of a scene do not exist in isolation, and the visual context—the spatial dependencies between scene parts—can be used to improve region classification [1, 10–12]. Two image regions, indistinguishable from each other when analyzed independently, might be discriminated as belonging to the correct class with the help of context knowledge. Broadly speaking, there exists a continuum of contextual models for image region classification. On one end, one would find explicit models like Markov random fields (MRFs), where spatial constraints are defined via local statistical dependencies between class region labels [10, 13], and between observations and labels . The other end would correspond to context-free models, where regions are classified assuming statistical independence between the region labels, and using only local observations [3, 6].
Lying between these two extremes, a type of scene representation of increasing use is the histogram of quantized image patches, referred to as bag-of-visterms [14, 15], bag-of-keypoints , bag-of-features , or bag-of-codewords [7, 18] in the literature. This representation is obtained by sampling local regions in an image and quantizing them into a finite set of patches according to their visual appearance, storing the patch occurrence in the image in the form of a histogram. On one hand, unlike explicit contextual models, spatial neighboring relations in this representation are discarded, and any ordering between the image regions disappears. On the other hand, unlike point-wise models, although the image regions are still local, the scene is represented collectively. This can explain why, despite the loss of strong spatial contextual information, this type of representation has been successfully used in a number of problems, including object matching , object categorization [9, 20], scene classification [7, 8, 21], and scene retrieval .
As a collection of discrete data, the histogram of patches is suitable for probabilistic models that capture a different form of context which is implicitly captured through patch co-occurrence. These models, originally designed for text collections (documents composed of terms), use discrete hidden aspect variables to model the co-occurrence of terms within and across documents. Examples include probabilistic latent semantic analysis (PLSA)  and latent Dirichlet allocation (LDA) . We have recently shown that the combination of PLSA and histogram of quantized invariant local descriptors can be successfully used for global scene classification [8, 14]. Given an unlabeled image set, PLSA captures aspects that represent the class structure of the collection, and provides a low-dimensional representation useful for classification. Similar conclusions with an LDA-related model were reached in .
In this paper, we address the problem of classifying image regions into semantic classes (see Figure 1) based on their associated patch number. (Throughout this paper, the term patch will mainly be used to denote an image region, and sometimes to denote the discrete index obtained from quantizing a local image descriptor of the patch; in case of ambiguity, we will use the term quantized patch or patch number to denote the latter.) The main challenge for this task is that patches are not class-specific. As shown in Figure 2, image regions quantized into the same patch can appear in both man-made and natural views. This situation, although expected since quantized patch construction does not make use of class label information, constitutes a problematic form of visual polysemy. In this paper, we propose to take advantage of the context in which each patch appears, characterized by the patch histogram itself, to improve the classification of the corresponding image regions. Our contributions can be summarized as follows.
We show that the above-mentioned aspect models can be directly applied to patch classification, since specific aspects, although learned without class information, correlate with the classes of interest. These aspects can be easily labeled by hand or using a labeled image dataset, and used to classify their most likely patches accordingly.
The interpretation of a particular patch depends on what the other patches in the same image are, and this co-occurrence context is precisely captured by the estimated aspect mixture weights. We propose to formally include this contextual information in a new aspect model, so that even though patches appear in multiple classes, the information about the other patches in the same image can be used to improve discrimination (Figure 2).
We present results on a man-made versus natural image region classification task, and show that the contextual information learned from co-occurrence improves the performance compared to a non-contextual approach. In our view, the proposed approach constitutes an interesting way to model visual context that could be applicable to other problems in computer vision.
We show, through the use of a Markov random field model, that standard spatial context can be integrated, resulting in an improvement of the final classification of image regions.
This paper is organized as follows. Section 2 reviews the closest related work. Section 3 presents our approach to local image patch classification. Section 4 introduces the image representation. Section 5 introduces the concept of an image as a mixture of latent aspects extended in Section 6 for contextual local patch classification. Section 7 discusses the two baseline models. Section 9 reports our results. Section 10 concludes the paper.
2. Related Work
Image region classification is a research field that has been developed for many years. Generally speaking, there are two main approach directions to the problem: classic pixel-based image segmentation and image region classification.
Classic image segmentation is defined as a process of partitioning the image into nonintersecting regions, such that each region is homogeneous and no union of two adjacent regions is homogeneous . The main issue is defining the property by which homogeneity is imposed. In most cases, the properties on which segmentation is based are gray-scale, color, texture, or a combination of those properties. Image segmentation defined this way is performed on each image independently. A review of traditional segmentation approaches is given in . Many more alternatives have been proposed. For instance, Carson et al.  present a blob-based segmentation method that models the color, texture, and position of all the pixels in a given image with a Gaussian mixture model (GMM), and attribute the label of its most likely GMM component to each pixel. This creates roughly homogeneous image regions called blobs, which are used for image retrieval, allowing the user to query the database at the blob level instead of the image level.
We consider the perspective on image region classification which is based on automatically defined patches. As we will show, this allows the regional classification of images based on class labels that are predefined and applicable to the whole database, and not based on a homogeneity criterion of the regions in an image. The region descriptors are classified into categories, and the density of the region class labels gives a regional classification of the image. In what follows, we present a selection of regional image classification models based on class labels, with regions that cover the whole image [1, 3, 26–28] or only a part of it [2, 6, 9].
The work in  relies on the normalized cuts segmentation algorithm  to segment the image into regions that are then quantized. Derived from the machine translation literature, an expectation-maximization (EM) algorithm estimates the probability distributions linking a set of words and blobs. Once the model parameters are learned, words are attached to each region. This region naming process is comparable to image segmentation.
Extending the MRF model, Kumar and Hebert proposed a discriminative random field (DRF) model that includes neighborhood interactions in the class labels, as well as at the observation level. They apply the DRF model to the segmentation of man-made structures in natural scenes , with an extraction of image features based on a grid of blocks that fully covers the image. The DRF model is trained on a set of manually segmented images, and then used to infer the segmentation into the two target classes.
Using a similar grid layout, Vogel and Schiele presented a two-stage classification framework to perform scene retrieval  and scene classification . This work performs an implicit scene segmentation as an intermediate step, classifying each image block into a set of semantic classes such as grass, rocks, or foliage.
To include global shape prior information in an MRF-based model formulation, Kumar et al. proposed an MRF part-based segmentation model, referred to as ObjCut, which represents objects by means of segmented parts . This requires the explicit encoding of the spatial information relating parts and also the modeling of their deformations. The use of regions in this case reduces the invariance to occlusion, and the modeling has a high computational cost. Furthermore, the object to model must be composed of discriminative parts with known spatial relationships, which is not the case for scenes.
In , invariant local descriptors are used for an object detection task. All region descriptors in the training set are modeled with a Gaussian mixture model (GMM). A subset of the mixture components is then selected based on their estimated class likelihood ratio or mutual information, which are then used to classify new regions based on their local descriptors. In this non-contextual approach, new descriptors are independently classified into object or background regions, without taking the other descriptors in the same image into consideration. A similar approach introducing spatial contextual information through neighborhood statistics of the GMM components collected on training images is proposed in , where the learned-prior statistics are used for relaxation of the original region classification.
Leibe et al. proposed an implicit object model based on local invariant descriptors that jointly learns the discriminant descriptors for an object and their spatial relationships . Once again, this approach implies an existing spatial layout of the object parts which does not exist in the case of scenes.
As an extension to the local descriptor representation of images, probabilistic aspect models have been recently proposed to capture descriptor co-occurrence information with the use of a hidden variable (latent aspect). The work in  proposed a hierarchical Bayesian model that extended LDA for global categorization of natural scenes. This work showed that important patches for a class in an image can be found. However, the problem of local image patch classification was not addressed. The combination of local descriptors and PLSA for local patch classification has been illustrated in . However, this work has two limitations. First, patches were classified into aspects, not classes, unless we assume as in  that there is a direct correspondence between aspects and semantic classes. This seems, however, an oversimplistic assumption in general. Second, evaluation was limited; for example,  does not conduct any objective performance evaluation.
To model both the object and the scene in an image, Russell et al.  proposed to use regions resulting from multiple unsupervised image segmentations to represent an image as an aggregate of sub-images. These sub-images are represented with bag-of-visterms and modeled with a latent aspect model. Starting from multiple image segmentations to maximize the chance that some segmented regions will correspond to actual objects is an interesting approach. There is, however, no guarantee that this will be true in general, and we therefore model images at the scale of patches in our work to ensure that no initial segmentation step will harm the image representation.
A preliminary version of our work first appeared in . Inspired by our work, Verbeek and Triggs proposed the extension of aspect modeling by integrating spatial models . The proposed approach introduces spatial coherence to the aspect model improving segmentation. However, the training of the latent aspect becomes limited to using labeled data, losing the possibility of learning visual co-occurrence from unlabeled data.
Unlike previous approaches, we propose a formal way to integrate the latent aspect modeling, learned in an unsupervised way from unlabeled data, with the class information, and conduct a proper performance evaluation, validating our work with a comparison to a state-of-the-art baseline method. In addition, we explore the integration of the more traditional spatial MRF model into our system and compare the obtained results.
In the final stage of preparing this manuscript, new models were put forward to segment images by combining latent aspect models with quantized local patches. Cao and Fei-Fei presented a latent aspect model that assumes that each region of an image, obtained with an unsupervised segmentation algorithm in a first step, is generated from a single aspect . Regions are not modeled as separate documents, but as building parts of a given image which is itself defined by a mixture of aspects, contrary to . Liu and Chen proposed to explicitly combine a latent aspect model with a known supervised segmentation algorithm . The segmentation algorithm and the aspect models are linked through a new variable that distinguishes foreground from background patches. This variable is successively obtained from the segmentation algorithm and then considered as an observed variable in the aspect model. A new segmentation is obtained when the aspect model is learned and this process iterates until the final segmentation is obtained.
3. Scene Patch Classification
The aspect models that we present in this paper allow image regions to be classified into two classes, based on an estimated patch class likelihood that takes advantage of the availability of a patch histogram. The method can be applied to image regions defined randomly, by a regular grid (with or without overlap), or obtained with an interest point/region detector. Depending on what the considered image regions are, the resulting spatial distribution of class labels can produce local image classification with no label overlap (e.g., when using grid patches) [1, 3, 27], or a density-based image patch classification (when using interest point detectors) [2, 6]. In the latter case, as shown in Figure 1, the classification of patches obtained by an interest point detector produces a sparse regional image classification. However, one advantage of using an interest point detector is that the identification of stable regions may exhibit better correspondence across the images than an arbitrary grid image division. In this paper, we decided to rely on an interest point detector to sample specific types of image regions to be classified, but the technique can be applied to any other form of region selection scheme.
As shown in Figure 3, our approach relies on the quantization of local region descriptors into a fixed number of patches using the K-means clustering algorithm. Compared to [2, 6], this quantization step simplifies the image representation from an undefined number of region descriptors per image to a histogram of patch labels. In addition, it allows the patch co-occurrence context of an image to be defined as a simple histogram, which can be further analyzed with an aspect model formulation. The patch histogram representation is discussed in detail in Section 4.
Classification Principle: Likelihood Ratio
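We rely on likelihood ratio computation to classify each patch $v$ of a given image into a class $c$. The ratio is defined by
$$LR(v) = \frac{P(v \mid c = \text{man-made})}{P(v \mid c = \text{natural})}, \tag{1}$$
where the probabilities will be estimated using different models of the data, as described in Section 6, and the classification rule is
$$c = \begin{cases} \text{man-made} & \text{if } LR(v) > T, \\ \text{natural} & \text{otherwise}, \end{cases} \tag{2}$$
where $T$ is a threshold value. Thus, all image regions associated with the same patch will be classified in the same category according to the rule in (2). Note that, alternatively, we could have considered, as a classification rule, a ratio based on $P(c \mid v)$. By Bayes' rule, this posterior ratio equals $LR(v)$ times the ratio of the class priors, so the only difference with respect to using $P(v \mid c)$ is to multiply the threshold value by the constant $P(c = \text{man-made}) / P(c = \text{natural})$.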
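As an illustration of the thresholded likelihood-ratio rule described above, the following minimal Python sketch classifies a few quantized patches. The function names, the small `eps` guard against division by zero, and the likelihood values are hypothetical and chosen only for the example; they are not taken from the experiments.

```python
def likelihood_ratio(p_v_given_manmade, p_v_given_natural, eps=1e-12):
    """Ratio of the two class-conditional patch likelihoods, as in (1)."""
    # eps is an assumption of this sketch: it avoids dividing by zero.
    return p_v_given_manmade / max(p_v_given_natural, eps)


def classify_patch(p_v_given_manmade, p_v_given_natural, threshold=1.0):
    """Threshold rule (2): man-made if the ratio exceeds the threshold."""
    ratio = likelihood_ratio(p_v_given_manmade, p_v_given_natural)
    return "man-made" if ratio > threshold else "natural"


# Hypothetical class-conditional likelihoods for three patch indices:
# patch index -> (P(v | man-made), P(v | natural))
likelihoods = {14: (0.020, 0.004), 157: (0.010, 0.010), 661: (0.002, 0.015)}
labels = {v: classify_patch(pm, pn) for v, (pm, pn) in likelihoods.items()}
```

Because the decision depends only on the patch index, every image region quantized to the same patch receives the same label under this rule; lowering the threshold shifts ambiguous patches toward the man-made class.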
4. Image Representation
In what follows, we describe and further justify the four steps that we take to build our image representation: (i) detection of interest points/patches, (ii) computation of local descriptors, (iii) local descriptor quantization, and (iv) construction of the patch histogram.
4.1. Detection of Interest Points
The goal of the interest point detector is to automatically extract characteristic points from a given image, which are invariant to some geometric and photometric transformations. These points define image regions which are also invariant to the same transformations. Invariance is an important property since it ensures that given an image and its transformed version, equivalent image patches will be extracted from both, and the resulting image representation will be the same (within a certain estimation error).
Different point detectors have been proposed to extract regions of interest in images [5, 36]. They vary mostly by the amount of invariance they theoretically ensure, the image property they exploit to achieve invariance, and the type of image structures they are designed to detect. However, the increase in invariance also means that different points can become more similar after invariance regularization. In this way, we must also restrain invariance since a big increase in the degree of invariance may remove information about the local image content which is valuable for classification.
In this work, we use the difference of Gaussians (DOG) point detector . This detector essentially identifies blob-like regions where a maximum or minimum of intensity occurs in the image, and it is invariant to translation, scale, rotation, and constant illumination variations. We chose this detector since it was shown to perform well in previously published comparison studies [37, 38], and also since we found it to be a good choice in practice for the task at hand, performing competitively compared to other detectors . The DOG detector is also faster than similarly performing, fully affine-invariant ones .
4.2. Computation of Local Descriptors
Local descriptors are computed over the image region defined by each interest point which is automatically identified by the local interest point detector. These descriptors characterize the image content of each region in a compact way. In this work, we use the scale invariant feature transform (SIFT) feature as local descriptors . This choice was motivated by several publications [7, 37], where SIFT was found to work best. This descriptor is based on the gray-scale gradient information of images, and was shown to perform best in terms of specificity of region representation and robustness to image transformations . SIFT features are local histograms of edge directions computed over different parts of the region of interest, capturing the structure of the local image patch. In , it was shown that the use of 8 orientation directions and a $4 \times 4$ grid of parts gives a good compromise between descriptor size and accuracy of representation (see Figure 4), which yields a feature vector of size $8 \times 4 \times 4 = 128$. Orientation invariance is achieved by estimating the dominant orientation of the local image patch using the orientation histogram of the keypoint region. All direction computations in the elaboration of the SIFT feature vector are then done with respect to this dominant orientation.
4.3. Local Descriptor Quantization
After the interest point detection and the computation of descriptors, an image is represented as a set of SIFT features characterizing the gray-scale texture of its regions of interest. We propose to quantize the descriptors to obtain a fixed size, compact representation of the image. A vocabulary of quantized descriptors (referred to as patches in this paper) is constructed by learning a K-means model from a set of local descriptors extracted from the training images, keeping the estimated means as patches. New local descriptors are mapped to the closest patch in the vocabulary according to the nearest neighbor rule:
$$Q(s) = v_i \iff \operatorname{dist}(s, v_i) \le \operatorname{dist}(s, v_j) \quad \forall j \in \{1, \dots, N_V\}, \tag{3}$$
where $s$ is a local descriptor, $v_i$ is the $i$th patch of the vocabulary, and $N_V$ denotes the size of the patch set. We use the Euclidean distance in the clustering (and in (3)) and choose the number of clusters depending on the desired vocabulary size. The choice of the Euclidean distance to compare SIFT features is common .
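The nearest neighbor assignment described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the vocabulary is given directly rather than learned with K-means, the descriptors are toy 2-D points rather than 128-dimensional SIFT vectors, and the helper names `euclidean` and `quantize` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def quantize(descriptor, vocabulary):
    """Return the index of the vocabulary entry closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: euclidean(descriptor, vocabulary[i]))


# Toy vocabulary of three cluster centers (stand-ins for K-means means).
vocabulary = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
descriptors = [(0.1, 0.2), (0.9, 1.2), (3.5, 0.4)]
indices = [quantize(d, vocabulary) for d in descriptors]
```

Once a descriptor is mapped to an index, its distance to the cluster center is discarded, which is why two descriptors assigned to different patches are treated as distinct regardless of how close they are in feature space.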
Technically, the quantization of similar local descriptors into a single patch can be thought of as being similar to the stemming preprocessing step of text documents, which consists of replacing all words by their stem. The rationale behind stemming is that the meaning of words is carried by their stem rather than by their morphological variations . The same motivation applies to the quantization of descriptors into patches.
Furthermore, local descriptors will be considered as distinct whenever they are mapped to different patches, regardless of whether they are close or not in the SIFT feature space. This also resembles the text modeling approach which considers that all information is in the stems, and that any distance defined over their representation (e.g., strings in the case of text) carries no semantic meaning.
Figure 5 shows some examples of clusters of the SIFT descriptors. All of the examples of each cluster get the same label, and so get represented by the same patch. The patch number 157 represents a step function that might not be very specific to any of the man-made or natural image regions. On the contrary, the patches 240 and 14 represent cornered/squared structures that should mostly occur in man-made structures. Similarly, the samples from the patch 661 contain high frequencies that seem most likely to occur in natural structures.
4.4. Patch Histogram
After the feature quantization step, the image is reduced to a set of patches taken from a fixed size patch vocabulary, which can be encoded as a patch histogram according to
$$h(d) = \big(n(d, v_1), \dots, n(d, v_{N_V})\big), \tag{4}$$
where $n(d, v_i)$ denotes the number of occurrences of patch $v_i$ in image $d$, and $N_V$ is the size of the patch vocabulary. The construction of the patch histogram is illustrated in Figure 6. The patch histogram contains no information about spatial relationship between patches, similar to the bag-of-words text representation: even though word ordering contains a significant amount of information about the original data, it is completely removed from the final document representation.
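A minimal sketch of the histogram construction, assuming the patch indices of one image have already been computed by the quantization step; the function name `patch_histogram` and the toy indices are illustrative only.

```python
from collections import Counter


def patch_histogram(patch_indices, vocab_size):
    """Count how often each vocabulary entry occurs in one image."""
    counts = Counter(patch_indices)
    # Fixed-length vector: one bin per vocabulary entry, zero if absent.
    return [counts.get(i, 0) for i in range(vocab_size)]


# Five detected regions in a toy image, quantized to a vocabulary of size 5.
hist = patch_histogram([2, 0, 2, 3, 2], vocab_size=5)
```

Any permutation of the input indices yields the same histogram, which makes the loss of spatial ordering explicit.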
5. Scenes as Mixtures of Aspects
The concept of aspect models for images has been recently applied to scene [8, 15, 21] and object [40, 41] categorization tasks, using the estimated distribution over aspects as a feature extraction process, or directly as a classifier. Under the assumption of an aspect model, an image can be seen as a mixture of unobserved (latent) aspects that are defined by consistent co-occurrences of image patches (or their features) within the image collection. A latent aspect $z$ is thus represented by its conditional distribution over patches $P(v \mid z)$, and an image $d$ is represented by the conditional distribution over aspects $P(z \mid d)$.
5.1. Scene Modeling with PLSA
Several latent aspect models, such as PLSA , LDA , and multinomial PCA (MPCA) , have been proposed in the literature for discrete components analysis. In this work, we consider the PLSA model , which assumes each occurrence of the patch $v$ to be independent from the image $d$ it belongs to given the latent variable $z$, and corresponds to the joint probability expressed by
$$P(v, z, d) = P(d)\, P(z \mid d)\, P(v \mid z). \tag{5}$$
The joint probability of the observed variables is the marginalization over the latent aspects as expressed by
$$P(v, d) = P(d) \sum_{z} P(z \mid d)\, P(v \mid z). \tag{6}$$
The multinomial distributions $P(v \mid z)$ and $P(z \mid d)$ are estimated with an EM algorithm on a set of training documents. As an illustration, Figure 7 shows the distribution over aspects for two images, for an aspect model trained on a collection of 6600 landscape and city images. The conditional distributions of patches given the aspects are represented on the right column of Figure 7, representing an aspect by its specific patch co-occurrence pattern. We see in Figure 7 that the patch histogram representations of the two images are modeled by two dissimilar distributions over aspects, reflecting their differences in content. The two images are composed of different patch co-occurrences that exist in the image collection, resulting in different image-dependent contexts.
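To make the estimation step concrete, here is a minimal pure-Python sketch of the standard PLSA EM updates on a small image-by-patch count matrix. It is an illustration under assumptions, not the implementation used in the experiments: the random initialization, the fixed number of iterations, the toy counts, and the names `plsa` and `normalize` are all hypothetical.

```python
import random


def normalize(row):
    """Scale a non-negative vector to sum to one (uniform if all zero)."""
    total = sum(row)
    if total <= 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def plsa(counts, n_aspects, n_iter=50, seed=0):
    """Estimate P(v|z) and P(z|d) from counts[d][v] with EM."""
    rng = random.Random(seed)
    n_docs, n_patches = len(counts), len(counts[0])
    p_v_z = [normalize([rng.random() + 1e-3 for _ in range(n_patches)])
             for _ in range(n_aspects)]
    p_z_d = [normalize([rng.random() + 1e-3 for _ in range(n_aspects)])
             for _ in range(n_docs)]
    for _ in range(n_iter):
        acc_v_z = [[0.0] * n_patches for _ in range(n_aspects)]
        acc_z_d = [[0.0] * n_aspects for _ in range(n_docs)]
        for d in range(n_docs):
            for v in range(n_patches):
                if counts[d][v] == 0:
                    continue
                # E-step: P(z | d, v) is proportional to P(z|d) * P(v|z).
                post = normalize([p_z_d[d][z] * p_v_z[z][v]
                                  for z in range(n_aspects)])
                # Accumulate expected counts for the M-step.
                for z in range(n_aspects):
                    weight = counts[d][v] * post[z]
                    acc_v_z[z][v] += weight
                    acc_z_d[d][z] += weight
        # M-step: renormalize the expected counts.
        p_v_z = [normalize(row) for row in acc_v_z]
        p_z_d = [normalize(row) for row in acc_z_d]
    return p_v_z, p_z_d


# Toy collection: 4 images, vocabulary of 4 patches, 2 aspects.
counts = [[5, 4, 0, 0], [4, 6, 0, 0], [0, 0, 5, 5], [0, 1, 4, 6]]
p_v_z, p_z_d = plsa(counts, n_aspects=2)
```

The class labels play no role in these updates, so the aspects can be learned from unlabeled images; EM only reaches a local optimum, so different seeds may give different aspects.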
The aspect indices have no intrinsic relevance to a specific class, given the unsupervised nature of the PLSA model learning. We can, however, inspect each aspect to observe the meaning that it may have in terms of our target classes. Aspects can be conveniently illustrated by their most probable images in a dataset. Given an aspect $z$, images can be ranked according to
$$P(d \mid z) = \frac{P(z \mid d)\, P(d)}{P(z)} \propto P(z \mid d), \tag{7}$$
where $P(d)$ is considered as uniform. Figure 8 displays the 10 best-ranked images for a given aspect to illustrate its potential "semantic meaning." The top-ranked images representing aspects 55 and 22 all clearly belong to the natural class, while the top-ranked images for aspect , and 37 contain a large majority of man-made structures. Aspect 12 seems to be mainly related to horizon/panoramic scenes, and contains landscape images only (top 10 images). However, as aspects are identified by analyzing the co-occurrence of visual patterns within local patches, they may be consistent from this point of view without allowing for a direct semantic interpretation, as shown in Figure 8 for aspect 45.
To further confirm the connection between the learned aspects and the target classes, we can objectively measure their relationship by defining the Precision and Recall paired values with respect to a given label at rank $r$ by
$$\text{Precision}(r) = \frac{\text{RelRet}}{\text{Ret}}, \qquad \text{Recall}(r) = \frac{\text{RelRet}}{\text{Rel}}, \tag{8}$$
where Ret is the number of retrieved images, Rel is the total number of relevant images, and RelRet is the number of retrieved images that are relevant. Note here that for this experiment, we assume that images are only associated with one class label although they may contain some content (and patches) belonging to the other class. The precision/recall curves associated with each aspect-based image ranking considering either the natural or the man-made queries are shown in Figure 9. Those curves show that some aspects are clearly related to the two classes, and confirm the observations made previously with respect to the aspect correspondences. As expected, aspect 45 does not appear in either the man-made or the natural top precision/recall curves. The natural related ranking of aspect 12 does not hold as clearly for higher recall values because the pattern of patch co-occurrences appearing in horizons that it captures is not exclusive to the natural class.
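The precision and recall values at a given rank can be computed as in the following sketch, where an aspect-based ranking is represented as a list of booleans marking whether each ranked image is relevant to the query class. The function name and the toy ranking are hypothetical.

```python
def precision_recall_at_rank(ranked_relevance, rank, total_relevant):
    """Precision and recall after retrieving the top `rank` images."""
    retrieved = ranked_relevance[:rank]   # Ret = len(retrieved)
    rel_ret = sum(retrieved)              # RelRet: relevant among retrieved
    precision = rel_ret / len(retrieved)
    recall = rel_ret / total_relevant     # Rel = total_relevant
    return precision, recall


# Toy ranking of five images; three images in the collection are relevant.
ranking = [True, True, False, True, False]
precision, recall = precision_recall_at_rank(ranking, rank=4, total_relevant=3)
```

Sweeping `rank` from 1 to the collection size traces out one precision/recall curve per aspect.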
5.2. Mapping Aspects to Local Image Patches
As we have shown, images can be modeled as mixtures of aspects, and some aspects correlate with the man-made or the natural classes. The conditional distribution of patches given an aspect could be exploited for the classification of image regions in an image (given their patch label), provided that a class label is attached to the aspects. Based on the learned conditional distributions of patches given aspects, the most likely aspect $z_v$ can be attributed to a given patch $v$ according to
$$z_v = \arg\max_{z} P(z \mid v) = \arg\max_{z} \frac{P(v \mid z)\, P(z)}{\sum_{z'} P(v \mid z')\, P(z')} = \arg\max_{z} P(v \mid z), \tag{9}$$
where we have assumed that the distribution over the latent aspects $P(z)$ is uniform. In Figure 10, we show two examples of image region classification based on the concept of mixture of aspects. Based on the average precision (AP) measure of the ranking illustrated in Figure 9, we first select the ten aspects that are the most closely related to the man-made class and the ten aspects that are the most closely related to the natural class. Restricting the aspect attribution to these 20 man-made and natural aspects, each patch can be independently classified as a man-made or a natural descriptor based on (9). These two examples show a reasonable match between the ground-truth patch classification and the density of red and green points. The unsupervised learning based on co-occurrence thus makes it possible to identify man-made and natural latent aspects in the data that can be later used to classify patches (and their corresponding image regions) into these two categories.
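The aspect attribution restricted to labeled aspects can be sketched as follows. The probability table, the aspect labels, and the function name are hypothetical toy values; with a uniform prior over aspects, picking the most likely aspect for a patch reduces to comparing the patch probability under each aspect.

```python
def most_likely_aspect(p_v_given_z, patch, allowed_aspects):
    """Aspect with the highest P(v|z) for this patch, among allowed aspects."""
    return max(allowed_aspects, key=lambda z: p_v_given_z[z][patch])


# p_v_given_z[z][v]: toy distributions over 3 patches for 3 aspects.
p_v_given_z = [
    [0.7, 0.2, 0.1],  # aspect 0, hand-labeled man-made
    [0.1, 0.3, 0.6],  # aspect 1, hand-labeled natural
    [0.3, 0.5, 0.2],  # aspect 2, no clear class, excluded from attribution
]
aspect_label = {0: "man-made", 1: "natural"}

labels = [
    aspect_label[most_likely_aspect(p_v_given_z, v, list(aspect_label))]
    for v in range(3)
]
```

Patch 1 is most probable under the unlabeled aspect 2, but because attribution is restricted to the labeled aspects it is still assigned a class, here the one whose aspect gives it the higher probability.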
Based on this idea, we present two aspect models that extend PLSA model  for image patch classification in Section 6.
6. Aspect Models for Patch Classification
As introduced in Section 3, our goal is to classify image regions based on the estimated class likelihood ratio of their corresponding patches, as described in (1). In what follows, we propose two aspect models that estimate patch class-likelihoods based on the decomposition of scenes in a mixture of aspects. The observed data is composed of patch, document, and class triplets $(v, d, c)$ for each patch occurrence in a labeled training set.
The first aspect model classifies patches independently of the image they belong to and can thus be seen as a probabilistic formulation of the idea presented at the end of Section 5, where the assumption was that an aspect could only be associated with one class (i.e., $P(c \mid z) = 0$ or 1). The second model takes full advantage of the patch histogram context, and makes it possible to estimate patch class-likelihoods that depend on the image that is considered.
6.1. Aspect Model 1
The first model associates a hidden variable with each observation leading to the joint probability defined by
This model introduces two conditional independence assumptions. The first one, traditionally encountered in aspects models, is that the occurrence of a patch is independent of the image it belongs to, given an aspect . The second assumption is that the occurrence of aspects is independent of the class the patch belongs to, that is, . Note that in (10), the class label refers to the class of one patch. Thus, different class labels can be associated with a given document, and the term reflects the degree to which an image indirectly belongs to a given class given its patches. The parameters of this model are learned using the maximum likelihood (ML) principle . The optimization is conducted using the expectation-maximization (EM) algorithm, allowing us to learn the aspect distributions and the mixture parameters .
Notice that, given our model, the EM equations do not depend on the patch class label. Besides, the estimation of the class-conditional probabilities does not require the use of the EM algorithm. We exploit these points to train the aspect models on a large dataset of which only a small subset has been manually labeled at the image level. This labeling at the image level makes it possible to quickly annotate a large number of patches as man-made or natural, but does not imply that images have a single class in general. We assume that patches have a class label.
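Regarding the class-conditional probabilities $P(d \mid c)$ of an image $d$ given a class $c$, as the labeled set is only composed of man-made-only or natural-only images, we simply estimate them according to
$$P(d \mid c) = \begin{cases} 1/N_c & \text{if } d \text{ belongs to class } c, \\ 0 & \text{otherwise,} \end{cases} \tag{11}$$
where $N_c$ denotes the number of images belonging to class $c$ in the labeled set $D_{\mathrm{lab}}$. Given this model, the likelihood of a patch $v$ that we are looking for (cf. (1)) can be expressed as
$$P(v \mid c) = \sum_{k} P(v \mid z_k)\, P(z_k \mid c), \tag{12}$$
where the sum runs over the latent aspects $z_k$ and the conditional probabilities $P(z_k \mid c)$ can in turn be estimated through marginalization over labeled documents,
$$P(z_k \mid c) = \sum_{d \in D_{\mathrm{lab}}} P(z_k \mid d)\, P(d \mid c). \tag{13}$$
These equations allow us to estimate the likelihood ratio as defined by (1). Note that this model extends PLSA by introducing the class variable $c$.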
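As a rough illustration of how the class-dependent patch likelihoods of aspect model 1 follow from already learned aspect parameters, the sketch below chains the three steps: a uniform distribution over the labeled images of each class, marginalization over labeled images to obtain the probability of each aspect given a class, and marginalization over aspects to obtain the probability of each patch given a class. All numerical values and function names are hypothetical.

```python
def class_conditional_docs(doc_labels, classes=("man-made", "natural")):
    """P(d|c): uniform over the labeled images of class c, zero elsewhere."""
    n_c = {c: doc_labels.count(c) for c in classes}
    return {c: [1.0 / n_c[c] if lab == c else 0.0 for lab in doc_labels]
            for c in classes}

def aspect_given_class(p_z_d, p_d_c):
    """P(z|c) = sum over labeled images d of P(z|d) P(d|c)."""
    n_aspects = len(p_z_d[0])
    return {c: [sum(p_z_d[d][k] * w for d, w in enumerate(weights))
                for k in range(n_aspects)]
            for c, weights in p_d_c.items()}

def patch_given_class(p_v_z, p_z_c):
    """P(v|c) = sum over aspects k of P(v|z_k) P(z_k|c)."""
    n_patches = len(p_v_z[0])
    return {c: [sum(p_v_z[k][v] * pz for k, pz in enumerate(p_z))
                for v in range(n_patches)]
            for c, p_z in p_z_c.items()}

# Hypothetical learned parameters: 2 aspects, 3 patches, 4 labeled images.
p_v_z = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]               # P(v|z)
p_z_d = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]  # P(z|d)
doc_labels = ["man-made", "man-made", "natural", "natural"]

p_d_c = class_conditional_docs(doc_labels)
p_z_c = aspect_given_class(p_z_d, p_d_c)
p_v_c = patch_given_class(p_v_z, p_z_c)
# One likelihood ratio per patch, independent of the image it occurs in.
likelihood_ratio = [mm / nat for mm, nat in
                    zip(p_v_c["man-made"], p_v_c["natural"])]
```

In this toy setting the first patch is more likely under the man-made class and the third under the natural class, whatever image they are observed in.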
6.2. Aspect Model 2
From (12), we see that despite the fact that the above model captures co-occurrence of the patches in the aspect distributions $P(v \mid z)$, the context provided by the specific image $d$ has no direct impact on the likelihood. To explicitly introduce this context knowledge, we propose to evaluate the likelihood ratio of patches conditioned on the observed image $d$,
$$LR(v, d) = \frac{P(v \mid d, c = \text{man-made})}{P(v \mid d, c = \text{natural})}. \tag{14}$$
The evaluation of $P(v \mid d, c)$ can be obtained by marginalizing over the aspects,
$$P(v \mid d, c) = \sum_{k} P(v \mid z_k)\, P(z_k \mid d, c), \tag{15}$$
where we have exploited the conditional independence of patch occurrence given the aspect variable. Under model 1 assumptions, $P(z_k \mid d, c)$ reduces to $P(z_k \mid d)$, which clearly shows the limitation of this model to introduce both context and class information for patch classification. To overcome this, we assume that the aspects depend on the class label as well. The parameters of this model are the aspect multinomial $P(v \mid z)$ and the mixture multinomial $P(z \mid d, c)$, which could be estimated from labeled data by EM as before. However, as our model is not fully generative, only $P(v \mid z)$ can be kept fixed, and we would have to estimate $P(z \mid d, c)$ for each new image $d$. We propose instead to separate the contributions to the aspect likelihood due to the class-aspect dependencies from the contributions due to the image-aspect dependencies. Thus, we propose to approximate $P(z_k \mid d, c)$ as
$$P(z_k \mid d, c) \propto P(z_k \mid d)\, P(z_k \mid c), \tag{16}$$
where $P(z_k \mid c)$ is still obtained using (13). The complete expression is given by
$$P(v \mid d, c) \propto \sum_{k} P(v \mid z_k)\, P(z_k \mid c)\, P(z_k \mid d). \tag{17}$$
The main difference with (12) is the introduction of the contextual term $P(z_k \mid d)$, which means that patches will not only be classified based on their association with class-likely aspects but also on the specific occurrence of these aspects in the given image.
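The effect of the contextual term can be sketched numerically. In the example below, the score of a patch for a class is the sum over aspects of the product of three terms: the probability of the patch given the aspect, of the aspect given the class, and of the aspect given the image. Whether a per-class normalization is applied on top of these sums is not specified here, so the sketch uses the unnormalized sums; this choice, the function names, and all numbers are assumptions.

```python
def contextual_scores(p_v_z, p_z_c, p_z_d_new):
    """Scores proportional to P(v|d,c): sum_k P(v|z_k) P(z_k|c) P(z_k|d)."""
    n_aspects, n_patches = len(p_v_z), len(p_v_z[0])
    return [sum(p_v_z[k][v] * p_z_c[k] * p_z_d_new[k]
                for k in range(n_aspects))
            for v in range(n_patches)]

def contextual_likelihood_ratio(p_v_z, p_z_mm, p_z_nat, p_z_d_new):
    """Image-dependent ratio of man-made to natural patch scores."""
    mm = contextual_scores(p_v_z, p_z_mm, p_z_d_new)
    nat = contextual_scores(p_v_z, p_z_nat, p_z_d_new)
    return [a / b for a, b in zip(mm, nat)]

# Hypothetical parameters: aspect 0 is mostly man-made, aspect 1 mostly natural.
p_v_z = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]      # P(v|z)
p_z_mm, p_z_nat = [0.85, 0.15], [0.15, 0.85]    # P(z|c) for the two classes

urban_image = [0.9, 0.1]    # P(z|d) of an image dominated by aspect 0
forest_image = [0.1, 0.9]   # P(z|d) of an image dominated by aspect 1

lr_urban = contextual_likelihood_ratio(p_v_z, p_z_mm, p_z_nat, urban_image)
lr_forest = contextual_likelihood_ratio(p_v_z, p_z_mm, p_z_nat, forest_image)
```

The second patch is ambiguous between the two aspects: its ratio is above 1 in the image dominated by the man-made aspect and below 1 in the image dominated by the natural aspect, which is the label switching that an image-independent model cannot produce.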
Inference on New Images
With aspect model 1 (and also with the empirical distribution, cf. the baseline model in Section 7), the patch classification decision is taken once and for all at training time, through the patch co-occurrence analysis on the training images. Thus, for a new image, the extracted patches are directly assigned to their corresponding most likely class label. For aspect model 2, however, the likelihood ratio (14) involves the image-dependent aspect parameters (17). Given our approximation (16), these parameters have to be inferred for each new image, in a similar fashion as for PLSA. The aspect distribution of the new image is estimated by maximizing the likelihood of its patch histogram, keeping the learned aspect multinomials (the patch distributions given each aspect) fixed in the maximization step.
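The inference step for a new image can be sketched as EM in which only the mixture over aspects is re-estimated while the patch distributions of the aspects stay fixed. The function name, the uniform initialisation, the iteration count, and the toy values are assumptions of this sketch.

```python
def fold_in(histogram, p_v_z, n_iter=100):
    """Estimate P(z|d_new) by EM with the learned P(v|z) held fixed.

    histogram[v] is the count of patch v in the new image.
    """
    n_aspects = len(p_v_z)
    total = float(sum(histogram))
    p_z_d = [1.0 / n_aspects] * n_aspects      # uniform starting point
    for _ in range(n_iter):
        expected = [0.0] * n_aspects
        for v, n_v in enumerate(histogram):
            if n_v == 0:
                continue
            # E-step: P(z | d_new, v) under the current mixture weights.
            joint = [p_z_d[k] * p_v_z[k][v] for k in range(n_aspects)]
            norm = sum(joint)
            for k in range(n_aspects):
                expected[k] += n_v * joint[k] / norm
        # M-step: only the mixture weights of the new image are updated.
        p_z_d = [e / total for e in expected]
    return p_z_d

# Hypothetical aspects; the histogram is dominated by patch 2,
# which is much more probable under aspect 1 than under aspect 0.
p_v_z = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
p_z_new = fold_in([1, 3, 12], p_v_z)
```

The resulting mixture puts most of its mass on the second aspect, and it is this image-specific vector that enters the contextual term of aspect model 2.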
7. Baseline Models
We propose two complementary baseline models. The first baseline directly uses the empirical patch class-conditional distribution to classify new patches; the second learns a model from the region descriptors themselves, without quantization.
7.1. Empirical Class-Conditional Patch Distribution
Given a set of training data, the ratio in (1) can simply be estimated using the empirical distribution of patches, as done in previous work. More precisely, given a set of images manually segmented into man-made and natural regions (e.g., Figure 1(c)), the class-conditional probability of a patch is estimated as the number of times the patch appears in regions of a given class, divided by the total number of visterms of that class in the training set. Note that the class conditional probabilities could have been considered instead. This would only have changed the value of the estimated likelihood threshold. The class conditional probabilities are shown in Figure 11, indicating that there is a substantial amount of polysemy. Patches can simultaneously have a high probability given both classes (e.g., note that all patches appear at least 15% of the time in the natural class).
Empirical estimation of probabilities is simple but may suffer from several drawbacks. A first one is that a large amount of labeled training data might be necessary to avoid noisy estimates, especially when using large vocabulary sizes. A second one is that such estimation only reflects the individual patch occurrences, and does not account for any kind of relationship between them. Patches, however, correspond to regions extracted from full images and should therefore be interpreted in this context. In particular, we see in Figure 11 that even if the two class-conditional patch distributions are estimated on the segmented image regions from the test set, there is a considerable ambiguity of the patches with respect to the two classes.
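A minimal sketch of this empirical estimate, assuming the training data is available as a list of patch indices with one class label per occurrence. The smoothing constant `eps` is not part of the description above; it is added only so that the ratio stays defined for patches never observed in one of the classes.

```python
from collections import Counter

def empirical_likelihood_ratio(patch_ids, labels, vocab_size, eps=1e-6):
    """LR(v) = P(v|man-made) / P(v|natural) from labeled patch occurrences."""
    counts = {"man-made": Counter(), "natural": Counter()}
    for v, c in zip(patch_ids, labels):
        counts[c][v] += 1
    totals = {c: sum(cnt.values()) for c, cnt in counts.items()}
    ratio = []
    for v in range(vocab_size):
        # Relative frequency of patch v within each class (lightly smoothed).
        p_mm = (counts["man-made"][v] + eps) / (totals["man-made"] + eps * vocab_size)
        p_nat = (counts["natural"][v] + eps) / (totals["natural"] + eps * vocab_size)
        ratio.append(p_mm / p_nat)
    return ratio

# Toy occurrences: six patches from man-made regions, six from natural ones.
patch_ids = [0, 0, 0, 1, 1, 2, 0, 1, 1, 2, 2, 2]
labels = ["man-made"] * 6 + ["natural"] * 6
lr = empirical_likelihood_ratio(patch_ids, labels, vocab_size=3)
```

Patch 1 occurs equally often in both classes, so its ratio is 1: this is the kind of polysemous patch that an image-independent estimate cannot resolve.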
7.2. Gaussian Mixture Model Soft Assignment
Quantizing image regions into patches discards all information about the distance of each particular local descriptor to the corresponding patch cluster center. It results in a compact representation that can be seen as a drastic simplification of the data. Two descriptors of highly similar local textures can be assigned to different patches if they are close to the border between the two clusters. This intrinsic ambiguity of the quantization approach can be questioned. In the previous example, knowing that the two regions were in fact similar could be beneficial.
One way to address this issue is to perform a soft clustering of the region features. Instead of attributing a single patch number to each local descriptor, we allow for multiple cluster assignments with membership probabilities, assuming that the region descriptors have been generated by a Gaussian mixture model (GMM). Given this soft clustering, we base the classification of image patches on the class likelihood ratio of their corresponding local descriptor $x$, given by
$$LR(x) = \sum_{k=1}^{K} P(g_k \mid x)\, LR(g_k), \tag{18}$$
where $K$ is the total number of Gaussian distributions in the GMM, $P(g_k \mid x)$ denotes the probability of the Gaussian $g_k$ having generated the local descriptor $x$, and $LR(g_k)$ is the class likelihood ratio of the Gaussian $g_k$. Note that the empirical baseline based on the K-means hard clustering becomes a special case of (18) when $P(g_k \mid x)$ equals 1 for one Gaussian component and 0 for the others. The posterior probability $P(g_k \mid x)$ is computed as
$$P(g_k \mid x) = \frac{w_k\, \mathcal{N}(x \mid \mu_k, \sigma_k)}{p(x)}, \tag{19}$$
where $w_k$, $\mathcal{N}(x \mid \mu_k, \sigma_k)$, and $p(x)$ relate to the standard GMM formulation. Each feature is generated by a mixture of Gaussian distributions, with the following likelihood given the estimated GMM mixture weights $w_k$, means $\mu_k$, and standard deviations $\sigma_k$:
$$p(x) = \sum_{k=1}^{K} w_k\, \mathcal{N}(x \mid \mu_k, \sigma_k), \tag{20}$$
where $\mathcal{N}(x \mid \mu_k, \sigma_k)$ is the Gaussian distribution of the component $g_k$. The class likelihood ratio of a Gaussian distribution is given by
$$LR(g_k) = \frac{P(g_k \mid c = \text{man-made})}{P(g_k \mid c = \text{natural})}, \tag{21}$$
where $P(g_k \mid c)$ is estimated by the relative importance of that generating Gaussian distribution for each class in the labeled images.
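The soft-assignment likelihood ratio can be sketched as follows: each Gaussian component receives a posterior probability of having generated the descriptor, and the descriptor's ratio is the posterior-weighted sum of the per-component ratios. The diagonal covariance, the two-component mixture, and the per-component ratios below are assumptions chosen only to make the example small.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a diagonal-covariance Gaussian at point x."""
    log_p = 0.0
    for xi, mi, si in zip(x, mean, std):
        log_p += -0.5 * ((xi - mi) / si) ** 2 - math.log(si * math.sqrt(2 * math.pi))
    return math.exp(log_p)

def responsibilities(x, weights, means, stds):
    """Posterior of each component: weight times density, divided by p(x)."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    p_x = sum(joint)                     # mixture likelihood p(x)
    return [j / p_x for j in joint]

def soft_likelihood_ratio(x, weights, means, stds, component_lr):
    """Posterior-weighted sum of the per-component class likelihood ratios."""
    post = responsibilities(x, weights, means, stds)
    return sum(p * lr for p, lr in zip(post, component_lr))

# Hypothetical two-component GMM over 2-D descriptors.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
stds = [[1.0, 1.0], [1.0, 1.0]]
component_lr = [3.0, 0.25]   # assumed class likelihood ratio of each Gaussian

lr_near_first = soft_likelihood_ratio([0.2, -0.1], weights, means, stds, component_lr)
lr_midpoint = soft_likelihood_ratio([2.0, 2.0], weights, means, stds, component_lr)
```

A descriptor close to the first mean essentially inherits that component's ratio, as hard clustering would give, while a descriptor halfway between the two means receives the average of the two ratios instead of being forced onto one side of the cluster border.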
8. Markov Random Field (MRF) Regularization
The contextual modeling with latent aspects that we present in this paper can be conveniently integrated with traditional spatial regularization schemes. To investigate this, we present the embedding of our contextual model within the MRF framework, though other schemes could be similarly employed [2, 11, 28].
Let us denote by $S$ the set of sites $s$, and by $Q$ the set of cliques of two elements associated with a second-order neighborhood system defined over $S$. The patch classification can be classically formulated using the maximum a posteriori (MAP) criterion as the estimation of the label field $C = \{c_s, s \in S\}$ which is most likely to have produced the observation field $V = \{v_s, s \in S\}$. In our case, the set of sites is given by the set of interest points, the observations $v_s$ take their value in the set of patches, and the labels $c_s$ belong to the class set {man-made, natural}. Assuming that the observations are conditionally independent given the label field (i.e., $p(V \mid C) = \prod_{s \in S} p(v_s \mid c_s)$) and that the label field is an MRF over the graph defined by the sites and their neighborhood system, and due to the equivalence between MRF and Gibbs distribution ($p(C) \propto e^{-U_1(C)}$), the MAP formulation is equivalent to minimizing an energy function:
$$U(C, V) = U_1(C) + U_2(C, V), \tag{22}$$
where $U_1$ is the regularization term which accounts for the prior spatial properties (homogeneity) of the label field. Its local potentials (23) are of two kinds: a pairwise potential, equal to $\beta$ when two neighboring sites $(s, t) \in Q$ carry different labels and to 0 otherwise, and a single-site potential that depends only on the label $c_s$.
Here $\beta$ is the cost of having neighbors with different labels, while the single-site potential favors either the man-made class label or the natural one, depending on its sign. $U_2$ is the data-driven term, whose local potentials (24) are given by the negative log-likelihood of each observation given its label, $-\log p(v_s \mid c_s)$.
To implement the above regularization scheme, we need to specify a neighborhood system. Several alternatives could be employed, exploiting, for instance, the scale of the invariant detector. Here, we used a simpler scheme: two points are defined to be neighbors if each is among the nearest neighbors of the other. For this set of experiments, we defined the neighborhood to be constituted by the five nearest neighbors. Finally, in the experiments, the minimization of the energy function of (22) was conducted using simulated annealing.
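The neighborhood construction and the energy being minimized can be sketched on a handful of points. The sketch builds mutual nearest-neighbor pairs, evaluates an energy made of a pairwise disagreement cost, an optional per-site class bias, and a negative log-likelihood data term, and finds the minimum by exhaustive search. The exhaustive search stands in for simulated annealing purely because the example has four sites; the bias convention (added once per man-made site), the point coordinates, and the likelihood values are assumptions.

```python
import math
from itertools import product

def mutual_knn_pairs(points, k):
    """Pairs (s, t), s < t, where each point is among the k nearest of the other."""
    n = len(points)
    nearest = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        nearest.append(set(others[:k]))
    return {(s, t) for s in range(n) for t in nearest[s]
            if s < t and s in nearest[t]}

def mrf_energy(labels, likelihoods, pairs, beta, bias=0.0):
    """Energy of a labeling (0 = natural, 1 = man-made).

    likelihoods[s] = (p(v_s | natural), p(v_s | man-made)).
    """
    pairwise = sum(beta for s, t in pairs if labels[s] != labels[t])
    singleton = sum(bias for c in labels if c == 1)
    data = -sum(math.log(likelihoods[s][c]) for s, c in enumerate(labels))
    return pairwise + singleton + data

def exhaustive_map(likelihoods, pairs, beta, bias=0.0):
    """Brute-force minimizer, usable only for a handful of sites."""
    return min(product((0, 1), repeat=len(likelihoods)),
               key=lambda lab: mrf_energy(lab, likelihoods, pairs, beta, bias))

# Three nearby interest points and one far away (hypothetical coordinates).
points = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (10.0, 0.0)]
pairs = mutual_knn_pairs(points, k=2)
likelihoods = [(0.2, 0.8), (0.55, 0.45), (0.3, 0.7), (0.9, 0.1)]

independent = exhaustive_map(likelihoods, pairs, beta=0.0)
smoothed = exhaustive_map(likelihoods, pairs, beta=0.5)
```

Without regularization the weakly natural second point is labeled natural; with a disagreement cost of 0.5 it is pulled to the man-made label of its two neighbors, while the distant fourth point, which has no mutual neighbor, keeps its own label.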
9. Experiments and Discussion
We validate our proposed models on natural versus man-made scene patch classification. In this section, we present our experimental setup, show a detailed performance evaluation illustrated with the patch classification results on a few test images, and we finally study the result of integrating spatial regularization.
9.1. Experimental Setup
Three image subsets from the Corel Stock Photo Library were used in the experiments. The first set contains 6600 photos depicting mountains, forests, buildings, and cities. Of these, 6000 have no associated label, while the remaining 600 images form a labeled subset whose content mainly belonged to one of the two classes; they were hand-labeled with a single class label, leading to approximately 300 images of each class. This labeling at the image level is used to quickly label the corresponding patches. The full first set was used to construct the vocabulary and learn the aspect models, while the labeled subset was used, entirely or in part, to estimate the patch likelihoods for each class. A third set, containing 485 images of man-made structures in natural landscapes, hand-segmented with polygonal shapes to label the corresponding patches (Figure 1), was used to evaluate the methods.
9.1.2. Performance Measure
The global performance of the algorithm was assessed using the true positive rate (TPR, number of positive regions correctly classified over the total number of positive descriptors), false positive rate (FPR, number of false positives over the total number of negative descriptors), and true negative rate (TNR = 1-FPR), where man-made structure is the positive class. The FPR, TPR, and TNR values vary with the threshold applied for classification (see (2)).
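These rates can be computed as in the short sketch below, assuming that a patch is declared man-made when its likelihood ratio exceeds the threshold; the decision rule and the toy values are assumptions for illustration.

```python
def classification_rates(likelihood_ratios, is_man_made, threshold):
    """Return (TPR, FPR, TNR), with man-made as the positive class."""
    tp = fp = tn = fn = 0
    for lr, positive in zip(likelihood_ratios, is_man_made):
        predicted_positive = lr > threshold
        if positive and predicted_positive:
            tp += 1
        elif positive:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr, 1.0 - fpr

# Toy likelihood ratios with their ground-truth classes.
ratios = [3.0, 1.5, 0.8, 0.4, 2.0, 0.2]
truth = [True, True, True, False, False, False]
tpr, fpr, tnr = classification_rates(ratios, truth, threshold=1.0)
```

Sweeping the threshold over a range of values and recording the resulting (FPR, TPR) pairs traces out an ROC curve.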
Results are reported with a vocabulary size ranging from 1000 to 10 000 patches, GMMs with 1000 and 2000 mixture components, and 20 aspects for aspect models 1 and 2.
9.2. Performance Evaluation
Figure 12 displays the receiver operating characteristic (ROC) curves (TPR versus FPR) of the empirical patch distribution baseline and the GMM baseline for various parameter settings (a), and compares the baseline approaches, with their best parameter settings, to the two proposed aspect models (b). The ROC curves are obtained by varying the likelihood ratio threshold, each value resulting in a different patch classification. The first observation relates to the influence of the patch vocabulary size, varied between 1000 and 10 000 patches in Figure 12(a), for the empirical patch distribution baseline. While no significant difference in performance is observed between the vocabularies of 1000 and 5000 patches, the performance decreases significantly for the 10 000 patch vocabulary. This effect is somewhat counter-intuitive, since a higher granularity in the quantization makes it possible to define a finer classification decision function. It can be explained by a higher level of noise in the estimation of the likelihood ratio, since the number of training images remains constant. In contrast, the GMM approach is more accurate, as it allows good likelihood ratio estimates while providing a finer feature space quantization through soft assignment. Since in both cases no improvement is observed when using vocabulary sizes larger than 1000, we will use this number in what follows (for the empirical patch distribution and the aspect models).
As can be seen in Figure 12(b), aspect model 1 performs slightly better than the empirical patch distribution baseline for all vocabulary sizes. However, the GMM baseline improves on both the empirical patch distribution baseline and the aspect model 1 classification performance. The GMM approach is, therefore, the best image-independent patch classification approach. Aspect model 2 significantly outperforms all other methods, demonstrating the advantage of an image-dependent patch classification. Interestingly, the aspect models do not need 100% of the 600 labeled images for a good classification performance. We can observe in Figure 12 that the same patch classification performance is achieved when using only 5% of the labeled images (30 images) to estimate the class-conditional aspect likelihood.
To further validate our approach, Table 1 reports the half-total-recognition rate (HTRR) measured by 10-fold cross-validation. For each of the folds, 90% of the test data is used to estimate the likelihood threshold leading to the equal error rate (EER, obtained when TPR = TNR) on this data. This threshold is then applied to the remaining 10% (unseen images) of the test set, from which the HTRR (HTRR = (TPR + TNR)/2) is computed. This table shows that the ranking observed on the ROC curve is clearly maintained, and that aspect model 2 results in a 7.5% relative performance increase with respect to the baseline approach.
As mentioned in Section 6, aspect model 1 and the empirical distribution method (GMM and K-means based) assign specific patches to the man-made or natural class independently of the actual image in which those patches occur. This sets a common limit on the maximum performance of both systems, which is referred to here as the ideal case. This limit is given by attributing to each patch the class label corresponding to the class in which that patch occurs the most in the test data. On our data, this ideal case corresponds to an HTRR of 71.0% for the 1000-patch vocabulary, showing the advantage of an image-dependent patch classification method.
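One fold of this protocol can be sketched as follows: the threshold is chosen on the tuning portion so that TPR and TNR are as close as possible, then applied unchanged to the held-out portion to compute the HTRR. Restricting the candidate thresholds to the observed scores, the decision rule (score above threshold means man-made), and the toy values are assumptions of this sketch; the full procedure would repeat it over the ten folds.

```python
def tpr_tnr(scores, truth, threshold):
    """True positive and true negative rates for the rule score > threshold."""
    pos = [s for s, t in zip(scores, truth) if t]
    neg = [s for s, t in zip(scores, truth) if not t]
    tpr = sum(s > threshold for s in pos) / len(pos)
    tnr = sum(s <= threshold for s in neg) / len(neg)
    return tpr, tnr

def equal_error_threshold(scores, truth):
    """Observed score that minimizes |TPR - TNR| when used as threshold."""
    def gap(th):
        tpr, tnr = tpr_tnr(scores, truth, th)
        return abs(tpr - tnr)
    return min(sorted(set(scores)), key=gap)

def htrr(scores, truth, threshold):
    """Half-total-recognition rate: mean of TPR and TNR."""
    tpr, tnr = tpr_tnr(scores, truth, threshold)
    return (tpr + tnr) / 2.0

# Tuning portion of one fold (likelihood ratios and ground truth).
tune_scores = [3.0, 1.5, 0.8, 0.4, 2.0, 0.2]
tune_truth = [True, True, True, False, False, False]
# Held-out portion of the same fold.
test_scores = [2.5, 0.9, 0.5, 1.2]
test_truth = [True, True, False, False]

threshold = equal_error_threshold(tune_scores, tune_truth)
fold_htrr = htrr(test_scores, test_truth, threshold)
```

On this toy fold the equal-error threshold is 0.8, where TPR and TNR both equal 2/3 on the tuning data, and the held-out HTRR is 0.75.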
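The ideal-case bound can be sketched by giving every patch index the class it has most often among its occurrences in the test data and scoring the resulting labeling. Counting raw occurrences and sending ties to the natural class are assumptions of this sketch, as are the toy occurrences.

```python
from collections import Counter, defaultdict

def ideal_case_htrr(patch_ids, is_man_made):
    """HTRR when each patch index always receives its majority class."""
    votes = defaultdict(Counter)
    for v, positive in zip(patch_ids, is_man_made):
        votes[v][positive] += 1
    # Ties go to the natural class (an arbitrary choice of this sketch).
    majority = {v: cnt[True] > cnt[False] for v, cnt in votes.items()}
    tp = sum(1 for v, p in zip(patch_ids, is_man_made) if p and majority[v])
    tn = sum(1 for v, p in zip(patch_ids, is_man_made)
             if not p and not majority[v])
    n_pos = sum(1 for p in is_man_made if p)
    n_neg = len(is_man_made) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)

# Toy test data: patch index and ground-truth class of each occurrence.
patch_ids = [0, 0, 0, 1, 1, 1, 2, 2]
truth = [True, True, False, True, False, False, True, False]
bound = ideal_case_htrr(patch_ids, truth)
```

Because every patch index occurs in both classes here, even this oracle labeling reaches an HTRR of only 0.625; any method that assigns one fixed label per patch index cannot do better on this data.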
In order to have a chance of performing better than the ideal case, patches must be labeled differently depending on the specific image that is being segmented. Aspect model 2 switches patch class labels according to the contextual information gathered through the identification of image-specific latent aspects. In our data, successful class label switching occurs at least once for 727 out of the 1000 patches in our vocabulary.
9.3. Patch Classification Examples
The impact of the contextual model can also be observed on individual images. Figure 13 displays classification examples of man-made image patches, where likelihood thresholds were estimated at the EER value. As can be seen, aspect model 2 improves the classification results with respect to the two other methods in two different ways. On the one hand, in the first three examples, aspect model 2 increases the precision of the man-made patch classification, producing a slight decrease in the corresponding recall. On the other hand, the fourth example shows aspect model 2 producing a higher recall of man-made patches while maintaining a stable precision. In the fifth example, the occurrence of a strong context causes the whole image to be taken as a natural scene, also improving the total patch classification.
In Figure 14, five more examples of patch classification are shown. The first three rows illustrate natural image context examples that are correctly grasped by aspect model 2. The fourth row shows a correctly estimated man-made context that leads to an improved classification of patches for aspect model 2. In the fifth example, however, the overestimation of the man-made related aspects leads to patches that are dominantly classified as man-made. Nevertheless, overall, as indicated in Figure 12 and Table 1, the introduction of context by co-occurrence is beneficial.
9.4. Effects of the Markov Random Field Regularization
We investigate the impact of the combination with spatial regularization on the task of patch classification. The level of regularization is defined by the cost $\beta$ of having neighbors with different labels (a larger value implies a larger effect). The regularization is conducted by starting at the equal error rate point, as defined in the 10-fold cross-validation experiments described in the preceding section. More precisely, for each of the folds, the EER threshold is used to set the prior on the labels. Thus, in the experiments, when $\beta = 0$ (i.e., no spatial regularization is enforced), we obtain the same results as in Table 1. In Figure 15, we see that the best patch classification performance corresponds to an HTRR of 73.1% and a $\beta$ of 0.35 with the empirical modeling, and an HTRR of 76.3% for a $\beta$ of 0.2 with aspect model 2. This latter value of $\beta$ is chosen for all the MRF illustrations reported in Figures 16 and 17.
The inclusion of the MRF relaxation boosted the performance of both aspect model 2 and the empirical distribution. However, it is important to point out that aspect model 2 still outperforms the empirical distribution model, even though the empirical distribution modeling benefited most from the regularization. This was to be expected, as aspect model 2 was already capturing some of the contextual information that the spatial regularization can provide (notice also that the maximum is achieved for a smaller regularization level with aspect model 2).
Besides obtaining an increase of the HTRR value, we can visually notice a better spatial coherence of the patch classification as can be seen in Figures 16 and 17. We can observe in the images that the MRF relaxation process reduces the occurrence of isolated points, and tends to increase the density of points within segmented regions. We show in the last row of Figure 16 that as can be expected when using prior modeling, on certain occasions the MRF step can over-regularize the patch classification, causing the attribution of only one label to the whole image.
10. Conclusion and Future Work
In this paper, we proposed computational models to perform contextual regional classification of images. These models enable us to exploit a different form of visual context, based on the co-occurrence analysis of patches in the whole image rather than on the more traditional spatial relationships. Patch co-occurrence is summarized into aspect models, whose relevance is estimated for any new image, and used to evaluate class-dependent patch likelihoods. These models have been tested and validated on a man-made versus natural scene image patch classification task. One model has clearly been shown to help in disambiguating polysemic patches based on the context they appear in. Producing satisfactory classification results, it outperforms state-of-the-art likelihood ratio methods, even when using soft assignment techniques.
Moreover, we investigated the use of Markov random field models to introduce spatial coherence in the final classification and showed that the two types of context models can be integrated successfully. This additional information enables us to overcome some patch classification errors from the likelihood ratio and aspect model methods, increasing the final performance.
While the results presented here are encouraging, this task is complex, and there is a need for further improvements. Logical extensions would be the introduction of other sources of contextual information like color or scale and other forms of integration of spatial contextual information.
References
Kumar S, Hebert M: Discriminative random fields: a discriminative framework for contextual interaction in classification. Proceedings of the IEEE International Conference on Computer Vision (ICCV '03), October 2003, Nice, France 2: 1150-1157.
Lazebnik S, Schmid C, Ponce J: Affine-invariant local descriptors and neighborhood statistics for texture recognition. Proceedings of the IEEE International Conference on Computer Vision (ICCV '03), October 2003, Nice, France 1: 649-655.
Vogel J, Schiele B: Natural scene retrieval based on a semantic modeling step. Proceedings of the 3rd International Conference on Image and Video Retrieval (CIVR '04), July 2004, Dublin, Ireland, Lecture Notes in Computer Science 3115: 207-215.
Shotton J, Winn J, Rother C, Criminisi A: TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006, Graz, Austria, Lecture Notes in Computer Science 3951: 1-15.
Lowe DG: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 2004,60(2):91-110.
Dorko G, Schmid C: Selection of scale-invariant parts for object class recognition. Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV '03), October 2003, Nice, France 1: 634-640.
Fei-Fei L, Perona P: A Bayesian hierarchical model for learning natural scene categories. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), June 2005, San Diego, Calif, USA 2: 524-531.
Quelhas P, Monay F, Odobez J-M, Gatica-Perez D, Tuytelaars T, Van Gool L: Modeling scenes with local descriptors and latent aspects. Proceedings of the IEEE International Conference on Computer Vision (ICCV '05), October 2005, Beijing, China 1: 883-890.
Sivic J, Russell BC, Efros AA, Zisserman A, Freeman WT: Discovering object categories in image collections. Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), October 2005, Beijing, China
Li SZ: Markov Random Field Modeling in Computer Vision. Springer, New York, NY, USA; 1995.
Kumar S, Hebert M: Man-made structure detection in natural images using a causal multiscale random field. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '03), June 2003, Madison, Wis, USA 1: 119-126.
Murphy K, Torralba A, Freeman W: Using the forest to see the trees: a graphical model relating features, objects and scenes. Proceedings of the Neural Information Processing Systems, December 2003, Vancouver, Canada
Geman S, Geman D: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 1984,6(6):721-741.
Quelhas P, Monay F, Odobez J-M, Gatica-Perez D, Tuytelaars T: A thousand words in a scene. IEEE Transactions on Pattern Analysis and Machine Intelligence 2007,29(9):1575-1589.
Quelhas P, Odobez J-M: Multi-level local descriptor quantization for bag-of-visterms image representation. Proceedings of the 6th ACM International Conference on Image and Video Retrieval (CIVR '07), July 2007, Amsterdam, The Netherlands 242-249.
Csurka G, Dance C, Fan L, Willamowski J, Bray C: Visual categorization with bags of keypoints. Proceedings of the European Conference on Computer Vision (ECCV '04), May 2004, Prague, Czech Republic 59-74.
Marszałek M, Schmid C: Spatial weighting for bag-of-features. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), June 2006, New York, NY, USA 2: 2118-2125.
Wang G, Zhang Y, Fei-Fei L: Using dependent regions for object categorization in a generative framework. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), June 2006, New York, NY, USA 2: 1597-1604.
Sivic J, Zisserman A: Video Google: a text retrieval approach to object matching in videos. Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV '03), October 2003, Nice, France 2: 1470-1477.
Willamowski J, Arregui D, Csurka G, Dance CR, Fan L: Categorizing nine visual classes using local appearance descriptors. Proceedings of the Workshop on Learning for Adaptable Visual Systems (LAVS), in Conjunction with the International Conference on Pattern Recognition (ICPR '04), August 2004, Cambridge, UK
Bosch A, Zisserman A, Muñoz X: Scene classification via pLSA. Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006, Graz, Austria, Lecture Notes in Computer Science 3954: 517-530.
Hofmann T: Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 2001,42(1-2):177-196.
Blei DM, Ng AY, Jordan MI: Latent Dirichlet allocation. Journal of Machine Learning Research 2003,3(4-5):993-1022.
Pal NR, Pal SK: A review on image segmentation techniques. Pattern Recognition 1993,26(9):1277-1294. 10.1016/0031-3203(93)90135-J
Carson C, Thomas M, Belongie S, Hellerstein JM, Malik J: Blob-world: a system for region-based image indexing and retrieval. In Proceedings of the 3rd International Conference on Visual Information Systems (VISUAL '99), June 1999, Amsterdam, The Netherlands. Springer; 509-516.
Duygulu P, Barnard K, de Freitas JFG, Forsyth DA: Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. Proceedings of the 7th European Conference on Computer Vision (ECCV '02), May 2002, Copenhagen, Denmark, Lecture Notes in Computer Science 2353: 97-112.
Vogel J, Schiele B: A semantic typicality measure for natural scene categorization. Proceedings of the 26th DAGM Pattern Recognition Symposium, August-September 2004, Tübingen, Germany, Lecture Notes in Computer Science 3175: 195-203.
Verbeek JJ, Triggs B: Region classification with Markov field aspect models. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), June 2007, Minneapolis, Minn, USA 1-8.
Shi J, Malik J: Normalized cuts and image segmentation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '97), June 1997, San Juan, Puerto Rico, USA 731-737.
Kumar MP, Torr PHS, Zisserman A: Obj cut. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), June 2005, San Diego, Calif, USA 1: 18-25.
Leibe B, Leonardis A, Schiele B: Combined object categorization and segmentation with an implicit shape model. Proceedings of the 8th European Conference on Computer Vision (ECCV '04), May 2004, Prague, Czech Republic 17-32.
Russell BC, Freeman WT, Efros AA, Sivic J, Zisserman A: Using multiple segmentations to discover objects and their extent in image collections. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), June 2006, New York, NY, USA 2: 1605-1612.
Monay F, Quelhas P, Odobez J-M, Gatica-Perez D: Integrating co-occurrence and spatial contexts on patch-based scene segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPR '06), June 2006, New York, NY, USA 14.
Cao L, Fei-Fei L: Spatially coherent latent topic model for concurrent object segmentation and classification. Proceedings of the 11th IEEE International Conference on Computer Vision (ICCV '07), October 2007, Rio de Janeiro, Brazil
Liu D, Chen T: Background cutout with automatic object discovery. Proceedings of IEEE International Conference on Image Processing (ICIP '07), September 2007, San Antonio, Tex, USA 4: 345-348.
Tuytelaars T, Van Gool L: Content-based image retrieval based on local affinely invariant regions. Proceedings of the 3rd International Conference on Visual Information and Information Systems, June 1999, Amsterdam, The Netherlands, Lecture Notes in Computer Science 1614: 493-500.
Mikolajczyk K, Schmid C: A performance evaluation of local descriptors. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '03), June 2003, Madison, Wis, USA 2: 257-263.
Mikolajczyk K, Tuytelaars T, Schmid C, et al.: A comparison of affine region detectors. International Journal of Computer Vision 2005,65(1-2):43-72. 10.1007/s11263-005-3848-x
Baeza-Yates R, Ribeiro-Neto B: Modern Information Retrieval. ACM Press, New York, NY, USA; 1999.
Monay F, Quelhas P, Gatica-Perez D, Odobez J-M: Constructing visual models with a latent space approach. Proceedings of the Subspace, Latent Structure and Feature Selection, Statistical and Optimization, Perspectives Workshop (SLSFS '05), February 2006, Bohinj, Slovenia, Lecture Notes in Computer Science 3940: 115-126.
Fergus R, Fei-Fei L, Perona P, Zisserman A: Learning object categories from Google's image search. Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), October 2005, Beijing, China 2: 1816-1823.
Buntine WL: Variational extensions to EM and multinomial PCA. Proceedings of the 13th European Conference on Machine Learning, August 2002, Helsinki, Finland, Lecture Notes in Computer Science 2430: 23-34.
Bishop C: Neural Networks for Pattern Recognition. Oxford University, Oxford, UK; 1995.
Acknowledgments
This work was funded by the Swiss National Science Foundation through the MULTI project and the Swiss National Center of Competence in Research on Interactive Multimodal Information Management (NCCR (IM)2).