Contextual Classification of Image Patches with Latent Aspect Models
© Florent Monay et al. 2009
Received: 21 May 2008
Accepted: 24 October 2008
Published: 9 February 2009
We present a novel approach for contextual classification of image patches in complex visual scenes, based on the use of histograms of quantized features and probabilistic aspect models. Our approach uses context in two ways: (1) by using the fact that specific learned aspects correlate with the semantic classes, which resolves some cases of visual polysemy often present in patch-based representations, and (2) by formalizing the notion that scene context is image-specific—what an individual patch represents depends on what the rest of the patches in the same image are. We demonstrate the validity of our approach on a man-made versus natural patch classification problem. Experiments on an image collection of complex scenes show that the proposed approach improves region discrimination, producing satisfactory results and outperforming two noncontextual methods. Furthermore, we also show that co-occurrence and traditional (Markov random field) spatial contextual information can be conveniently integrated for further improved patch classification.
In general, the constituent parts of a scene do not exist in isolation, and the visual context (the spatial dependencies between scene parts) can be used to improve region classification [1, 10–12]. Two image regions that are indistinguishable when analyzed independently might be assigned to the correct class with the help of context knowledge. Broadly speaking, there exists a continuum of contextual models for image region classification. At one end are explicit models like Markov random fields (MRFs), where spatial constraints are defined via local statistical dependencies between class region labels [10, 13], and between observations and labels. At the other end are context-free models, where regions are classified assuming statistical independence between the region labels, and using only local observations [3, 6].
Lying between these two extremes, a type of scene representation of increasing use is the histogram of quantized image patches, referred to as bag-of-visterms [14, 15], bag-of-keypoints, bag-of-features, or bag-of-codewords [7, 18] in the literature. This representation is obtained by sampling local regions in an image and quantizing them into a finite set of patches according to their visual appearance, storing the patch occurrences in the image in the form of a histogram. On one hand, unlike explicit contextual models, spatial neighboring relations in this representation are discarded, and any ordering between the image regions disappears. On the other hand, unlike point-wise models, although the image regions are still local, the scene is represented collectively. This can explain why, despite the loss of strong spatial contextual information, this type of representation has been successfully used in a number of problems, including object matching, object categorization [9, 20], scene classification [7, 8, 21], and scene retrieval.
As a collection of discrete data, the histogram of patches is suitable for probabilistic models that capture a different form of context, expressed implicitly through patch co-occurrence. These models, originally designed for text collections (documents composed of terms), use discrete hidden aspect variables to model the co-occurrence of terms within and across documents. Examples include probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA). We have recently shown that the combination of PLSA and histograms of quantized invariant local descriptors can be successfully used for global scene classification [8, 14]. Given an unlabeled image set, PLSA captures aspects that represent the class structure of the collection, and provides a low-dimensional representation useful for classification. Similar conclusions were reached elsewhere with an LDA-related model.
We show that the above-mentioned aspect models can be directly applied to patch classification, since specific aspects, although learned without class information, correlate with the classes of interest. These aspects can be easily labeled by hand or using a labeled image dataset, and used to classify their most likely patches accordingly.
The interpretation of a particular patch depends on what the other patches in the same image are, and this co-occurrence context is precisely captured by the estimated aspect mixture weights. We propose to formally include this contextual information in a new aspect model, so that even though patches appear in multiple classes, the information about the other patches in the same image can be used to improve discrimination (Figure 2).
We present results on a man-made versus natural image regions classification task, and show that the contextual information learned from co-occurrence improves the performance compared to a non-contextual approach. In our view, the proposed approach constitutes an interesting way to model visual context that could be applicable to other problems in computer vision.
We show, through the use of a Markov random field model, that standard spatial context can be integrated, resulting in an improvement of the final classification of image regions.
This paper is organized as follows. Section 2 reviews the closest related work. Section 3 presents our approach to local image patch classification. Section 4 introduces the image representation. Section 5 introduces the concept of an image as a mixture of latent aspects, which is extended in Section 6 for contextual local patch classification. Section 7 discusses the two baseline models. Section 8 describes the Markov random field regularization. Section 9 reports our results. Section 10 concludes the paper.
2. Related Work
Image region classification is a research field that has been developed for many years. Generally speaking, there are two main approach directions to the problem: classic pixel-based image segmentation and image region classification.
Classic image segmentation is defined as a process of partitioning the image into nonintersecting regions, such that each region is homogeneous and no union of two adjacent regions is homogeneous. The main issue is defining the property by which homogeneity is imposed. In most cases, the properties on which segmentation is based are gray-scale, color, texture, or a combination of those properties. Image segmentation defined this way is performed on each image independently, and reviews of traditional segmentation approaches are available in the literature. Many more alternatives have been proposed. For instance, Carson et al. present a blob-based segmentation method that models the color, texture, and position of all the pixels in a given image with a Gaussian mixture model (GMM), and attributes the label of its most likely GMM component to each pixel. This creates roughly homogeneous image regions called blobs, which are used for image retrieval, allowing the user to query the database at the blob level instead of the image level.
We consider the perspective on image region classification which is based on automatically defined patches. As we will show, this allows the regional classification of images based on class labels that are predefined and applicable to the whole database, and not based on a homogeneity criterion of the regions in an image. The region descriptors are classified into categories, and the density of the region class labels gives a regional classification of the image. In what follows, we describe a selection of regional image classification models that are based on class labels, with regions that cover the whole image [1, 3, 26–28] or only a part of it [2, 6, 9].
One such approach relies on the normalized cuts segmentation algorithm to segment the image into regions that are then quantized. Derived from the machine translation literature, an expectation-maximization (EM) algorithm estimates the probability distributions linking a set of words and blobs. Once the model parameters are learned, words are attached to each region. This region naming process is comparable to image segmentation.
Extending the MRF model, Kumar and Hebert proposed a discriminative random field (DRF) model that includes neighborhood interactions in the class labels, as well as at the observation level. They apply the DRF model to the segmentation of man-made structures in natural scenes, with an extraction of image features based on a grid of blocks that fully covers the image. The DRF model is trained on a set of manually segmented images, and then used to infer the segmentation into the two target classes.
Using a similar grid layout, Vogel and Schiele presented a two-stage classification framework to perform scene retrieval and scene classification. This work performs an implicit scene segmentation as an intermediate step, classifying each image block into a set of semantic classes such as grass, rocks, or foliage.
To include global shape prior information in an MRF-based model formulation, Kumar et al. proposed an MRF part-based segmentation model, referred to as ObjCut, which represents objects by means of segmented parts. This requires the explicit encoding of the spatial information relating parts and also the modeling of their deformations. The use of regions in this case reduces the invariance to occlusion, and the modeling has a high computational cost. Furthermore, the object to model must be composed of discriminative parts with known spatial relationships, which is not the case for scenes.
Invariant local descriptors have also been used for an object detection task. All region descriptors in the training set are modeled with a Gaussian mixture model (GMM). A subset of the mixture components is then selected based on their estimated class likelihood ratio or mutual information, and the selected components are used to classify new regions based on their local descriptors. In this non-contextual approach, new descriptors are independently classified into object or background regions, without taking the other descriptors in the same image into consideration. A similar approach has been proposed that introduces spatial contextual information through neighborhood statistics of the GMM components collected on training images, where the learned prior statistics are used for relaxation of the original region classification.
Leibe et al. proposed an implicit object model based on local invariant descriptors that jointly learns the discriminant descriptors for an object and their spatial relationships. Once again, this approach implies an existing spatial layout of the object parts which does not exist in the case of scenes.
As an extension to the local descriptor representation of images, probabilistic aspect models have recently been proposed to capture descriptor co-occurrence information with the use of a hidden variable (latent aspect). A hierarchical Bayesian model that extended LDA was proposed for global categorization of natural scenes. This work showed that important patches for a class in an image can be found. However, the problem of local image patch classification was not addressed. The combination of local descriptors and PLSA for local patch classification has also been illustrated previously. However, this work has two limitations. First, patches were classified into aspects, not classes, unless we assume that there is a direct correspondence between aspects and semantic classes. This seems, however, an oversimplistic assumption in general. Secondly, evaluation was limited; in some cases, no objective performance evaluation was conducted.
To model both the object and the scene in an image, Russell et al. proposed to use regions resulting from multiple unsupervised image segmentations to represent an image as an aggregate of sub-images. These sub-images are represented with bag-of-visterms and modeled with a latent aspect model. Starting from multiple image segmentations to maximize the chance that some segmented regions will correspond to actual objects is an interesting approach. There is however no guarantee that this will be true in general, and we therefore model images at the scale of patches in our work to ensure that no initial segmentation step will harm the image representation.
A preliminary version of our work has appeared previously. Inspired by our work, Verbeek and Triggs proposed the extension of aspect modeling by integrating spatial models. The proposed approach introduces spatial coherence to the aspect model, improving segmentation. However, the training of the latent aspects becomes limited to using labeled data, losing the possibility of learning visual co-occurrence from unlabeled data.
Unlike previous approaches, we propose a formal way to integrate latent aspect modeling, learned in an unsupervised way from unlabeled data, with the class information, and we conduct a proper performance evaluation, validating our work with a comparison to a state-of-the-art baseline method. In addition, we explore the integration of the more traditional spatial MRF model into our system and compare the obtained results.
In the final stage of preparing this manuscript, new models were put forward to segment images by combining latent aspect models with quantized local patches. Cao and Fei-Fei presented a latent aspect model that assumes that each region of an image, obtained with an unsupervised segmentation algorithm in a first step, is generated from a single aspect. Regions are not modeled as separate documents, but as building parts of a given image which is itself defined by a mixture of aspects. Liu and Chen proposed to explicitly combine a latent aspect model with a known supervised segmentation algorithm. The segmentation algorithm and the aspect models are linked through a new variable that distinguishes foreground from background patches. This variable is successively obtained from the segmentation algorithm and then considered as an observed variable in the aspect model. A new segmentation is obtained when the aspect model is learned, and this process iterates until the final segmentation is obtained.
3. Scene Patch Classification
The aspect models that we present in this paper allow us to classify image regions into two classes, based on an estimated patch class likelihood that takes advantage of the availability of a patch histogram. The method can be applied to image collections with regions defined randomly, by a regular grid (with or without overlap), or obtained with an interest point/region detector. Depending on what the considered image regions are, the resulting spatial distribution of class labels can produce local image classification with no label overlap (e.g., when using grid patches) [1, 3, 27], or a density-based image patch classification (when using interest point detectors) [2, 6]. In the latter case, as shown in Figure 1, the classification of patches obtained by an interest point detector produces a sparse regional image classification. However, one advantage of using an interest point detector is that the identified stable regions may exhibit better correspondence across images than an arbitrary grid image division. In this paper, we decided to rely on an interest point detector to sample specific types of image regions to be classified, but the technique can be applied to any other form of region selection scheme.
Classification Principle: Likelihood Ratio
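Let $v$ denote a patch and $c$ its class. Patches are classified according to the class-likelihood ratio

$$LR(v) = \frac{P(v \mid c = \text{man-made})}{P(v \mid c = \text{natural})}, \tag{1}$$

with the decision rule

$$v \text{ is classified as man-made if } LR(v) > T, \text{ and as natural otherwise}, \tag{2}$$

where $T$ is a threshold value. Thus, all image regions associated with the same patch will be classified in the same category according to the rule in (2). Note that, alternatively, we could have considered, as a classification rule, a ratio based on the posteriors $P(c \mid v)$. By Bayes' rule, the only difference with respect to using the likelihoods $P(v \mid c)$ is to multiply the threshold value by a constant determined by the class priors $P(c)$.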
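As a minimal sketch of this thresholding rule, assume the two class-conditional patch likelihoods are already available as lists indexed by patch. The function name, the small constant `eps`, and the toy numbers are illustrative assumptions; placing man-made in the numerator is also an assumption, chosen because man-made is the positive class in the evaluation.

```python
def classify_patches(p_v_manmade, p_v_natural, threshold=1.0, eps=1e-12):
    """Label each vocabulary patch by thresholding its class-likelihood ratio.

    p_v_manmade[v] and p_v_natural[v] are the class-conditional likelihoods
    of patch v. eps is a hypothetical guard against division by zero.
    """
    labels = []
    for p_m, p_n in zip(p_v_manmade, p_v_natural):
        ratio = (p_m + eps) / (p_n + eps)
        labels.append("man-made" if ratio > threshold else "natural")
    return labels


# Toy vocabulary of three patches (invented values, for illustration only).
labels = classify_patches([0.6, 0.1, 0.3], [0.2, 0.5, 0.3])
```

Because the decision depends only on the patch index, every image region quantized to the same patch receives the same label, which is the limitation that the image-dependent model in Section 6 is meant to address.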
4. Image Representation
In what follows, we describe and further justify the four steps that we take to build our image representation: (i) detection of interest points/patches, (ii) computation of local descriptors, (iii) local descriptor quantization, and (iv) construction of the patch histogram.
4.1. Detection of Interest Points
The goal of the interest point detector is to automatically extract characteristic points from a given image, which are invariant to some geometric and photometric transformations. These points define image regions which are also invariant to the same transformations. Invariance is an important property since it ensures that given an image and its transformed version, equivalent image patches will be extracted from both, and the resulting image representation will be the same (within a certain estimation error).
Different point detectors have been proposed to extract regions of interest in images [5, 36]. They vary mostly by the amount of invariance they theoretically ensure, the image property they exploit to achieve invariance, and the type of image structures they are designed to detect. However, the increase in invariance also means that different points can become more similar after invariance regularization. In this way, we must also restrain invariance since a big increase in the degree of invariance may remove information about the local image content which is valuable for classification.
In this work, we use the difference of Gaussians (DOG) point detector. This detector essentially identifies blob-like regions where a maximum or minimum of intensity occurs in the image, and it is invariant to translation, scale, rotation, and constant illumination variations. We chose this detector since it was shown to perform well in previously published comparison studies [37, 38], and also since we found it to be a good choice in practice for the task at hand, performing competitively compared to other detectors. The DOG detector is also faster than similarly performing, fully affine-invariant ones.
4.2. Computation of Local Descriptors
4.3. Local Descriptor Quantization
Each local descriptor is assigned to its nearest cluster center, and the number of cluster centers defines the size of the patch set. We used the Euclidean distance in the clustering (and in (3)) and chose the number of clusters depending on the desired vocabulary size. The choice of the Euclidean distance to compare SIFT features is common.
Technically, the quantization of similar local descriptors into a single patch can be thought of as being similar to the stemming preprocessing step of text documents, which consists of replacing all words by their stem. The rationale behind stemming is that the meaning of words is carried by their stem rather than by their morphological variations . The same motivation applies to the quantization of descriptors into patches.
Furthermore, local descriptors will be considered as distinct whenever they are mapped to different patches, regardless of whether they are close or not in the SIFT feature space. This also resembles the text modeling approach which considers that all information is in the stems, and that any distance defined over their representation (e.g., strings in the case of text) carries no semantic meaning.
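The quantization step and the resulting patch histogram (steps (iii) and (iv) of the image representation) can be sketched as follows. This is only an illustration under stated assumptions: the cluster centers are taken as given (for instance from a K-means run), the descriptors are two-dimensional toy points rather than SIFT vectors, and the function names are hypothetical.

```python
import math
from collections import Counter


def quantize(descriptor, centers):
    """Return the index of the nearest cluster center (Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda k: math.dist(descriptor, centers[k]))


def patch_histogram(descriptors, centers):
    """Count how often each patch index occurs among an image's descriptors."""
    counts = Counter(quantize(d, centers) for d in descriptors)
    return [counts.get(k, 0) for k in range(len(centers))]


# Toy data: three cluster centers and four local descriptors from one image.
centers = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.2), (0.9, 1.2), (4.0, 6.0), (0.0, -0.1)]
hist = patch_histogram(descriptors, centers)
```

After quantization only the patch index is kept, so two descriptors that fall on either side of a cluster boundary are treated as unrelated, however close they are in feature space.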
4.4. Patch Histogram
5. Scenes as Mixtures of Aspects
The concept of aspect models for images has been recently applied to scene [8, 15, 21] and object [40, 41] categorization tasks, using the estimated distribution over aspects as a feature extraction process, or directly as a classifier. Under the assumption of an aspect model, an image can be seen as a mixture of unobserved (latent) aspects that are defined by consistent co-occurrences of image patches (or their features) within the image collection. A latent aspect $z$ is thus represented by its conditional distribution over patches $P(v \mid z)$, and an image $d$ is represented by the conditional distribution over aspects $P(z \mid d)$.
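To make the mixture view concrete, the sketch below fits a standard PLSA model by EM on a small count matrix, writing P(v|z) for the distribution of an aspect over patches and P(z|d) for the distribution of an image over aspects. It is a plain textbook formulation in pure Python, not the authors' implementation; the random initialization, the iteration count, and the toy counts are assumptions.

```python
import random


def normalize(xs):
    """Scale a list of non-negative numbers so that it sums to one."""
    s = sum(xs)
    return [x / s for x in xs]


def plsa(counts, n_aspects, n_iter=50, seed=0):
    """Fit PLSA by EM. counts[d][v] = occurrences of patch v in image d.

    Returns (p_v_z, p_z_d), where p_v_z[z][v] approximates P(v|z) and
    p_z_d[d][z] approximates P(z|d).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_v_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_aspects)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_aspects)])
             for _ in range(n_docs)]
    for _ in range(n_iter):
        # Tiny pseudo-counts keep every row normalizable.
        new_v_z = [[1e-12] * n_words for _ in range(n_aspects)]
        new_z_d = [[1e-12] * n_aspects for _ in range(n_docs)]
        for d in range(n_docs):
            for v in range(n_words):
                if counts[d][v] == 0:
                    continue
                # E-step: posterior over aspects for this (image, patch) pair.
                post = normalize([p_v_z[z][v] * p_z_d[d][z] + 1e-300
                                  for z in range(n_aspects)])
                # Accumulate expected counts for the M-step.
                for z in range(n_aspects):
                    w = counts[d][v] * post[z]
                    new_v_z[z][v] += w
                    new_z_d[d][z] += w
        # M-step: renormalize the expected counts.
        p_v_z = [normalize(row) for row in new_v_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_v_z, p_z_d


# Toy collection: two images use patches 0-1, two images use patches 2-3.
counts = [[10, 8, 0, 0], [9, 11, 0, 0], [0, 0, 7, 12], [0, 0, 10, 9]]
p_v_z, p_z_d = plsa(counts, n_aspects=2)
```

No class labels enter this procedure, which is why the aspects can be learned from the large unlabeled part of a collection.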
5.1. Scene Modeling with PLSA
5.2. Mapping Aspects to Local Image Patches
Based on this idea, we present in Section 6 two aspect models that extend the PLSA model for image patch classification.
6. Aspect Models for Patch Classification
As introduced in Section 3, our goal is to classify image regions based on the estimated class likelihood ratio of their corresponding patches, as described in (1). In what follows, we propose two aspect models that estimate patch class-likelihoods based on the decomposition of scenes into a mixture of aspects. The observed data is composed of patch, document, and class triplets for each patch occurrence in a labeled training set.
The first aspect model classifies patches independently of the image they belong to and can thus be seen as a probabilistic formulation of the idea presented at the end of Section 5, where the assumption was that an aspect could only be associated with one class (i.e., the probability of a class given an aspect is either 0 or 1). The second model takes full advantage of the patch histogram context, and allows us to estimate patch class-likelihoods that depend on the image that is considered.
6.1. Aspect Model 1
This model introduces two conditional independence assumptions. The first one, traditionally encountered in aspect models, is that the occurrence of a patch is independent of the image it belongs to, given an aspect $z$. The second assumption is that the occurrence of aspects is independent of the class the patch belongs to, that is, $P(z \mid d, c) = P(z \mid d)$. Note that in (10), the class label refers to the class of one patch. Thus, different class labels can be associated with a given document, and the term relating the document to the class reflects the degree to which an image indirectly belongs to a given class given its patches. The parameters of this model are learned using the maximum likelihood (ML) principle. The optimization is conducted using the expectation-maximization (EM) algorithm, allowing us to learn the aspect distributions $P(v \mid z)$ and the mixture parameters $P(z \mid d)$.
Notice that, given our model, the EM equations do not depend on the patch class label. Besides, the estimation of the class-conditional probabilities does not require the use of the EM algorithm. We will exploit these points to train the aspect models on a large dataset in which only a small subset has been manually labeled at the image level. This labeling at the image level allows us to quickly annotate a large number of patches as man-made or natural, but does not imply that images have one class in general. We only assume that each patch has a class label.
These equations allow us to estimate the likelihood ratio as defined by (1). Note that this model extends PLSA by introducing the class variable $c$.
6.2. Aspect Model 2
The main difference with (12) is the introduction of the contextual term $P(z \mid d)$, which means that patches will not only be classified based on their association with class-likely aspects but also on the specific occurrence of these aspects in the given image.
Inference on New Images
With aspect model 1 (and also with the empirical distribution, cf. the baseline model in Section 7), the patch classification decision is taken once and for all at training time, through the patch co-occurrence analysis on the training images. Thus, for a new image, the extracted patches are directly assigned to their corresponding most likely class label. For aspect model 2, however, the likelihood ratio (14) involves the image-dependent aspect parameters (17). Given our approximation (16), these parameters have to be inferred for each new image, in a similar fashion as for PLSA. The aspect mixture weights of the new image are estimated by maximizing the likelihood of its patch histogram, keeping the learned aspect distributions fixed in the maximization step.
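The inference step for a new image can be sketched as the usual PLSA "folding-in" procedure: EM is run on the new patch histogram while the learned aspect distributions over patches stay fixed, so only the mixture weights of the new image are updated. The sketch below is a generic illustration, not the authors' code; the uniform initialization, the iteration count, and the toy aspect distributions are assumptions.

```python
def fold_in(hist, p_v_z, n_iter=50):
    """Estimate the aspect weights P(z|d_new) of a new image by EM.

    hist[v] is the count of patch v in the new image, and p_v_z[z][v] is the
    learned (fixed) probability of patch v under aspect z.
    """
    n_aspects = len(p_v_z)
    p_z = [1.0 / n_aspects] * n_aspects  # uniform start (an assumption)
    for _ in range(n_iter):
        acc = [0.0] * n_aspects
        for v, n in enumerate(hist):
            if n == 0:
                continue
            # E-step: responsibility of each aspect for patch v.
            joint = [p_v_z[z][v] * p_z[z] for z in range(n_aspects)]
            total = sum(joint)
            if total == 0.0:
                continue  # patch unexplained by every aspect; skip it
            for z in range(n_aspects):
                acc[z] += n * joint[z] / total
        s = sum(acc)
        if s == 0.0:
            break
        # M-step: update only the mixture weights; p_v_z is left untouched.
        p_z = [a / s for a in acc]
    return p_z


# Two toy aspects with disjoint support over a four-patch vocabulary.
p_v_z_fixed = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
weights_pure = fold_in([6, 4, 0, 0], p_v_z_fixed)
weights_mixed = fold_in([5, 5, 5, 5], p_v_z_fixed)
```

Since these weights differ from image to image, the same patch can end up with different class likelihoods depending on which other patches appear alongside it.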
7. Baseline Models
We propose two complementary baseline models. The first baseline directly uses the empirical patch class-conditional distribution to classify new patches. The second learns a model from the region descriptors themselves, without quantization.
7.1. Empirical Class-Conditional Patch Distribution
Empirical estimation of probabilities is simple but may suffer from several drawbacks. The first is that a large amount of labeled training data might be necessary to avoid noisy estimates, especially when using large vocabulary sizes. The second is that such estimation only reflects the individual patch occurrences, and does not account for any kind of relationship between them. Patches, however, correspond to regions extracted from full images, and, therefore, should be better interpreted in this context. In particular, we see in Figure 11 that even if the two class-conditional patch distributions are estimated on the segmented image regions from the test set, there is an important ambiguity of the patches with respect to the two classes.
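For illustration, the empirical baseline amounts to counting labeled patch occurrences per class and normalizing. The sketch below assumes the training data is given as parallel lists of patch indices and class labels; the function name and the optional additive smoothing parameter `alpha` are assumptions introduced here and are not described in the text.

```python
def empirical_class_conditionals(patch_ids, class_labels, vocab_size,
                                 classes=("man-made", "natural"), alpha=0.0):
    """Estimate the likelihood of each patch given each class from counts.

    alpha is a hypothetical additive-smoothing constant; alpha=0 gives the
    plain relative frequencies.
    """
    counts = {c: [alpha] * vocab_size for c in classes}
    for v, c in zip(patch_ids, class_labels):
        counts[c][v] += 1
    result = {}
    for c in classes:
        total = sum(counts[c])
        if total == 0:
            raise ValueError(f"no labeled patches for class {c!r}")
        result[c] = [n / total for n in counts[c]]
    return result


# Toy labeled patch occurrences over a three-patch vocabulary.
ids = [0, 0, 1, 2, 2, 2]
labs = ["man-made", "man-made", "natural", "natural", "man-made", "natural"]
p = empirical_class_conditionals(ids, labs, vocab_size=3)
```

With a vocabulary of several thousand patches, many counts will be zero or very small for a modest labeled set, which is the noise problem mentioned above.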
7.2. Gaussian Mixture Model Soft Assignment
Quantizing image regions into patches discards all information about the distance of each particular local descriptor to the corresponding patch cluster center. It results in a compact representation that can be seen as a drastic simplification of the data. Two descriptors of highly similar local textures can be assigned to different patches if they are close to the border between the two clusters. This intrinsic ambiguity of the quantization approach can be questioned: in the previous example, knowing that the two regions were in fact similar could be beneficial.
8. Markov Random Field (MRF) Regularization
The contextual modeling with latent aspects that we present in this paper can be conveniently integrated with traditional spatial regularization schemes. To investigate this, we present the embedding of our contextual model within the MRF framework, though other schemes could be similarly employed [2, 11, 28].
To implement the above regularization scheme, we need to specify a neighborhood system. Several alternatives could be employed, exploiting, for instance, the scale of the invariant detector. Here, we used a simpler scheme: two points are defined to be neighbors if each is one of the nearest neighbors of the other. For this set of experiments, we defined the neighborhood to be constituted by the five nearest neighbors. Finally, in the experiments, the minimization of the energy function of (22) was conducted using simulated annealing.
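The neighborhood system described here, in which two interest points are neighbors only if each is among the nearest neighbors of the other, can be sketched with a brute-force search. The function name and the toy coordinates are assumptions, and the example uses k=1 only to keep the result easy to read, whereas the experiments use five nearest neighbors.

```python
import math


def mutual_knn(points, k):
    """Return neighbor pairs (i, j), i < j, that are mutual k-nearest neighbors."""
    knn = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        knn.append(set(others[:k]))
    return {(i, j)
            for i in range(len(points))
            for j in knn[i]
            if i < j and i in knn[j]}


# Toy interest-point locations: a tight group of three and a distant pair.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
edges = mutual_knn(points, k=1)
```

Requiring the relation to hold in both directions keeps the graph symmetric, which pairwise MRF terms need, and it avoids linking an isolated point to a dense cluster that does not list it among its own nearest neighbors.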
9. Experiments and Discussion
We validate our proposed models on natural versus man-made scene patch classification. In this section, we present our experimental setup, show a detailed performance evaluation illustrated with the patch classification results on a few test images, and we finally study the result of integrating spatial regularization.
9.1. Experimental Setup
Three image subsets from the Corel Stock Photo Library were used in the experiments. The first set contains 6600 photos depicting mountains, forests, buildings, and cities. Of these, 6000 have no associated label, while the remaining 600 images, whose content mainly belongs to one of the two classes, were hand-labeled with a single class label, giving approximately 300 images per class. This labeling at the image level is used to quickly label the corresponding patches. The full first set was used to construct the vocabulary and learn the aspect models, while the labeled subset was used, in whole or in part, to estimate the patch likelihoods for each class. A third set, containing 485 images of man-made structures in natural landscapes, hand-segmented with polygonal shapes to label the corresponding patches (Figure 1), was used to evaluate the methods.
9.1.2. Performance Measure
The global performance of the algorithm was assessed using the true positive rate (TPR, number of positive regions correctly classified over the total number of positive descriptors), false positive rate (FPR, number of false positives over the total number of negative descriptors), and true negative rate (TNR = 1-FPR), where man-made structure is the positive class. The FPR, TPR, and TNR values vary with the threshold applied for classification (see (2)).
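These rates can be computed directly from predicted and true patch labels, as in the sketch below. The function name and toy labels are illustrative. The half total recognition rate (HTRR) reported later is not defined explicitly in the text, so the code assumes it is the mean of TPR and TNR, by analogy with the half total error rate; this is an assumption.

```python
def rates(predicted, actual, positive="man-made"):
    """Compute TPR, FPR, TNR and (assumed) HTRR for a two-class labeling."""
    tp = sum(p == positive and a == positive for p, a in zip(predicted, actual))
    fp = sum(p == positive and a != positive for p, a in zip(predicted, actual))
    n_pos = sum(a == positive for a in actual)
    n_neg = len(actual) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present in the ground truth")
    tpr = tp / n_pos
    fpr = fp / n_neg
    tnr = 1.0 - fpr
    # Assumed definition: HTRR as the average of the two per-class rates.
    return {"TPR": tpr, "FPR": fpr, "TNR": tnr, "HTRR": 0.5 * (tpr + tnr)}


actual = ["man-made", "man-made", "man-made", "natural", "natural"]
predicted = ["man-made", "man-made", "natural", "natural", "man-made"]
r = rates(predicted, actual)
```

Sweeping the classification threshold and recomputing these rates traces out the trade-off between TPR and FPR for a given method.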
Results are reported with a vocabulary size ranging from 1000 to 10 000 patches, with 1000 and 2000 GMM mixture components, and with 20 aspects for aspect models 1 and 2.
9.2. Performance Evaluation
As can be seen in Figure 12(b), aspect model 1 performs slightly better than the empirical patch distribution baseline for all vocabulary sizes. However, the GMM baseline improves on both the empirical patch distribution baseline and the aspect model 1 classification performance. The GMM approach is, therefore, the best image-independent patch classification approach. Aspect model 2 significantly outperforms all other methods, demonstrating the advantage of an image-dependent patch classification. Interestingly, the aspect models do not need 100% of the 600 labeled images for a good classification performance. We can observe in Figure 12 that the same patch classification performance is achieved when using only 5% of the labeled images (30 images) to estimate the class-conditional aspect likelihood.
Half total recognition rate (in percent).
As mentioned in Section 6, aspect model 1 and the empirical distribution method (GMM and K-means based) assign specific patches to the man-made or natural class independently of the actual image in which those patches occur. This sets a common limit on the maximum performance of both systems, which is referred to here as the ideal case. This limit is given by attributing to each patch the class label corresponding to the class in which that patch occurs the most in the test data. On our data, this ideal case corresponds to an HTRR of 71.0% for the 1000-patch vocabulary, showing the advantage of an image-dependent patch classification method.
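The ideal case can be illustrated with a small oracle that looks at the labeled test data itself and assigns each patch index its majority class. The sketch below assumes that "occurs the most" means the raw majority count per patch, and it reports plain accuracy on toy data for brevity rather than HTRR; the names and values are hypothetical.

```python
from collections import Counter, defaultdict


def ideal_patch_labels(patch_ids, true_labels):
    """Map each patch index to the class it most often carries in the data."""
    votes = defaultdict(Counter)
    for v, c in zip(patch_ids, true_labels):
        votes[v][c] += 1
    return {v: counter.most_common(1)[0][0] for v, counter in votes.items()}


# Toy test set: patch indices of eight regions and their true classes.
test_ids = [0, 0, 0, 1, 1, 2, 2, 2]
test_true = ["man-made", "man-made", "natural", "natural", "natural",
             "man-made", "natural", "natural"]
mapping = ideal_patch_labels(test_ids, test_true)
oracle_pred = [mapping[v] for v in test_ids]
oracle_acc = sum(p == t for p, t in zip(oracle_pred, test_true)) / len(test_true)
```

Even this oracle misclassifies every occurrence of a patch in its minority class, so any method that fixes one label per patch is bounded in the same way.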
In order to have a chance of performing better than the ideal case, patches must be labeled differently depending on the specific image that is being segmented. Aspect model 2 switches patch class labels according to the contextual information gathered through the identification of image-specific latent aspects. In our data, successful class label switching occurs at least once for 727 out of the 1000 patches in our vocabulary.
9.3. Patch Classification Examples
9.4. Effects of the Markov Random Field Regularization
The inclusion of the MRF relaxation boosted the performance of both aspect model 2 and the empirical distribution model. However, it is important to point out that aspect model 2 still outperforms the empirical distribution model, although the empirical distribution model benefited most from the regularization. This was to be expected, as aspect model 2 was already capturing some of the contextual information that the spatial regularization can provide (notice also that the maximum is achieved for a smaller value of the regularization parameter in aspect model 2).
Besides obtaining an increase of the HTRR value, we can visually notice a better spatial coherence of the patch classification as can be seen in Figures 16 and 17. We can observe in the images that the MRF relaxation process reduces the occurrence of isolated points, and tends to increase the density of points within segmented regions. We show in the last row of Figure 16 that as can be expected when using prior modeling, on certain occasions the MRF step can over-regularize the patch classification, causing the attribution of only one label to the whole image.
10. Conclusion and Future Work
In this paper, we proposed computational models to perform contextual regional classification of images. These models enable us to exploit a different form of visual context, based on the co-occurrence analysis of patches in the whole image rather than on the more traditional spatial relationships. Patch co-occurrence is summarized into aspect models, whose relevance is estimated for any new image, and used to evaluate class-dependent patch likelihoods. These models have been tested and validated on a man-made versus natural scene image patch classification task. One model has clearly been shown to help in disambiguating polysemic patches based on the context they appear in. Producing satisfactory classification results, it outperforms state-of-the-art likelihood ratio methods, even when these use soft assignment techniques.
Moreover, we investigated the use of Markov random field models to introduce spatial coherence in the final classification and showed that the two types of context models can be integrated successfully. This additional information enables us to overcome some patch classification errors from the likelihood ratio and aspect model methods, increasing the final performance.
While the results presented here are encouraging, this task is complex, and there is a need for further improvements. Logical extensions would be the introduction of other sources of contextual information like color or scale and other forms of integration of spatial contextual information.
This work was funded by the Swiss National Science Foundation through the MULTI project and the Swiss National Center of Competence in Research on Interactive Multimodal Information Management (NCCR (IM)2).
- Kumar S, Hebert M: Discriminative random fields: a discriminative framework for contextual interaction in classification. Proceedings of the IEEE International Conference on Computer Vision (ICCV '03), October 2003, Nice, France 2: 1150-1157.
- Lazebnik S, Schmid C, Ponce J: Affine-invariant local descriptors and neighborhood statistics for texture recognition. Proceedings of the IEEE International Conference on Computer Vision (ICCV '03), October 2003, Nice, France 1: 649-655.
- Vogel J, Schiele B: Natural scene retrieval based on a semantic modeling step. Proceedings of the 3rd International Conference on Image and Video Retrieval (CIVR '04), July 2004, Dublin, Ireland, Lecture Notes in Computer Science 3115: 207-215.
- Shotton J, Winn J, Rother C, Criminisi A: TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006, Graz, Austria, Lecture Notes in Computer Science 3951: 1-15.
- Lowe DG: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 2004,60(2):91-110.
- Dorko G, Schmid C: Selection of scale-invariant parts for object class recognition. Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV '03), October 2003, Nice, France 1: 634-640.
- Fei-Fei L, Perona P: A Bayesian hierarchical model for learning natural scene categories. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), June 2005, San Diego, Calif, USA 2: 524-531.
- Quelhas P, Monay F, Odobez J-M, Gatica-Perez D, Tuytelaars T, Van Gool L: Modeling scenes with local descriptors and latent aspects. Proceedings of the IEEE International Conference on Computer Vision (ICCV '05), October 2005, Beijing, China 1: 883-890.
- Sivic J, Russell BC, Efros AA, Zisserman A, Freeman WT: Discovering object categories in image collections. Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), October 2005, Beijing, China.
- Li SZ: Markov Random Field Modeling in Computer Vision. Springer, New York, NY, USA; 1995.
- Kumar S, Hebert M: Man-made structure detection in natural images using a causal multiscale random field. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '03), June 2003, Madison, Wis, USA 1: 119-126.
- Murphy K, Torralba A, Freeman W: Using the forest to see the trees: a graphical model relating features, objects and scenes. Proceedings of the Neural Information Processing Systems, December 2003, Vancouver, Canada.
- Geman S, Geman D: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 1984,6(6):721-741.
- Quelhas P, Monay F, Odobez J-M, Gatica-Perez D, Tuytelaars T: A thousand words in a scene. IEEE Transactions on Pattern Analysis and Machine Intelligence 2007,29(9):1575-1589.
- Quelhas P, Odobez J-M: Multi-level local descriptor quantization for bag-of-visterms image representation. Proceedings of the 6th ACM International Conference on Image and Video Retrieval (CIVR '07), July 2007, Amsterdam, The Netherlands 242-249.
- Csurka G, Dance C, Fan L, Willamowski J, Bray C: Visual categorization with bags of keypoints. Proceedings of the European Conference on Computer Vision (ECCV '04), May 2004, Prague, Czech Republic 59-74.
- Marszałek M, Schmid C: Spatial weighting for bag-of-features. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), June 2006, New York, NY, USA 2: 2118-2125.
- Wang G, Zhang Y, Fei-Fei L: Using dependent regions for object categorization in a generative framework. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), June 2006, New York, NY, USA 2: 1597-1604.
- Sivic J, Zisserman A: Video Google: a text retrieval approach to object matching in videos. Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV '03), October 2003, Nice, France 2: 1470-1477.
- Willamowski J, Arregui D, Csurka G, Dance CR, Fan L: Categorizing nine visual classes using local appearance descriptors. Proceedings of the Workshop on Learning for Adaptable Visual Systems (LAVS), in Conjunction with the International Conference on Pattern Recognition (ICPR '04), August 2004, Cambridge, UK.
- Bosch A, Zisserman A, Muñoz X: Scene classification via pLSA. Proceedings of the 9th European Conference on Computer Vision (ECCV '06), May 2006, Graz, Austria, Lecture Notes in Computer Science 3954: 517-530.
- Hofmann T: Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 2001,42(1-2):177-196.
- Blei DM, Ng AY, Jordan MI: Latent Dirichlet allocation. Journal of Machine Learning Research 2003,3(4-5):993-1022.
- Pal NR, Pal SK: A review on image segmentation techniques. Pattern Recognition 1993,26(9):1277-1294. 10.1016/0031-3203(93)90135-J
- Carson C, Thomas M, Belongie S, Hellerstein JM, Malik J: Blob-world: a system for region-based image indexing and retrieval. In Proceedings of the 3rd International Conference on Visual Information Systems (VISUAL '99), June 1999, Amsterdam, The Netherlands. Springer; 509-516.
- Duygulu P, Barnard K, de Freitas JFG, Forsyth DA: Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. Proceedings of the 7th European Conference on Computer Vision (ECCV '02), May 2002, Copenhagen, Denmark, Lecture Notes in Computer Science 2353: 97-112.
- Vogel J, Schiele B: A semantic typicality measure for natural scene categorization. Proceedings of the 26th DAGM Pattern Recognition Symposium, August-September 2004, Tübingen, Germany, Lecture Notes in Computer Science 3175: 195-203.
- Verbeek JJ, Triggs B: Region classification with Markov field aspect models. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), June 2007, Minneapolis, Minn, USA 1-8.
- Shi J, Malik J: Normalized cuts and image segmentation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '97), June 1997, San Juan, Puerto Rico, USA 731-737.
- Kumar MP, Torr PHS, Zisserman A: Obj cut. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), June 2005, San Diego, Calif, USA 1: 18-25.
- Leibe B, Leonardis A, Schiele B: Combined object categorization and segmentation with an implicit shape model. Proceedings of the 8th European Conference on Computer Vision (ECCV '04), May 2004, Prague, Czech Republic 17-32.
- Russell BC, Freeman WT, Efros AA, Sivic J, Zisserman A: Using multiple segmentations to discover objects and their extent in image collections. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), June 2006, New York, NY, USA 2: 1605-1612.
- Monay F, Quelhas P, Odobez J-M, Gatica-Perez D: Integrating co-occurrence and spatial contexts on patch-based scene segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPR '06), June 2006, New York, NY, USA 14.
- Cao L, Fei-Fei L: Spatially coherent latent topic model for concurrent object segmentation and classification. Proceedings of the 11th IEEE International Conference on Computer Vision (ICCV '07), October 2007, Rio de Janeiro, Brazil.
- Liu D, Chen T: Background cutout with automatic object discovery. Proceedings of IEEE International Conference on Image Processing (ICIP '07), September 2007, San Antonio, Tex, USA 4: 345-348.
- Tuytelaars T, Van Gool L: Content-based image retrieval based on local affinely invariant regions. Proceedings of the 3rd International Conference on Visual Information and Information Systems, June 1999, Amsterdam, The Netherlands, Lecture Notes in Computer Science 1614: 493-500.
- Mikolajczyk K, Schmid C: A performance evaluation of local descriptors. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '03), June 2003, Madison, Wis, USA 2: 257-263.
- Mikolajczyk K, Tuytelaars T, Schmid C, et al.: A comparison of affine region detectors. International Journal of Computer Vision 2005,65(1-2):43-72. 10.1007/s11263-005-3848-x
- Baeza-Yates R, Ribeiro-Neto B: Modern Information Retrieval. ACM Press, New York, NY, USA; 1999.
- Monay F, Quelhas P, Gatica-Perez D, Odobez J-M: Constructing visual models with a latent space approach. Proceedings of the Subspace, Latent Structure and Feature Selection, Statistical and Optimization, Perspectives Workshop (SLSFS '05), February 2006, Bohinj, Slovenia, Lecture Notes in Computer Science 3940: 115-126.
- Fergus R, Fei-Fei L, Perona P, Zisserman A: Learning object categories from Google's image search. Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), October 2005, Beijing, China 2: 1816-1823.
- Buntine WL: Variational extensions to EM and multinomial PCA. Proceedings of the 13th European Conference on Machine Learning, August 2002, Helsinki, Finland, Lecture Notes in Computer Science 2430: 23-34.
- Bishop C: Neural Networks for Pattern Recognition. Oxford University, Oxford, UK; 1995.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.