Multimodal few‑shot classification without attribute embedding

Two main approaches have been explored: learning in a joint embedding space of visual and semantic information [7], and leveraging textual descriptions to generate additional training images [8, 9]. In this paper, while the visual information is represented in an embedding space, the semantic information is used in its raw form, i.e., as image attributes without embedding. Two advantages follow. First, the model becomes interpretable in that the specific attributes that contribute to a particular classification are immediately evident. In fact, the model is inherently interpretable, implying that there is no need for additional visualization steps such as layer-wise relevance propagation [10] or Grad-CAM and its variants [11]. Although [12] visualize attention maps directly from the learned latent embedding of a variational autoencoder (VAE), it is not evident that the method would work in a few-shot setting. In [13], a separate language model is trained to produce an explanation for a given feature embedding and class label. The second advantage is that, in effect, the model learns the composition of an image. Compositionality is integral to the human representation of a concept by way of decomposing it into parts, and cognitive studies have demonstrated its critical role in human vision [14]. The semantic information readily provided by the model identifies not only the parts in the image but also their attributes.
The method proposed in this paper is based on a hybrid autoencoder framework, i.e., it contains a basic autoencoder as well as a variational autoencoder (VAE). First, image features are encoded into a semantic space by the former, where the encoded semantic features are enforced to be close to the binary ground-truth attributes. The attributes contributing to a classification can be directly read off from the semantic space. The learnt encoder weights are retained while the VAE encoder learns the embedding of image features into the visual space. The visual and semantic features are concatenated and decoded for classification. Thus, the proposed framework performs multimodal few-shot classification while directly providing the attributes for a classified image, which makes the model interpretable. An example application of the model is in the aerospace manufacturing industry, which needs to identify defective parts. Regulatory authorities require that decisions taken by a machine in such critical industries be interpretable. By performing classifications that are interpretable, the proposed framework can be extended to such real-world scenarios.
In [7], the authors address the question of how expensive it is to label images with attributes and, furthermore, how to define a vocabulary for the attributes. They state that labeling 159 category-level attributes for a subset of ImageNet images took only 3 days, noting that the novel classes did not need attribute annotation.
The main contributions of this paper are as follows: (1) A multimodal framework is proposed, consisting of a basic autoencoder whose semantic features directly provide the attributes for a classified image, and a VAE that provides the visual features that are concatenated with the semantic features for few-shot classification. (2) The model outperforms other multimodal methods in the 50-way K-shot setting on the CUB dataset, and has comparable results for fewer ways, while simultaneously explaining the results without additional training. Note that neither zero-shot learning [15, 16] nor generalized few-shot learning [17] is addressed in this paper. (3) The model is shown to be interpretable using the attributes predicted as part of the model.

Few-shot learning
A comprehensive survey on few-shot learning is presented in [18], where the three main approaches are data augmentation, reducing the space of hypotheses that map input features to labels, and searching for the parameters of the best hypothesis. Here, some state-of-the-art methods are picked and reviewed briefly.
One approach to data augmentation for FSL is to use hand-crafted rules, as in [19], where rotations and translations of images are used to augment the dataset to train a VAE for generating new samples. By generating additional samples, this circumvents the problem of having a small amount of data in a few-shot learning setting. The main disadvantage of hand-crafted rules is that it is impossible to enumerate all possible variations. In addition, applying such rules is costly and requires domain knowledge. To avoid hand-crafted rules, Generative Adversarial Network (GAN) models have been used to generate synthetic data [20]. The GAN model learns from other, larger datasets first before being used on few-shot datasets. This allows new samples to be generated without the need for hand-crafted rules. One disadvantage is that the larger datasets used to train the GAN model have to be related to the few-shot datasets, which in practice might not always be possible.
Storing knowledge from training data as an external memory is a method that reduces the space of hypotheses. Instead of using embeddings of samples directly, [21] uses these embeddings as keys to query the most similar memory slots. The values of the most similar slots are combined to form the representation of the sample. This allows the model to predict based on these representations instead of embeddings of few-shot examples that might not be sufficient. The downside is that manipulating the memory is expensive, so the memory is typically kept small. When the memory is full, a decision on which slot to replace has to be made, which may degrade performance if the wrong slot is chosen. More recently, using embedding learning to reduce the search space, [2] suggests that training a linear classifier on top of a supervised or self-supervised representation is sufficient for few-shot learning, implying that only a good embedding is required for few-shot classification. This, however, raises the question of how to obtain a good embedding. For classifying common objects, obtaining such an embedding might be easy; in more complicated cases, for example in industrial applications, it might not be.
FSL can also be approached by searching for the parameters of the best hypothesis. This can be done by teaching an optimizer to find the optimal update at every step [22], allowing the step size or search direction to be determined by the learned optimizer instead of hand-crafted update rules. However, this raises the issue of how to transfer the optimizer between different data sources or granularities.

Multimodal few-shot learning
Multimodal few-shot learning extends regular few-shot learning by including additional modalities. Common modalities include attributes like those used in zero-shot learning, such as the shape and color of bird parts, objects in a scene, or the color and behavior of animals. Song et al. [23] present a more recent survey on few-shot learning that includes a section on multimodal few-shot learning. Here, some methods in multimodal few-shot learning are discussed.
CCAM [24] encodes context and visual information into the same embedding space, allowing contextual prototypes to be used instead of real labels. Classification is done by comparing distances to these prototypes, so contextual information can be used for few-shot learning. Schwartz et al. [25] refine visual prototypes by using Multi-Layer Perceptrons (MLPs) to generate semantic prototypes; the visual and semantic prototypes are then combined into a final prototype that is more suitable for few-shot learning. Pahde et al. [26] introduce hallucinated samples conditioned on textual descriptions as an augmented dataset, enabling additional samples to be generated for few-shot learning. Using prototypical networks, [8] propose a multimodal prototypical network that maps semantic information to the visual space for a better prototype, allowing additional visual features to be generated as prototypes for multimodal few-shot learning. Instead of modality-alignment methods, [27] introduce an adaptive modality mixture mechanism for multimodal few-shot learning; combining the visual and textual modalities significantly improves performance on few-shot learning problems. Chen et al. [28] approach the problem by mapping samples to the semantic space and augmenting them with noise; these augmented features can then be projected back to the visual space to generate new samples. By constraining image representations to predict natural language, [29] use language as a bottleneck to reconstruct the features used for classification; this use of natural language proved to help significantly with few-shot learning. Mu et al. [30] use the same constraints on image representations but classify with the learned visual representations, further improving the previous method without needing the language model at test time; this makes the model simpler and more data efficient. Compositionality for multimodal few-shot learning is addressed in [7] by applying constraints that maximize the similarity between the image and textual representations; improved multimodal few-shot performance is shown when compositions in images are learned.
These methods achieve good performance in multimodal few-shot learning but are not interpretable because the modalities must be embedded. Interpretable few-shot learning has been presented in [13], but only by learning a language model to generate captions from the feature space, which is a separate module from the basic framework and requires an additional step on top of classification.
Multimodal learning has also been explored in the zero-shot learning setting. To account for problems in generation shifts such as semantic inconsistency, [31] introduce a generative flow framework using conditional affine coupling layers. Some generalized zero-shot learning methods introduce small amounts of visual information into their existing frameworks for the generalized few-shot learning setting. By aligning embeddings of visual and other modalities using VAEs, [17] perform generalized zero-shot learning using embeddings of other modalities as classification samples. Samuel et al. [32] address the zero-shot learning problem by introducing a module that tackles the long-tail problem by rebalancing class predictions on a sample-by-sample basis. In both methods, it is shown that introducing small amounts of visual information enables generalized few-shot learning as well.

Variational autoencoders (VAE)
An autoencoder consists of a combination of an encoder and a decoder and aims to learn a latent representation of given data by constraining the information flowing through the network with a bottleneck. The latent representation is learnt by minimizing a loss between the input x to the encoder and the output f(x) of the decoder. If the loss is the L1 distance, it is given by

L = E[ ||x − f(x)||_1 ],   (1)

where the expectation is taken over the training data.
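As a minimal sketch of Eq. (1) (not the paper's implementation), the L1 reconstruction objective can be computed as below; the tiny linear encoder/decoder and the dimensions are illustrative assumptions only.

```python
import numpy as np

def l1_reconstruction_loss(x, x_rec):
    # E[ ||x - f(x)||_1 ]: sum absolute error over feature dimensions,
    # then average over the batch (the expectation over training data).
    return float(np.mean(np.sum(np.abs(x - x_rec), axis=1)))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))            # batch of 4 samples, 8-dim features
W_enc = rng.normal(size=(8, 3)) * 0.1  # encoder to a 3-dim bottleneck
W_dec = rng.normal(size=(3, 8)) * 0.1  # decoder back to 8 dims
x_rec = (x @ W_enc) @ W_dec            # f(x): encode, then decode
loss = l1_reconstruction_loss(x, x_rec)
```

A perfect reconstruction drives the loss to zero, which is what training the encoder/decoder pair aims for.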
The irregularity in the latent space of an autoencoder arising from overfitting is addressed by forcing the encoder to return a distribution over the latent space as opposed to a single point. This structure is called a variational autoencoder [33]. Consider the latent representation z to be sampled from a prior distribution p(z). The encoder outputs the parameters of the distribution of the encoded variable given the input, q_θ(z | x). The decoder takes the latent representation as input and outputs the parameters of the distribution of the data, p_φ(x | z). The loss function is given by

L(θ, φ) = −E_{q_θ(z|x)}[ log p_φ(x | z) ] + KL( q_θ(z | x) || p(z) ).   (2)

The first term is the reconstruction loss that forces the decoder to learn to reconstruct the data. The second term is a regularizer whose objective is to make the distributions returned by the encoder close to a standard Gaussian distribution. This organizes the latent space so that encodings of similar data points are close together. It is implemented through the KL divergence between the encoder's distribution q_θ(z | x) and the prior p(z).
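For a diagonal Gaussian encoder and a standard normal prior, the KL term of Eq. (2) has the well-known closed form, sketched here with a simple squared error standing in for the negative log-likelihood (an illustrative simplification, not the paper's exact likelihood):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian, summed
    # over latent dimensions and averaged over the batch.
    return float(np.mean(
        0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)))

def vae_loss(x, x_rec, mu, log_var):
    # Reconstruction term plus the KL regularizer of Eq. (2).
    recon = float(np.mean(np.sum((x - x_rec) ** 2, axis=1)))
    return recon + kl_to_standard_normal(mu, log_var)

zero_kl = kl_to_standard_normal(np.zeros((2, 3)), np.zeros((2, 3)))
pos_kl = kl_to_standard_normal(np.ones((2, 3)), np.zeros((2, 3)))
```

When the encoder already outputs the prior (mu = 0, log-variance = 0), the KL term vanishes; any deviation makes it positive.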

Proposed method
Let C be the discrete label space, split into base classes C_base and novel classes C_novel. For the multimodal few-shot setting, the training set D_train = {(x_i, y_i, m_i^(u))}_{i=1}^{N} contains a sufficient number of data samples drawn from C_base, where x_i is the image data with class label y_i and the m_i^(u) are different modalities indexed by u. The support set D_support = {(x_i, y_i, m_i^(u))}_{i=1}^{s} consists of similar triplets drawn from C_novel classes that contain fewer data, and the query set D_test = {x_i}_{i=1}^{l} is drawn from the C_novel classes. Thus, D_train is the meta-training set, and D_support and D_test together make up the meta-testing set. In this case, u = 1 for the text modality. Figure 1 shows the architecture of the proposed multimodal few-shot learning framework. It consists of (i) a basic autoencoder whose encoder is denoted as E_s and decoder as D, (ii) a VAE whose encoder is E_v and which shares the same decoder D, and (iii) two additional encoders, denoted CE_s and CE_v, that ensure cyclic consistency for the semantic and visual components, respectively.
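The episodic split described above can be sketched as follows; the sampling routine and data layout are illustrative assumptions, not the authors' code:

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query):
    # Sample an N-way K-shot episode: pick n_way novel classes, then
    # k_shot support and n_query query samples per class, disjointly.
    classes = random.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for c in classes:
        picks = random.sample(data_by_class[c], k_shot + n_query)
        support += [(s, c) for s in picks[:k_shot]]
        query += [(q, c) for q in picks[k_shot:]]
    return support, query

# Hypothetical novel split: 10 classes with 30 samples each.
novel = {c: [f"img_{c}_{i}" for i in range(30)] for c in range(10)}
support, query = sample_episode(novel, n_way=5, k_shot=2, n_query=3)
```

A 50-way K-shot evaluation, as used later on CUB, corresponds to n_way = 50 over the 50 novel classes.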
Following [8], CNN features from a pre-trained ResNet-18 model are used. These features are encoded into a semantic representation of the image, z_s, which is the latent variable that the model learns in order to represent the attributes of the image. It is from this representation that the attributes contributing towards a certain classification can be read off. The learning of z_s is achieved by forcing the representation towards a binary attribute vector t whose elements are 1 when an attribute is present and 0 otherwise. Eventually, the learnt latent representation consists of the probability of each attribute in the image, enforced through a sigmoid function. Thus, the proposed framework is naturally interpretable, without additional computation for visualization or other models for interpretation. A Bernoulli latent representation in an autoencoder has been studied in [34]. The formulation of the semantic loss can be considered a multi-label problem where the targets are the attributes, written in the form of a binary cross-entropy loss as

L_S = − Σ_i Σ_j [ t_ij log p(t_ij) + (1 − t_ij) log(1 − p(t_ij)) ],

where t_ij is the jth attribute for sample i and p(t_ij) is its estimated probability.
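The semantic loss is a standard multi-label binary cross-entropy over sigmoid outputs; a minimal sketch (with illustrative logits) is:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def semantic_bce(logits, t, eps=1e-7):
    # Multi-label BCE between estimated attribute probabilities
    # p(t_ij) = sigmoid(logits) and binary targets t_ij, averaged
    # over samples i and attributes j.
    p = np.clip(sigmoid(logits), eps, 1.0 - eps)
    return float(-np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))

t = np.array([[1.0, 0.0, 1.0]])       # ground-truth attribute vector
good = np.array([[8.0, -8.0, 8.0]])   # logits that agree with t
bad = np.array([[-8.0, 8.0, -8.0]])   # logits that disagree with t
```

Logits that agree with the binary attribute vector drive the loss towards zero, so the sigmoid outputs themselves become the readable per-attribute probabilities.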
The VAE uses encoder E_v to learn a latent representation of the visual feature x_i, parameterized by a Gaussian distribution with mean μ_v and standard deviation σ_v. These parameters are used to randomly generate the latent visual representation z_v through the reparameterization trick [33]. The outputs of both encoders are then concatenated into a latent variable z_c that fully describes the image in terms of its visual features as well as its attributes. A single decoder D is used to reconstruct the image features from z_c. Thus, the two encoders and the decoder form what is named in this paper a hybrid autoencoder, consisting of a basic autoencoder and a VAE. The reconstruction loss for the basic autoencoder is taken as

L_R = ||x_i − x′_i||_1,

where x′_i is the reconstructed image feature. As described in Sect. 3, the loss function for a VAE consists of the reconstruction loss and a regularizer. Since the semantic information is encapsulated in the basic autoencoder, the expected log-likelihood term of the VAE reconstruction loss is replaced with L_R. The regularizer term is retained as the KL divergence between the distribution of encoder E_v, q_φ(z_v | x_i), and the prior p_θ(z_v). Taken together, the loss function of the hybrid autoencoder is

L_H = α L_R + β KL( q_φ(z_v | x_i) || p_θ(z_v) ),

where α and β are the weights of each component.

Fig. 1: The proposed model for multimodal few-shot learning with the losses between components shown. The reconstruction loss is enforced between the input x_i and the reconstructed features x′_i. The semantic loss is enforced on the output of encoder E_s to ensure that it is as close as possible to the ground-truth attributes. To ensure cyclic consistency, two different losses are used, one for each modality: the cosine similarity loss between z_s and z′_s for the semantics, and the L2 loss between z_v and z′_v for the visual features.
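The hybrid forward pass above reduces to a reparameterized sample, a concatenation, and a weighted loss. A sketch under illustrative dimensions and weights (the decoder output is stubbed with random values, as only the shapes and loss wiring matter here):

```python
import numpy as np

rng = np.random.default_rng(1)

def reparameterize(mu_v, log_var_v):
    # Reparameterization trick: z_v = mu + sigma * eps, eps ~ N(0, I).
    eps = rng.standard_normal(mu_v.shape)
    return mu_v + np.exp(0.5 * log_var_v) * eps

mu_v = np.zeros((2, 4))
log_var_v = np.zeros((2, 4))
z_v = reparameterize(mu_v, log_var_v)       # visual latent (VAE branch)
z_s = rng.uniform(size=(2, 6))              # attribute probabilities (AE branch)
z_c = np.concatenate([z_s, z_v], axis=1)    # joint latent fed to decoder D

# Hybrid loss L_H = alpha * L_R + beta * KL, with L_R an L1 term.
alpha, beta = 1.0, 0.5                      # illustrative weights
x = rng.normal(size=(2, 8))
x_rec = rng.normal(size=(2, 8))             # stand-in for D(z_c)
L_R = np.mean(np.sum(np.abs(x - x_rec), axis=1))
KL = np.mean(0.5 * np.sum(np.exp(log_var_v) + mu_v**2 - 1.0 - log_var_v,
                          axis=1))
L_H = alpha * L_R + beta * KL
```

Note that z_c has dimension dim(z_s) + dim(z_v); in the actual model this is the number of attributes plus 64.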
Next, cyclic consistency is considered to ensure that the reconstructed feature x′_i generated by the hybrid autoencoder fully encodes both the semantic and visual information in the image. To this end, x′_i is converted back into latent semantic and visual representations, z′_s and z′_v, through encoders CE_s and CE_v, which have the same structure as E_s and E_v, respectively. The semantic similarity between the encoded semantic representations is measured by the cosine distance. For visual similarity, the output of CE_v is compared with the output of E_v using the L2 loss.
Applying the visual cyclic constraint to μ_v results in a softer constraint than applying it to z_v, since in the latter case the similarity of a specific sample to its reconstruction is maximized, as opposed to maximizing similarity to the mean of the distribution. From the experiments shown in Table 1, it is observed that the more restrictive constraint results in a better-performing model, specifically at lower numbers of shots. Maximizing the similarities corresponds to minimizing the representation consistency loss [35]:

L_C = ( 1 − cos(z_s, z′_s) ) + ||z_v − z′_v||²_2,

where cos is the cosine similarity and ε = 0.1 is a constant added to its denominator to avoid division by zero.
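A sketch of this cyclic loss follows; the exact functional form (one minus cosine similarity plus a squared L2 term, with ε in the cosine denominator) is an assumption consistent with the description above, not verified against the authors' code:

```python
import numpy as np

def cyclic_loss(z_s, z_s_cyc, z_v, z_v_cyc, eps=0.1):
    # Semantic branch: cosine term, with eps guarding the denominator.
    cos = float(z_s @ z_s_cyc) / (
        np.linalg.norm(z_s) * np.linalg.norm(z_s_cyc) + eps)
    # Visual branch: squared L2 distance between latents.
    return (1.0 - cos) + float(np.sum((z_v - z_v_cyc) ** 2))

z_s = np.array([1.0, 0.0, 1.0])
z_v = np.array([0.5, -0.5])
self_loss = cyclic_loss(z_s, z_s, z_v, z_v)        # consistent cycle
off_loss = cyclic_loss(z_s, -z_s, z_v, z_v + 1.0)  # inconsistent cycle
```

Because of ε, even a perfectly consistent cycle leaves a small positive residual in the cosine term, while the L2 term vanishes.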
The overall loss for the model combines the semantic loss, the hybrid autoencoder loss and the cyclic loss as

L = L_H + γ L_S + δ L_C,

where γ and δ are the weights of the semantic and cyclic losses, respectively.

Experiments
First, the datasets used to evaluate the proposed hybrid autoencoder framework are described. Next, some implementation details are discussed, and then a comparison of the model with other state-of-the-art methods for multimodal few-shot image classification is provided. Finally, the effectiveness of the inherent interpretability of the model is demonstrated.

Datasets
The model is evaluated on three datasets: Caltech-UCSD Birds-200-2011 (CUB) [36], SUN [37], and Animals with Attributes 2 (AWA2) [15]. CUB is an image dataset of 200 bird species and their attributes. The image features used were obtained from the final pooling layer of a ResNet-18, similar to [8]. In addition, to ensure that the support and test classes are disjoint from the classes in ResNet, the proposed training splits in [15] were used. In this split, |C_base| = 150 and |C_novel| = 50. Following few-shot learning methods, K ∈ {1, 2, 5, 10, 20} images of C_novel are placed in the support set, and N ∈ {5, 10, 20, 50}-way classification was performed. The image features and attributes are also provided with the dataset. SUN is a scene dataset with 717 classes split into |C_base| = 645 and |C_novel| = 72, with N ∈ {5, 10, 20, 50, 72} and K ∈ {1, 2, 5, 10}. Unfortunately, there are no results reported for few-shot learning on this dataset; instead it is used for zero-shot learning and generalized few-shot learning, which the model is not intended for. AWA2 is an animal dataset consisting of 50 classes split into |C_base| = 40 and |C_novel| = 10, with N ∈ {5, 10} and K ∈ {1, 2, 5, 10, 20}. Similar to SUN, there are no results reported for this dataset for multimodal few-shot classification.

Implementation details
Image features are embeddings of 512 dimensions obtained from an ImageNet-pretrained ResNet-18 from PyTorch. Semantic features are the class-level attributes provided with each dataset, whose values range from 0.0 to 1.0. A binary attribute vector is created by assigning an attribute the value 1 if its strength is greater than zero and 0 otherwise. For the encoders and decoders, the sizes of the hidden layers are 1560 and 1660, respectively. The size of the latent space z_v is 64, while the size of z_s follows the number of attributes in the dataset. The Adam optimizer is used with a learning rate of 0.00015. The classes of the test samples are predicted by training a simple single-layer linear classifier, with a cross-entropy loss, on the concatenated latent variable; its Adam optimizer has a learning rate of 0.001.
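Two small pieces of this pipeline can be sketched directly: the attribute binarization rule and the cross-entropy loss of the single-layer classifier (both written here as plain numpy stand-ins, not the paper's PyTorch code):

```python
import numpy as np

def binarize_attributes(strengths):
    # Class-level attribute strengths in [0, 1] -> binary vector:
    # 1 if the attribute value is greater than zero, 0 otherwise.
    return (np.asarray(strengths, dtype=float) > 0.0).astype(np.float32)

def softmax_cross_entropy(logits, labels):
    # Loss used to train the single-layer linear classifier on the
    # concatenated latent variable.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))

t = binarize_attributes([0.0, 0.27, 1.0, 0.03])
confident = softmax_cross_entropy(np.array([[10.0, 0.0]]), np.array([0]))
```

Note that even a very weak attribute strength (e.g. 0.03) maps to 1 under this rule; only exactly-zero strengths map to 0.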

Performance evaluation
First, the results are compared with [8], which has the best-performing 50-way classification accuracy on CUB in the multimodal few-shot learning scenario. Table 2 compares the performance for 50-way classification on CUB with [26] and [8], including Top-1, Top-3 and Top-5 accuracies. The proposed method outperforms [26] at 5 or more shots on all metrics. It also outperforms [8] at higher numbers of shots on all metrics. Note that both [26] and [8] embed attributes into a semantic space, unlike the proposed model, which uses raw textual attributes. The proposed framework performs better as the number of shots increases. The lower performance at fewer shots is believed to be due to the size of the latent variable relative to the amount of data, in this case the number of shots, available to train it. Further experiments that follow help substantiate this claim.
Table 3 compares with other multimodal models that report 5-way classification on CUB for 1 and 2 shots. For 5-way 1-shot, the best-performing model uses a combination of a VAE and a GAN, and the proposed model performs about average across all methods. For 5-way 5-shot, the model performs better than the rest. The lower performance at 1 shot is believed to be due to the increased size of the latent vector that results from using raw attributes, since for 1 shot a low-dimensional input could prove beneficial. Specifically, CUB has 312 attributes, and together with the 64-dimensional visual representation z_v, the total input dimension is 376. As noted earlier, the benefit of using the attributes directly is that it makes the model more interpretable and provides a means of learning the compositionality of an image. Tables 4 and 5 show the results of the proposed model on the SUN and AWA2 datasets. There are no comparisons with other models because these datasets are used for zero-shot or generalized few-shot learning, which are not considered here. When all test categories are used for classification, there is a continuous increase in accuracy as the number of shots increases for SUN. The same is true for AWA2, although at 1 shot the Top-5 accuracy already reaches 93%. These results provide a sanity check that the framework works for other datasets as well: using raw attributes, the model performs multimodal few-shot learning on these datasets, and its performance increases with the number of shots.
In addition to the results with a ResNet-18 backbone, results using a ResNet-101 backbone are also presented for all metrics. The results with the ResNet-18 backbone enable a direct comparison with [8] and [26], whose works are closest, as the same backbone is used. ResNet-101 shows the effect on the model of a directly comparable but stronger feature extractor. The results with the ResNet-101 backbone show that when a better feature extractor is used with the model, results improve significantly.

Compositionality
Compositionality refers to the idea of representing a whole through representations of its parts. Human knowledge representation is largely compositional and applies to spatial as well as temporal phenomena, e.g., a scene as composed of objects, an object as composed of parts, or an activity as composed of events. Here, the attributes are viewed as representing the composition of an image; in fact they represent more than that, since the attributes not only describe the parts that an object is made of but also their characteristics. Tokmakov et al. [7] describe a model that learns compositional representations for few-shot learning by disentangling the feature space of a CNN into subspaces corresponding to category-level attributes. The performance of the proposed model without attribute embedding is compared to [7] in Table 6 for 100-way classification on CUB. The authors also applied their compositional algorithm to two few-shot recognition methods (described in their supplementary material), viz., prototypical networks (PN) and matching networks (MN). As seen in the table, the proposed method is comparable for 1, 2 and 5 shots, and starts to perform better at 10 shots. The gap between the proposed method and the other models decreases as the number of shots increases, likely because more shots improve how well the model learns compositionality. The results of the proposed method with a ResNet-101 backbone are also presented; this improves the results by 3 to 9%, similar to the behavior observed in Sect. 5.3. A stronger feature extractor results in a better representation of the attributes for the model.

Interpretability
The proposed model is inherently interpretable because the probabilities of the attributes that contributed to a particular classification are readily available from the latent semantic representation z_s. To evaluate interpretability, the estimated probabilities of the attributes are compared to the ground-truth labels represented by the binary attribute vector. Figure 2 presents examples of images from the CUB dataset for which the Top-5 attributes identified by the model are shown along with the ground truth, in a 50-way setting. The left (images 1, 2 and 3) shows results obtained with a ResNet-18 backbone and the right (images 4, 5 and 6) results with a ResNet-101 backbone. In both setups, the model predicts the presence of attributes with high accuracy. For attributes whose prediction is wrong, potential reasons can be seen in the images. For example, among the Top-5 predicted attributes in image 1, the model detects the attribute "has_nape_color::white" with a confidence of 0.6376, although the ground truth indicates that the prediction is wrong. However, it can be observed in the image that the bird is surrounded by a white object near the nape area, which is likely the reason the model assigns a higher confidence to that attribute. The samples predicted with a ResNet-101 backbone show a significant improvement in attribute prediction when using a model with stronger representational power: the confidence score of each attribute rises to close to 1.
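Reading the Top-5 attributes off z_s amounts to a top-k selection over the per-attribute probabilities; a minimal sketch (attribute names and scores below are invented for illustration):

```python
import numpy as np

def top_k_attributes(z_s, names, k=5):
    # Rank attributes by their estimated probability in the semantic
    # latent z_s and return the k most confident ones with their scores.
    order = np.argsort(z_s)[::-1][:k]
    return [(names[i], float(z_s[i])) for i in order]

names = ["has_nape_color::white", "has_bill_shape::cone",
         "has_wing_color::black", "has_size::small",
         "has_tail_shape::notched", "has_breast_color::yellow"]
z_s = np.array([0.64, 0.10, 0.91, 0.85, 0.30, 0.72])  # toy probabilities
top5 = top_k_attributes(z_s, names, k=5)
```

No extra model or visualization pass is needed; the explanation is the sorted latent itself.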
From this, it can be inferred that the stronger the feature extractor, the higher the confidence, and the more interpretable the results. For a quantitative evaluation of interpretability, the L1 distance between the ground truth and the estimated probability score of each attribute is computed. Table 7 shows the L1 distance averaged across all attributes over the entire training and test datasets for the ResNet-18 and ResNet-101 backbones. The numbers in the table can be directly interpreted as the error per attribute. On both training and test data, the average distance over all shots is around 0.5 for the ResNet-18 backbone, which amounts to roughly one correct prediction for every other attribute. With the ResNet-101 backbone, the distance decreases to about 0.000250 on the training data and 0.2 on the test data, which amounts to about 1 prediction error in every 4000 attributes for the training data and 1 in 5 for the test data. Both results suggest that the model has learned the attributes from the data and can detect their presence. As with the attribute predictions shown above, a stronger feature extractor makes the model more interpretable.
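The quantitative metric above is simply the mean absolute difference between predicted attribute probabilities and the binary ground truth; a sketch:

```python
import numpy as np

def mean_attribute_l1(p, t):
    # Average |t_ij - p(t_ij)| over all samples i and attributes j.
    # With binary targets this reads directly as the error per attribute.
    return float(np.mean(np.abs(np.asarray(t) - np.asarray(p))))

t = np.array([[1.0, 0.0, 1.0, 0.0]])
perfect = mean_attribute_l1(t, t)                   # exact predictions
chance = mean_attribute_l1(np.full_like(t, 0.5), t)  # uninformative scores
```

An uninformative predictor that outputs 0.5 everywhere scores exactly 0.5 on this metric, which puts the ResNet-18 numbers in Table 7 near the chance level while the ResNet-101 numbers sit well below it.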

Effect of number of attributes
In this section, the effect of the number of attributes on the results is analyzed; in other words, should all the available attributes be used, thereby increasing the size of the concatenated latent representation z_c?
The number of attributes is increased by picking the first n% of the attributes in the order given by the dataset. For example, for a dataset with 100 attributes, if 10% is chosen, only the first 10 attributes are used. Figure 3 shows the accuracy for 1, 2, 5, 10 and 20 shots as the percentage of attributes is increased from 10 to 100%. For 1 and 2 shots, the accuracy improves by about 5 to 10% as the number of attributes increases. For 5, 10 and 20 shots, the accuracy decreases by about 2 to 5%. This phenomenon is believed to be caused by the size of the image features. With a ResNet-18 backbone, the extracted image features have a size of 512. When the number of shots is low, each additional sample helps improve the mapping from the image features to the (n% + 64)-sized z_c and back to the reconstructed features; as samples are limited, each one improves the mapping. However, as the number of shots increases, the model has to learn from more samples, but still too few to learn a proper mapping between the different latent spaces. In Sect. 5.3, the lower performance was attributed to the size of the latent vector. The results for higher numbers of shots mirror this: as the percentage of attributes decreases, the mapping becomes easier for the model because the latent space shrinks. The same cannot be said for lower numbers of shots, where, due to the smaller number of samples, any additional information provided to the model improves the results. Reducing the size of the latent space is still believed to yield better performance, though not when binary attribute values are used.
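The attribute-subsetting rule described above can be sketched as a simple column slice over the attribute matrix:

```python
import numpy as np

def first_percent_attributes(t, percent):
    # Keep only the first n% of attribute columns, in the order given
    # by the dataset: e.g. 10% of 100 attributes -> the first 10.
    n = int(t.shape[1] * percent / 100)
    return t[:, :n]

t = np.arange(200).reshape(2, 100)   # 2 samples, 100 toy attributes
t10 = first_percent_attributes(t, 10)
```

The resulting semantic latent z_s then shrinks with the percentage, which is what varies the size of z_c in the experiment of Figure 3.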

Conclusion
In this work, a multimodal few-shot learning method that uses image attributes directly, without the need for an embedding space, is proposed. Embedded attributes prevent a model from being interpretable, while raw attributes also help determine the composition of an image. An inherently interpretable model is proposed using a hybrid autoencoder that combines a basic autoencoder and a variational autoencoder with a semantic loss and a cyclic consistency loss. The method outperforms existing methods at higher numbers of ways and shots on the CUB dataset, with comparable results at fewer ways. The interpretability of the model is evaluated by comparing the predicted attribute scores with the ground-truth attribute labels, and it is shown that with a stronger feature extractor the model becomes even more interpretable.

Fig. 3 Top-1 accuracy for different shots as number of attributes increases for CUB data set

Table 1
Comparing accuracy when cyclic consistency is applied with respect to z_v (hard constraint) and to μ_v (soft constraint) in 50-way classification on CUB. Bold values represent the best-performing scores in the individual categories.

Table 2
50-way classification accuracy on CUB. Bold values represent the best-performing scores in the individual categories.

Table 3
5-way classification accuracy (Top-1) on CUB

Table 4
72-way classification accuracy on SUN of proposed model

Table 5
10-way classification accuracy on AWA2 of proposed model

Table 6
100-way classification Top-5 accuracy on CUB

Fig. 2 Predicted probability of attributes versus ground truth

Table 7
L1 distance of predicted attribute score to ground truth labels for 50-way CUB