Online social image ranking in diversified preferences

Due to the prevalence of social media services, effective and efficient online image retrieval is urgently needed to satisfy the diversified requirements of Web users. Previous studies mainly focus on bridging the semantic gap through well-established content modeling with semantic information and social tagging information, but they are not flexible in aggregating the diversified expectations of online users. In this paper, we present OSIR, a solution framework that supports diversified preference styles in online social media image search with textual query inputs. First, we propose an efficient Online Multiple Kernel Ranking (OMKR) model, which is constructed on multiple query dimensions and complementary feature channels and trained by minimizing the triplet loss on hard negative samples. By optimizing the ranking performance with multi-dimensional queries, the semantic consistency between the image ranking and the textual query input is directly maximized without relying on an intermediate semantic annotation procedure. Second, we construct random walk-based preference models using domain-specific similarity calculation on heterogeneous social attributes. By re-ranking the output of OMKR with each preference ranking model, we obtain a set of ranking lists encoding different potential aspects of user preference. Last, we propose an effective and efficient position-sensitive rank aggregation approach to aggregate the multiple ranking results based on the user preference specification. Extensive experiments on two social media datasets demonstrate the advantages of our approach in both retrieval performance and user experience.


Introduction
Multimedia content search in the Web space is a very challenging task. The prevalence of social media services makes this task even harder due to diversified user preferences and heterogeneous user behaviors. Online users usually present themselves by transmitting online multimedia to their social circles and contributing user-generated content ad hoc with mobile devices. For example, users share interesting photos with rich tags and comments with friends; they show where they are or what they are doing at the moment with pictures and the corresponding location information; they attach tags and comments to certain images to express their feelings about the content; and they categorize their favorite images into online albums. Consequently, online social multimedia documents, especially the huge volume of images, are associated with a large amount of meta-information and social user-related attributes, e.g., location, upload time, user, and community. Although the huge number of social images provides an opportunity to develop models for social image retrieval, most existing works only learn models that capture the preference of the whole user community rather than that of a single user or a small group of users. As a consequence, identical results tend to be returned to all online users for a given query input, which is less desirable. Therefore, effective methods are required to meet the diversified preference styles within the user community.
To address this practical problem, a possible paradigm is to construct real-world image retrieval methods by content-based visual analysis [1] and semantic-based analysis [2] with the content information and co-occurring semantic information (e.g., labels and tags). For content-based analysis, retrieval models are constructed on local (e.g., Bag-of-Visual-Words) and global (e.g., Gist and edge histogram) visual feature representations and on state-of-the-art deep convolutional neural network (CNN) features. Accordingly, models for visual content hashing, indexing, and similarity learning have been deeply investigated. For semantic-based analysis, the images and the queries (visual or textual) are projected into a multi-dimensional semantic space, and the similarities between the queries and database images are calculated in that space. However, for social media images with heterogeneous information beyond the visual content and semantics, the existing content-based and semantic-based approaches are not flexible enough to satisfy the user's true needs as reflected in user preferences. For example, given the query "sunflower," some users may prefer sunflowers photographed in the wild, while others may prefer sunflowers photographed in a greenhouse. If images from different environments appear in the top ranked list, it would be better for their relative positions in the rank list to differ for different users. Towards this objective, the technical challenges and our proposals are as follows.
First, existing approaches bridge the semantic gap by visual modeling or semantic annotation, but their learning procedures do not maximize a criterion directly related to the final retrieval performance. Instead, they maximize alternative criteria such as the annotation performance or the descriptive power of visual features. In practice, however, user queries are highly diversified. Although the difference in criteria can be compensated for by bridging the intention gap [3] or by user interaction [4][5][6], the query-independent approaches do not directly fit the user needs expressed in the queries.
In this work, we introduce a method called Online Multiple Kernel Ranking (OMKR) to directly learn the image retrieval model without relying on semantic annotation. Our model adopts a learning criterion [7] related to the final retrieval performance based on discriminative learning. It takes as input a set of training queries and a set of ranked online social media images, and outputs a trained model that achieves high ranking performance on new queries. By combining multiple visual features and exploring the correlation among different query words, our model generalizes better. OMKR also features an efficient online optimization procedure that builds upon the online multiple kernel learning framework [8] and therefore permits learning over large training data.
Second, when users search for online images, the queries are words expressing what the users want to find. In most situations, however, the user preference is not expressed in a query of several words. Instead, different users tend to give the same query words on a certain topic, but their intrinsic expectations of the returned images may differ from person to person. For example, with the query "car," Alex would like to search for car listing information specifically available in London since he lives in the city; Tony would expect to find car images that receive the most positive comments or car review reports, as he will buy a car very soon; and Anya would expect to find specific car images shared in certain groups, as she is considering joining online user groups with similar interests (vintage cars or refitted vehicles). Such diversified preference styles are usually unavailable in practice because it is hard to require users to be professional and precise in describing what they truly demand. Fortunately, we can infer user preference by exploiting related users and the rich context information of social media Websites, e.g., the temporal adjacency, the location affinity, the gallery information, the associated user groups, and the positive/negative comments. According to the study of McAuley et al. [9], such social network meta-data provide an informative signal for certain image categories, and promising performance has been achieved on image labeling and tag prediction tasks even when the visual features are not employed. In this paper, we call the associated meta-data the social attributes of online images. We construct random walk models on each social attribute and re-rank the results of OMKR according to the potential preference expressed in each social attribute.
Moreover, as each ranking metric captures only some aspect of the consistency with respect to a certain social attribute, it is beneficial to combine different ranking metrics to accurately identify what a user really needs. The problem of rank aggregation, or preference aggregation, has been extensively studied in social choice theory [10]. We propose an order-based technique with a weighted position-sensitive measurement. Compared with traditional rank aggregation models, our model achieves better rank aggregation results at low computational cost. Consequently, a set of ranking results that encode both the semantic ranking and the potential preference is obtained based on the user preference specification.
To summarize, in this paper, we study the problem of social image retrieval satisfying both the semantic consistency and diversified user preference styles. To solve the above challenges, we present online social image ranking (OSIR), a direct solution framework for social media image retrieval in diversified preference styles. The model is flexible in utilizing state-of-the-art visual and textual features such as word embedding and the multiple feature layers of a deep CNN.
Our approach produces the final ranking of the retrieved social media images in a way similar to the preparation of a cocktail, an alcoholic mixed drink that contains two or more ingredients, to fit the diversified user preference styles. The key contributions can be summarized as follows:
(1) The Online Multiple Kernel Ranking maximizes the semantic consistency between the top ranked images and the multi-dimensional textual queries by combining complementary visual features and minimizing the hard negative-based triplet loss, similar to [11]. The learning procedure converges quickly, after receiving fewer than 30 thousand triplets.
(2) The Rank Aggregation appropriately aggregates various social attribute correlations among different images. By modeling the relative importance of each top ranked image, a unified ranking that better fits user preference is obtained.
(3) Experiments on real social image retrieval demonstrate that OSIR outperforms the state of the art. Besides, the subjective study shows that the aggregate ranking satisfies the user preference beyond semantic consistency.
Roadmap. Section 2 provides a brief literature review. Section 3 gives the framework overview. Section 4 introduces the Online Multiple Kernel Ranking. Section 5 describes the preference modeling. Section 6 presents the position-sensitive rank aggregation. Section 7 provides the experimental details and discussion. Section 8 concludes the paper.

Related work
Great effort has been dedicated to modeling different aspects of online multimedia retrieval and different user browsing behaviors for visual content retrieval. In this paper, we provide a brief literature review from the following aspects.

Visual-semantic retrieval
For decades, image retrieval has been a core research problem in the multimedia research community. Research efforts have been made to bridge the semantic gap between the user queries and the multi-dimensional content representation [1]. For example, Grangier and Bengio [7] proposed a discriminative kernel-based ranking approach for image retrieval by textual queries. Rasiwasia et al. [2] constructed a unified semantic space for cross-modal data, in which documents can be retrieved by queries from other modalities. Following this idea, correlation learning from multiple modalities has been comprehensively studied [12,13]. Zhang et al. [3] proposed an attribute-augmented semantic hierarchy for content-based image retrieval. Since then, fusing complementary information for visual content modeling has been a widely accepted paradigm to achieve better semantic consistency [14]. Li et al. [15] proposed a deep collaborative embedding method for social image tagging, tag-based image retrieval, and content-based image retrieval.
Due to the success of deep neural networks, deep models have been employed in image retrieval tasks. For example, Gordo et al. [16] proposed an end-to-end trainable deep convolutional neural network model for image retrieval, which has been treated as a standard CNN-based pipeline. Numerous works in recent years address the difference between the visual and textual modalities. Ma et al. [17] proposed a multi-modal convolutional neural network that explicitly captures and aggregates the multi-level visual-textual component correlation for measuring the visual-textual correlation. Similarly, Lu et al. [18] proposed a hierarchical co-attention model to adaptively learn the visual-textual component correlation for the visual question answering (VQA) task. Recently, given the success of large-scale pretrained models (e.g., the BERT model [19]) and the self-supervised learning paradigm, numerous joint visual-textual deep representation models have been proposed. For example, Lu et al. [18] proposed a pretrained task-agnostic model for visiolinguistic representation, which can be used as the backbone network for various vision-language tasks such as image-sentence retrieval, image captioning, and VQA. However, these models are developed by assuming the textual modality to be a sentence, which differs slightly from our setting, where the query is assumed to be several tags.
As an effective and efficient solution framework for image retrieval and text-to-image retrieval, online similarity learning [20][21][22][23][24][25] has been studied extensively. Specifically, Chechik et al. [20] developed a large-scale online asymmetric similarity learning method from ranking. Xia et al. [22] developed a multiple kernel similarity learning method for visual search. An online multi-modal distance learning method has also been proposed in [23]. As a recent achievement, Wu et al. proposed an online asymmetric similarity learning method for text-image retrieval that aggregates the visual features of different CNN layers. An online low-rank similarity learning method was also proposed in [25] to obtain a low-rank similarity parameter matrix for measuring the similarity between image and text. Despite the effectiveness of using hard negatives for retrieval model learning [11], they have not been considered in the context of online learning for text-to-image retrieval. Based on active learning and relevance feedback, the intention gap can be effectively reduced [3,26]. Fan et al. [5] proposed a personalized image recommendation approach via exploratory search modeling. Tian et al. [4] proposed an active re-ranking approach. Zhang et al. [27] proposed an active learning method for image classification, which indicates whether an image should be labeled by states in the generative adversarial network. On mobile platforms, Wang et al. [28] proposed an interactive mobile visual search with multi-modal queries. However, these methods either assume that the active learning process contains less user preference or assume that the user preference is obtained by interactions. In contrast, our approach directly fits images into the query space, and the intention gap problem is naturally avoided.
Enforcing diversity in retrieved content has become an important research issue in recent years [29][30][31]. Generally, from the visual content perspective, the top ranked retrieved images are expected to be as diversified as possible so as to deliver richer content information under the same semantics. For example, Ionescu et al. [29] proposed to enhance the diversity of the social image dataset by multiple technical treatments, e.g., machine analysis, human-based computation, or hybrid approaches. Semantic relevance and diversity, nevertheless, are considered somewhat contradictory in existing solutions. A supervised relevance scoring approach was proposed in [31] to re-rank social images by optimizing a utility function that jointly considers the two issues, so that a better trade-off between relevance and diversity can be achieved. Wu et al. studied how diversity affects user satisfaction in image search [30]. Specifically, when users want to collect information or save images for further usage, more diversified result lists lead to higher satisfaction levels. These insights may help to design better result ranking strategies and evaluation metrics. Besides, diversity is also enforced in other applications such as image recommendation [32], movie recommendation [33], and general-purpose recommendation tasks [34]. Similar to the retrieval task, it has been shown that more diversity brings users a better experience.

Modeling social context
Online multimedia documents are believed to be correlated with each other in different aspects, and such context information is delivered by their meta-data. The context and correlation usually have strong relevance to their semantics. McAuley et al. [9] showed that image labeling with social media meta-data alone performed as well as, or even better than, visual content modeling methods.
In existing studies, the knowledge discovered from context has been employed in many recommendation tasks. For example, as a task similar to retrieval, friend suggestion/recommendation aims to recommend friends to users according to the similarity between friend candidates and the targeted user. The user similarity can be measured by joint content and context analysis [35]. The techniques can also be used for other new tasks. Based on the photographing behavior of the user crowd, Yin et al. [36] developed a socialized mobile photography model to suggest the optimal view enclosure (composition) and appropriate camera parameters by comparing the visual similarity of the query scene and a social image database with diversified photographing styles. Heterogeneous user behaviors can be modeled through the social context of online social media, and the multi-aspect behavior similarities can be effectively combined by multiple kernel learning for friend recommendation, advertisement, and people search [37]. Our approach captures the potential preference styles from heterogeneous social attributes. Consequently, the user expectation of the retrieval results can be conveniently expressed by weight specification.

Ranking aggregation and refinement
Rank aggregation [38] has been recognized as a key technology for Web-based applications. The need to meaningfully aggregate preference rankings into a joint ranking has been deeply investigated to provide information fusion from multiple sources and diversified social choices. Rank aggregation is especially useful in crowdsourcing [39], where different users/annotators produce ranking lists with diversified results. From a methodology perspective, Prati [40] proposed to combine feature ranking algorithms through rank aggregation. Ding et al. [41] proposed a hierarchical ranking aggregation method. An iterative ranking aggregation method was proposed in [42] using quality improvement of subgroup ranking. Liang et al. [43] proposed a manifold learning method for rank aggregation.
In multimedia research domain, Tian et al. [44] proposed a ranking SVM-based approach to identify the best ranking from a number of candidate ranking lists for image re-ranking. Yeh et al. [6] developed a personalized photograph ranking framework with various visual aspects. Zha et al. [45] constructed a probabilistic model for product ranking with hundreds of aspects. Motivated by social choice theory, a supervised Kemeny rank aggregation was proposed to aggregate multiple rankings with different credibilities [46]. Dalal et al. [47] developed a globally consistent multi-objective ranking based on Hodge decomposition. Klementiev et al. [48] proposed a probabilistic distance-based model. Our rank aggregation approach considers the relative importance of the position of a document which appears in a rank list, while existing approaches usually treat the rank of each document without discrimination.
Rank aggregation has also been used in other research topics such as person re-identification [49] and POI ranking in spatial-temporal data mining [50]. However, the time complexity of existing rank aggregation methods is generally prohibitive, which hinders their application to a wider range of scenarios.

Method
The aim of OSIR is to provide aggregate image ranking results given user query inputs. The ranking is expected to achieve better consistency with both the semantics and the preference styles of the users. The framework is illustrated in Fig. 1.
Fig. 1 The framework of OSIR. The blue arrows denote the data flow and their ranking results. The green arrows represent the support from the database and algorithm. Given a user query and preference weight specification, first, the ranking list with the semantic ranking function is learned by our OMKR model. Then, the semantic ranking is fed into the preference modeling, and we obtain a set of preference ranking lists by preference re-ranking. These ranking lists, including the semantic ranking results, are aggregated with our position-sensitive rank aggregation technique. The ranking results that aggregate both semantics and preference information are finally returned to the users
OSIR is essentially composed of the following key steps:
Online Multiple Kernel Ranking. We propose an Online Multiple Kernel Ranking (OMKR) approach that minimizes the hard negative-based triplet loss to rank the images according to their semantic consistency with the multi-dimensional textual query input. Compared with the existing approaches, our model directly fits the images into the query space. Better semantic consistency is achieved by combining complementary visual features. We design an online learning procedure that quickly optimizes the ranking model with a large number of training triplets, where the negative sample in each triplet is selected from those most similar to its positive counterpart; the model converges after receiving fewer than 30 thousand triplets. Consequently, we learn a set of semantically coherent projections that map each image into a low-dimensional semantic space, where the relevance between the queries and the database images can be directly calculated by inner product.
Preference modeling. We construct random walk models on each social attribute of social media images. By using domain-specific knowledge, the social attribute correlations among different images are properly measured. We re-rank the semantic ranking respectively based on each of the preference models. Thus, a set of ranking lists encoding different potential aspects of user preference can be obtained.
Position-sensitive rank aggregation. Based on social choice theory [10], we propose a position-sensitive rank aggregation model to measure the relative importance of the top ranked results given the user preference specification. By aggregating the semantic ranking and the preference ranking results, a unified ranking is obtained that achieves better consistency with both the semantics and the preference styles of the users.

Ranking model
An image database is represented as $D \in \mathbb{R}^{N \times V}$, where $N$ denotes the number of images and $V$ denotes the number of feature dimensions for each image. We denote each image as $d \in \mathbb{R}^{V}$, which represents a row of $D$. We represent the textual query as an $M$-dimensional real-valued vector $q = (q_1, ..., q_M) \in \mathbb{R}^{M}$, where there may be multiple non-zero entries for a multi-word query input. The score function $F_w(q, d)$ of an image $d$ from $D$ can be written as:

$$F_w(q, d) = \sum_{m=1}^{M} q_m \, w_m^T d. \tag{1}$$

For each query $q$, suppose we have collected the ranking information (relevant or irrelevant) of the images in the database $D$. In this paper, the queries are assumed to come from a closed set, i.e., the number of query words is fixed. It is possible to extend the model to process queries that are semantically unrelated to the training set. For example, we may resort to latent topic modeling methods, which use linear/deep mapping functions to process the BOW features of the queries and derive a latent representation; the rank model can then be constructed at the topic level instead of the word level. To deal with unseen query words, we may also use more recent methods such as word embeddings to produce an aggregated multi-dimensional query representation.
Another important issue is the relevance/irrelevance score used in this paper. In general, an image is considered relevant if it contains visual content describing at least one query word. Extending the relevance score to the multi-level case would require other ranking loss functions, such as the list-wise ranking loss.
Based on the above definition, we organize the data into a training triplet set $D_{tr}$, where each triplet is represented as $(q, d^+, d^-) \in D_{tr}$. Learning the ranking function is equivalent to minimizing the primal ranking SVM (RSVM) objective function of [7] (Eq. 2), where $w = (w_1^T, ..., w_M^T)$ denotes the concatenated discriminative model parameter vector. We can introduce any kernel function $\kappa : X \times X \to \mathbb{R}$ for calculating the similarity among images in a high-dimensional space. Consequently, the discriminative functions $f_m$, $m = 1, ..., M$, and the score function $F(q, d)$ can be given a kernel representation (Eq. 3). When the similarity among images is represented by multiple kernels $\kappa_g$, $g = 1, ..., G$, the discriminative function and score function are formulated according to the representer theorem [51] (Eq. 4). By introducing the Lagrangian and the Karush-Kuhn-Tucker (KKT) conditions, we obtain the dual problem (Eq. 5), where $\alpha \in \mathbb{R}^{M \times |D_{tr}|}$, $\beta \in \mathbb{R}^{M \times G}$, and, for each pair $(m, g)$, a positive semi-definite matrix in $\mathbb{R}^{|D_{tr}| \times |D_{tr}|}$ is defined (Eq. 6).
To learn the ranking model effectively, a large amount of training data should be involved. The number of training triplets is approximately $O(|T||D|^2)$, where $|T|$ denotes the number of textual queries for training and $|D|$ denotes the number of images in the database. Consequently, optimizing the dual problem in Eq. 5 requires a prohibitive amount of memory to load and maintain all of these matrices. To handle big data efficiently, we propose an online optimization procedure to optimize the multiple kernel ranking models, which is introduced below.
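As an illustration of the multiple kernel score function described above, the following minimal sketch evaluates $F(q, d)$ as a query-weighted sum of kernel classifiers and computes a hinge loss on one triplet. It is not the authors' implementation: the kernel expansion over stored triplets, the unit margin, the two example kernels, and all names (`score`, `triplet_hinge_loss`, `alpha`, `beta`) are assumptions made for illustration.

```python
import math

def linear_kernel(x, y):
    """Inner product between two feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian kernel; gamma is an illustrative bandwidth."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def score(q, d, support, alpha, beta, kernels):
    """F(q, d) = sum_m q_m * sum_g beta[m][g] * f_mg(d)  (assumed form).

    support : list of stored triplets (q_j, d_pos_j, d_neg_j)
    alpha   : alpha[m][j], coefficient of triplet j for query dimension m
    beta    : beta[m][g], combination weight of kernel g for dimension m
    kernels : list of G kernel functions
    """
    total = 0.0
    for m, q_m in enumerate(q):
        if q_m == 0:
            continue  # zero query dimensions contribute nothing
        for g, kernel in enumerate(kernels):
            f_mg = sum(
                alpha[m][j] * q_j[m] * (kernel(d_pos, d) - kernel(d_neg, d))
                for j, (q_j, d_pos, d_neg) in enumerate(support)
            )
            total += q_m * beta[m][g] * f_mg
    return total

def triplet_hinge_loss(q, d_pos, d_neg, support, alpha, beta, kernels):
    """Hinge loss with unit margin (assumed form of the ranking loss)."""
    margin = (score(q, d_pos, support, alpha, beta, kernels)
              - score(q, d_neg, support, alpha, beta, kernels))
    return max(0.0, 1.0 - margin)

# Toy model: two query words, two kernels, one stored triplet.
support = [([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])]
alpha = [[1.0], [1.0]]
beta = [[0.5, 0.5], [0.5, 0.5]]
kernels = [linear_kernel, rbf_kernel]

query = [1.0, 0.0]
s_pos = score(query, [1.0, 0.0], support, alpha, beta, kernels)
s_neg = score(query, [0.0, 1.0], support, alpha, beta, kernels)
print(s_pos > s_neg)
```

Storing the model as a list of triplets makes the memory issue visible: every score evaluation touches all stored triplets for every kernel, which is why an online procedure that keeps only a limited set of support triplets is attractive.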

Hard negative-based online learning
The value of hard negatives in learning machines was studied in depth in [52], where the samples in the negative class that are most similar to the single positive sample, given the classification hyperplane, are considered useful and informative. For many fundamental vision and multimedia tasks, for example, object detection [53] and image-text retrieval [11], mining the hard negatives during training significantly boosts the learning performance of both shallow models [54] and deep models [11,53].
In this paper, we aim to improve online learning by hard negative mining. Specifically, let us denote the hard negative as $\tilde{d}^-$; the training triplet then becomes $(q, d^+, \tilde{d}^-)$. One can easily substitute the hard negative triplet into the original objective functions in Eqs. 2 to 5. However, how to quickly select the hard negative samples for training remains a technical challenge. Specifically, hard negative samples are originally defined as those samples within a small circular area centered on the query point. In our model, however, the query sample and the database samples are heterogeneous, which makes it hard to directly identify the hard negatives given a query input. Besides, according to the study in [11], the hard negative samples need not be identified very accurately; doing so would lead to prohibitive computational cost.
To deal with this issue, considering the kernelized formulation in Eq. 5 and the triangle inequality, we design a hard negative search method based on kernelized locality-sensitive hashing (KLSH) [55], which is an approximate nearest neighbor search method in kernel space. We build one KLSH for each feature channel, where each image is encoded into an $R$-bit binary code. For $M$ feature channels, each image is represented as an $MR$-bit code sequence with respect to the $M$ features/kernels. To guarantee the recall rate, we can also build more than one hash table for the training image data, e.g., 3 hash tables. Based on the KLSH system, we perform hard negative mining for generating training triplets as follows. First, given a textual query $q$, we identify the positive sample $d^+$, which is provided in the training dataset. Then, we treat $d^+$ as a query to the KLSH system, and the samples with the same hash codes as the query are returned as the candidate hard negative samples. We remove from the candidate set those images that share at least one class label with $d^+$; the remaining samples, which share no labels with $d^+$ but are highly similar to it, are treated as the hard negatives. Based on this scheme, we can quickly identify a set of hard negative-based training triplets given a query and a positive sample.
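The mining scheme above can be sketched as follows. This is only an illustration under stated assumptions: plain random-hyperplane hashing on raw feature vectors stands in for KLSH (which hashes in kernel space), a single feature channel is used, and the function names (`make_hasher`, `build_tables`, `hard_negatives`) and toy data are hypothetical.

```python
import random

def make_hasher(dim, n_bits, seed):
    """Random-hyperplane hash: one sign bit per random direction.
    (A stand-in for KLSH, which would hash in kernel space.)"""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def hash_code(x):
        return tuple(
            int(sum(p_i * x_i for p_i, x_i in zip(plane, x)) >= 0.0)
            for plane in planes
        )
    return hash_code

def build_tables(features, n_bits=8, n_tables=3):
    """Index all images in several independent hash tables to improve recall."""
    dim = len(features[0])
    tables = []
    for t in range(n_tables):
        hasher = make_hasher(dim, n_bits, seed=t)
        buckets = {}
        for idx, x in enumerate(features):
            buckets.setdefault(hasher(x), []).append(idx)
        tables.append((hasher, buckets))
    return tables

def hard_negatives(pos_idx, features, labels, tables):
    """Images colliding with the positive sample that share no label with it."""
    candidates = set()
    for hasher, buckets in tables:
        candidates.update(buckets.get(hasher(features[pos_idx]), []))
    candidates.discard(pos_idx)
    pos_labels = labels[pos_idx]
    return sorted(i for i in candidates if not (labels[i] & pos_labels))

# Toy data: images 0, 1, 2 point in the same direction; image 3 is opposite.
features = [[1.0, 0.2], [2.0, 0.4], [4.0, 0.8], [-1.0, -0.2]]
labels = [{"sunflower"}, {"car"}, {"sunflower", "field"}, {"dog"}]
tables = build_tables(features)
negatives = hard_negatives(0, features, labels, tables)
print(negatives)
```

In this toy example, image 2 collides with the positive sample but is discarded because it shares a label, while image 3 never collides because its hash bits are all flipped. The lookup cost depends on bucket sizes rather than on the database size, which is the property the scheme relies on.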

Online model learning
The proposed Online Multiple Kernel Ranking (OMKR) algorithm is based on the fusion of two online learning methods: the Perceptron algorithm [56] and the Hedge algorithm [57]. Particularly, for each kernel and each textual query dimension, the Perceptron algorithm is employed to learn a kernel-based classifier with some selected kernel, and the Hedge algorithm is used to update their combination weights.
In this framework, we use $\theta^t_{mg}$ to denote the combination weight for the $g$th kernel classifier of the $m$th query dimension at round $t$, which is initially set to 1. In each learning round, we update the weight $\theta^t_{mg}$ by following the boosting-style Hedge algorithm, where each discriminative function can be treated as a weak learner. The weight update rule can be formulated as:

$$\theta^{t+1}_{mg} = \theta^{t}_{mg} \, \sigma^{z^t_{mg}},$$

where $\sigma \in (0, 1)$ is a discount weight parameter employed to penalize a kernel classifier that makes an incorrect prediction at a learning step, and $z^t_{mg}$ indicates whether the $g$th kernel classifier of the $m$th query dimension makes a mistake on the training triplet $(q_j, d_j^+, \tilde{d}_j^-)$, namely, whether $q_{mj}[f_{mg}(d_j^+) - f_{mg}(\tilde{d}_j^-)] \le 0$. When the $t$th training triplet is incorrectly predicted on the $m$th query dimension and the $g$th kernel, the corresponding discriminative sub-model $f_{mg}$ is updated with that triplet in the Perceptron manner.
The main procedure of the optimization process is summarized in Algorithm 1. A support vector shrinking process is also performed every $T_b$ iterations to safely remove the training triplets with very high scores under the current model, i.e., the false positive support vectors, to enhance the efficiency of the learned model. Our model is similar to boosting models, where the $g$th kernel classifier on the $m$th query dimension can be seen as a "weak learner." A weak learner selection procedure is performed, which identifies the set of relevant discriminative weak learners with respect to the non-zero dimensions of the multi-dimensional query. The weight of each weak learner is updated according to its performance on the training triplet. The expected complexity of a model update when receiving one training triplet is $O(2\bar{M} G C_\kappa)$, where $\bar{M}$ denotes the average number of non-zero dimensions in the query set. For a single-word query input, the per-step complexity reduces to $O(2 G C_\kappa)$. Under the non-hard negative setting, we provide Theorem 1 to estimate the error bound of our model.
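A single online round of the procedure described above can be sketched as below. This is a simplified illustration, not Algorithm 1 itself: the Perceptron step uses a unit step size, the support vector shrinking process is omitted, and the names `WeakLearner`, `omkr_step`, and `score` are hypothetical.

```python
import math

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

class WeakLearner:
    """Kernel classifier f_mg stored as a list of support triplets."""
    def __init__(self, kernel):
        self.kernel = kernel
        self.support = []  # entries: (coefficient, d_pos, d_neg)

    def __call__(self, d):
        return sum(c * (self.kernel(d_pos, d) - self.kernel(d_neg, d))
                   for c, d_pos, d_neg in self.support)

def omkr_step(q, d_pos, d_neg, learners, theta, sigma=0.5):
    """Process one triplet; returns the number of weak learners that erred."""
    mistakes = 0
    for m, q_m in enumerate(q):
        if q_m == 0:
            continue  # weak learner selection: only non-zero query dimensions
        for g, f_mg in enumerate(learners[m]):
            if q_m * (f_mg(d_pos) - f_mg(d_neg)) <= 0:
                theta[m][g] *= sigma  # Hedge: discount the erring learner
                # Perceptron: add the triplet as a support vector (unit step, assumed)
                f_mg.support.append((q_m, d_pos, d_neg))
                mistakes += 1
    return mistakes

def score(q, d, learners, theta):
    """Weighted combination of the weak learners on the active dimensions."""
    return sum(q_m * theta[m][g] * f_mg(d)
               for m, q_m in enumerate(q) if q_m != 0
               for g, f_mg in enumerate(learners[m]))

# Two query dimensions, two kernels per dimension, all weights start at 1.
kernels = [linear_kernel, rbf_kernel]
learners = [[WeakLearner(k) for k in kernels] for _ in range(2)]
theta = [[1.0, 1.0], [1.0, 1.0]]

q, d_pos, d_neg = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
first = omkr_step(q, d_pos, d_neg, learners, theta)
second = omkr_step(q, d_pos, d_neg, learners, theta)
print(first, second, theta)
```

Each round evaluates every active weak learner on the positive and the negative image, which matches the factor of two in the stated per-step complexity.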

Theorem 1 After receiving a sequence of $T$ training triplets, denoted by $(q_t, d_t^+, d_t^-)$, $t = 1, ..., T$, the number of mistakes made by running Algorithm 1 is bounded by a quantity determined by $H_{mg}$, the structured loss on each individual classifier $f_{mg}$.
As indicated by [8], the proof essentially combines the proofs of the Perceptron [56] and the Hedge algorithm [57]; the details are omitted. Theorem 1 indicates that the error bound of the discriminative function is substantially determined by the error of the best weak learner. The error bound can be further improved in two ways. First, it can be improved by further tuning the step size or the margin. Second, it can be improved by applying the hard negative-based training scheme, since the error of the best weak learner can be further reduced by minimizing the hard negative learning objective function. For large-scale applications, our proposed OMKR model needs to traverse the training data only once or a very limited number of times. Consequently, given a textual query $q$, we obtain a rank list $\tau_0$ from the image database, which reflects the semantic consistency between the query input and each image.

Discussion
Our model can be considered as a projection learning approach which learns an $M$-dimensional, semantically consistent, and query-dependent representation for each image. The similarity between a query and an image can then be computed directly by a simple inner product. We construct the projection function for each dimension by combining multiple visual features and exploring the correlation among different query dimensions. When the number of query dimensions is high, we can use latent topic models to learn a compact, lower-dimensional representation of each query, and then an OMKR with correspondingly fewer dimensions can be constructed on the latent topic representation of each query. Hence, the model complexity can be well controlled.
Our model can also be seen as a combination of the Perceptron [56] and boosting-like learning methodologies [57]. From the perspective of the Perceptron, the $g$th kernel of the $m$th query dimension from the $j$th training triplet (i.e., $q_{mj}\,(\kappa_g(d^+_j, d) - \kappa_g(d^-_j, d))$) can be seen as a "virtual" training sample that is used to minimize the structured loss. Such a "virtual" training sample is selected to update the model in a boosting-based learning style, where the step-size is determined by the prediction of the current weak learner.
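The projection view discussed above can be sketched as follows. This is only an illustration under stated assumptions: each sub-model $f_{mg}$ is taken to be a kernel expansion over stored support samples, the $M$-dimensional projection is the weighted sum of the sub-models over the feature channels, and the relevance score is the inner product with the query vector. The Gaussian kernel, the names `project` and `relevance`, and the toy data are hypothetical choices, since this section does not spell out the ranking function.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel between two feature vectors (illustrative choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def sub_model_score(support, feature, gamma):
    """f_mg(d): kernel expansion over (coefficient, support feature) pairs."""
    return sum(alpha * rbf_kernel(sv, feature, gamma) for alpha, sv in support)


def project(image_feats, models, theta, gammas):
    """Map an image to an M-dimensional vector, one entry per query dimension.

    image_feats[g] : feature vector of the image in channel g
    models[m][g]   : support list of sub-model f_mg
    theta[m][g]    : combination weight of f_mg
    """
    return [
        sum(
            theta[m][g] * sub_model_score(models[m][g], image_feats[g], gammas[g])
            for g in range(len(image_feats))
        )
        for m in range(len(models))
    ]


def relevance(query, projection):
    """Inner product between the sparse query vector and the projection."""
    return sum(q * p for q, p in zip(query, projection) if q != 0)


# Toy setting: M = 2 query dimensions, G = 1 feature channel.
models = [[[(1.0, [0.0, 0.0])]], [[(1.0, [1.0, 1.0])]]]
theta = [[1.0], [1.0]]
projection = project([[0.0, 0.0]], models, theta, gammas=[1.0])
score = relevance([1.0, 0.0], projection)
```

Because the query is sparse, only the sub-models of its non-zero dimensions contribute to the score.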

Preference modeling
When users are retrieving online images, their expectations with respect to the ranking results are diversified. Such potential interests can be exploited from the rich context information of social media Websites, e.g., the file upload time, the location of the image, the gallery information, the group of users that are interested in certain images, and the comments of each image. Based on the social attributes of the social media images, we construct correlation models to rank the social media images according to each social attribute.
For each image $d_j$, we have collected $R$ types of social attribute features, denoted by $s_j$. We define a social attribute relation matrix $M_k$, $k = 1, ..., R$, over each social attribute type, and construct $R$ random walk models, one on each relation matrix. The stable distribution $r_k$ is calculated by iterating the random walk update until $r_k$ converges. The stimulation vector $\rho^0_k$ can be specified by assigning larger weights to the dimensions of the top results of the semantic-based ranking by OMKR, or to other possible user interests, e.g., the historical retrieval records of the users or the popularity scores of the images. We can also set the images with the most view counts as the stimulation of the random walk models. In effect, the non-zero weights represent prior knowledge on the probability that those images are both semantically relevant and popular among the user community. Note that the social attributes of certain images may be missing: each image may have only a fraction of the social attributes, while the others are vacant or were unavailable during the collection stage. Therefore, the matrices $M_k$, $k = 1, ..., R$, are sparse, and some rows and columns of $M_k$ are zero, which makes the corresponding probabilities become 0. To avoid this, we assign non-zero weights to all the images in the stimulation vector $\rho^0_k$. When the preference-modeling weight satisfies $0 \le \mu \le 1$, each $r_k$ converges.
The rationale for constructing random walk models on social attributes is twofold. First, many studies indicate that information propagation in social media can be modeled by random walks [58]. Second, online users act similarly to other users with the same social behavior. Therefore, their inclinations can also be propagated along the social attribute correlations of the online documents. In this paper, we are primarily interested in the following social attributes:
Surrounding text. The surrounding text includes the photo's title and short description, which carry important indications of the semantic information. We measure the similarity of images $i$ and $j$ by the cosine similarity of the TF-IDF representations of their surrounding text. Alternatively, we can use word embeddings to derive a more effective description of the surrounding text. Specifically, we use GloVe [59] to represent each word as a vector, and apply simple average pooling over the whole text to derive the final surrounding text feature. The similarity between images $i$ and $j$ with respect to this feature is then calculated by a Gaussian kernel.
Location. The location information indicates where an image is taken. Intuitively, if the locations of two images are close enough, their contents may deliver consistent semantics, e.g., the same objects, the same buildings, or the same scenery. The location attribute may also reflect the geo-trend that can be used to detect the local interest and location-aware topics [60]. We use RBF kernel to measure the location relevance where the similarity of geographically adjacent images is higher.
Time. The upload time of images may indicate the temporal relation of images describing a hot social event. For example, images describing the American presidential election will be posted frequently by online users within a certain period. Moreover, users may want to retrieve images in a certain temporal range. We use an RBF kernel to measure the temporal relevance at several temporal resolutions, e.g., year, month, and day.
Group. Similar to categories, images are associated with groups, where each group is associated with the uploaders' description of the semantics [9]. We collect the group information of each image and denote images $i$ and $j$ as relevant ($M_k(i, j) = 1$) when their associated group information is identical.
Category. On many social media Websites, semantically related images are grouped into categories by online users. Each image may be categorized into multiple categories where each category has a unique category ID. We collect the category information of each image. We denote image i and j as relevant (M k (i, j) = 1) when at least one of their associated categories is identical.
User ID. The images uploaded by the same user ID may convey certain preference styles. For example, some users may be interested in the photo capturing style of specific online users. For image i and j uploaded by the same user ID, we denote them as relevant (M k (i, j) = 1) when constructing the social attribute relation matrix.
Based on the preference scores $[r_1, ..., r_R]$, we obtain a set of ranking lists $[\tau_1, ..., \tau_R]$ by injecting the semantic ranking results into the correlation modeling of social attributes. We will introduce how to effectively aggregate $\tau_0$ and $[\tau_1, ..., \tau_R]$ in a unified rank aggregation model in the subsequent section.
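As an illustration of the preference modeling above, the following sketch builds a binary relation matrix from category IDs and runs a random walk with a stimulation vector. It assumes the common restart form $r \leftarrow \mu\,T\,r + (1 - \mu)\,\rho^0$ with a column-normalized transition matrix $T$. The normalization, the exclusion of self-relations, the function names, and the toy data are assumptions made for this example rather than details confirmed by the text.

```python
def category_relation(categories):
    """M[i][j] = 1 when images i and j share at least one category ID.

    Self-relations (i == j) are left at 0, which is an assumption of this sketch.
    """
    n = len(categories)
    return [
        [1.0 if i != j and categories[i] & categories[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def column_normalize(matrix):
    """Scale each non-empty column to sum to 1; all-zero columns stay zero."""
    n = len(matrix)
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [
        [matrix[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def random_walk(transition, stimulus, mu=0.5, tol=1e-10, max_iter=1000):
    """Iterate r <- mu * T r + (1 - mu) * rho until r stops changing."""
    n = len(stimulus)
    r = list(stimulus)
    for _ in range(max_iter):
        nxt = [
            mu * sum(transition[i][j] * r[j] for j in range(n))
            + (1.0 - mu) * stimulus[i]
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r


# Image 3 has no category at all (a missing social attribute).
categories = [{1, 2}, {2}, {3}, set()]
# Every image gets a non-zero stimulus; image 0 is the top semantic result.
stimulus = [0.4, 0.2, 0.2, 0.2]
scores = random_walk(column_normalize(category_relation(categories)), stimulus)
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
```

Images without any relation keep a score proportional to their stimulus, which is why a non-zero stimulus for every image prevents them from dropping to zero.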

Rank aggregation
As each ranking metric captures only some aspect of the consistency with respect to certain social attribute, it is beneficial to combine them in order to more accurately identify what a user really needs. Our proposed rank aggregation model is an order-based technique with the weighted position-sensitive measurement.
For $\mathcal{P}$ images from the database, we have $R + 1$ ranking lists $\tau_r = [\tau_{r1}, ..., \tau_{r\mathcal{P}}]$, $\forall r = 0, ..., R$. We define a pair-wise preference matrix $Q_{i,j}$, $\forall i, j = 1, ..., \mathcal{P}$, which encodes whether the $i$th image is preferred over the $j$th image by considering their positions in all the ranking lists and the weights of the individual rankers $\omega = [\omega_0, ..., \omega_R]$. When performing retrieval for a user query, the documents are ranked according to their relevance. However, when the number $\mathcal{P}$ of documents is large, the user experience is determined by the top $P$ ranked results, where $P \ll \mathcal{P}$. In the traditional Kemeny rank aggregation procedure (or its weighted extensions) [46], we only need to calculate the preference score of the top $P$ documents, where $\tau_{ri}$, $i = 1, ..., P$, indicates the $i$th document index in the $r$th ranking list. However, some documents that are ranked at lower positions in many ranking lists can be inappropriately ranked at a higher position, because the relative importance of the top-ranked documents in the ranking lists is not adequately emphasized, and their relative importance over the lower-ranked documents has not been sufficiently observed in the top $P$ results. To alleviate these disadvantages, we revise the rank aggregation model in two respects. First, the preference score is made position-sensitive: a positive real-valued sensitivity parameter controls the range of position importance, and a small value places heavier relative importance on the top-ranked documents. This relative measurement ensures that the top-ranked documents are weighed more carefully when their positions may change in the aggregation procedure. Second, we collect more relative importance evidence by extending the observation range from $P$ to $\psi P$, where $\psi > 1$ and $\psi P \le \mathcal{P}$. We further consider the relative importance between the top $P$ documents and the $(P + 1)$-th to the $\psi P$-th documents, and encode their relative preference into $Q$. Our rank aggregation model can be seen as building up a "barrier" between the top $P$ documents and the bottom-ranked images, in order to refine the top $P$ by collecting more observations and to prevent the bottom-ranked documents from being ranked at a high position. A simple toy example is demonstrated in Fig. 2. We have the following theoretical analysis of our method.

Definition 1 The Extended Condorcet Criterion [46] requires that if there is any partition $\{L, R\}$ of a ranking list $\tau$ such that, for any $d_i \in L$ and $d_j \in R$, a majority of rankers prefer $d_i$ to $d_j$, then the aggregate ranking should prefer $d_i$ to $d_j$.

Theorem 2 Let $\tau$ be the final aggregation of the position-sensitive rank aggregation procedure. Then, $\tau$ satisfies the Extended Condorcet Criterion with respect to the input rankings $[\tau_0, \tau_1, ..., \tau_R]$.
The proof of Theorem 2 follows directly from Theorem 4.1 in [46], and the details are omitted in this paper. Our proposed rank aggregation approach satisfies neutrality, consistency, and the Extended Condorcet Criterion. The procedure of position-sensitive rank aggregation is described in Algorithm 2. Similar to the Kemeny optimal aggregation, our proposed rank aggregation model also has a good maximum likelihood interpretation, arguably an even better one, because we collect more observations of pair-wise preference in our framework. The rank aggregation result possesses the following properties: first, it is a more semantically consistent rank list than the one obtained using only the semantic ranking results; second, it satisfies the user's preference well. The complexity of the ranking procedure is $O(\frac{1}{2}P(P - 1) + \psi P^2) + O((R + 1)\,\mathcal{P} \log \mathcal{P})$, where the former is the complexity of calculating the preference matrix $Q$ and the latter is the complexity of Quicksort on the $\mathcal{P}$ images. In this paper, we empirically set $\psi = 2$ for all the experiments.
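To make the role of the weighted pair-wise preference matrix concrete, here is a minimal sketch of order-based aggregation. It is not the position-sensitive score or the "barrier" construction of Algorithm 2, whose exact formulas are not reproduced in this text. Instead it uses a plain weighted pair-wise majority and orders items by their number of pair-wise wins (a Copeland-style stand-in). The function names, the tie-breaking by item index, and the toy rankings are assumptions.

```python
def pairwise_preference(rankings, weights):
    """Q[a][b] = total weight of the rankers that place item a before item b."""
    items = sorted(rankings[0])
    pos = [{item: p for p, item in enumerate(r)} for r in rankings]
    return {
        a: {
            b: sum(w for w, p in zip(weights, pos) if p[a] < p[b])
            for b in items
            if b != a
        }
        for a in items
    }


def aggregate(rankings, weights):
    """Order items by weighted pair-wise wins; ties are broken by item index."""
    q = pairwise_preference(rankings, weights)
    half = sum(weights) / 2.0
    wins = {a: sum(1 for b in q[a] if q[a][b] > half) for a in q}
    return sorted(q, key=lambda a: (-wins[a], a))


# tau_0 is the semantic ranking; tau_1 and tau_2 are two preference rankings.
rankings = [[0, 1, 2, 3], [1, 0, 3, 2], [0, 2, 1, 3]]
weights = [1.0, 0.8, 0.5]
q = pairwise_preference(rankings, weights)
final = aggregate(rankings, weights)
```

In this toy case item 0 beats every other item by weighted majority, so any aggregation satisfying the Extended Condorcet Criterion must place it first, which the sketch does.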

Experimental results
In this section, we perform a systematic evaluation on two real-world social media datasets for the social media image retrieval task.
Datasets. The datasets used in this paper are the following: (1) The NUS-WIDE dataset [61] consists of 269,648 images collected from Flickr. We collect their social attributes by using their URLs linked to their original pages. Six types of low-level visual features are provided by the data provider. The 81-dim tag vectors of images are treated as the ground-truth queries. (2) The Flickr dataset consists of 3.5 million images collected from Flickr, covering wider visual topics than NUS-WIDE. We extract 5 types of visual features. Similarly, we collect the same social attributes by using their URLs linked to their original pages. We select 150 common queries from the query vocabulary as the associated ground-truth queries. For both datasets, we reweight each query dimension by its TF-IDF weight to enhance the descriptive power of the query inputs, and use the weighted value for each query dimension when it occurs in a query input. In addition, owing to the strong representation ability of deep convolutional neural networks, we also extract deep visual features using the standard VGG-19 network pretrained on the ImageNet dataset. Inspired by the feature extraction strategy in [24], we use conv2, conv4, conv5, fc6, and fc7 as the visual features that complementarily describe the visual content from low-level to high-level semantic abstraction. To deal with the high dimensionality, we perform PCA on these deep features.
Data partition. For NUS-WIDE data, we randomly select 15 thousand images as the training database and another 2 thousand queries as the training queries. We select 5 thousand queries from the remaining dataset as the testing queries, and the other images excluding the training database and testing queries are used as the testing database. Note that we do not use the training/testing partition provided by the NUS-WIDE data provider. Our scheme is more suitable for evaluating the generality of ranking model learning with a small amount of training data and a larger testing database. For Flickr data, we randomly select 50 thousand images as the training database and another 6 thousand queries as the training queries. We select 10 thousand queries from the remaining dataset as the testing queries, and the other images excluding the training database and testing queries are used as the testing database. Note that although a unified query vocabulary is used on both the training and testing datasets, the textual queries of the training dataset and testing dataset are still diversified and are not forced to be the same. This setting tends to be more practical and is able to verify the generalization of the compared approaches.
Compared approaches. We compare the following approaches for the tasks of query-to-image retrieval and personalized social media retrieval. Note that for the shallow models, we use both the shallow features and the deep features of the pretrained VGG-19 network for comparison.
(1) PAMIR-PR: PAMIR is a kernel-based discriminative text-to-image retrieval approach proposed by Grangier and Bengio [7]. We use the average kernel on the handcrafted shallow features for training PAMIR, and re-rank the text-to-image retrieval results by PAMIR using the preference ranking proposed in this paper.
(2) MMNN-PR: MMNN is a state-of-the-art cross-modal hashing approach proposed by Masci et al. [13] using a multi-layer neural network. We apply dimension reduction to the concatenated visual features to reduce the feature dimension to 300. We set the code length to 64 for the hash code learning, and re-rank the text-to-image retrieval results by MMNN using the preference ranking.
(3) SCM-PR: SCM is a semantic correlation model proposed by Rasiwasia et al. [2], which projects the text documents and image documents into a unified semantic space. In this paper, we only project the images into the semantic categories, where the number of categories is equal to the number of query dimensions. We do not project the query text into the semantic space, since the query text is extremely sparse. We re-rank the text-to-image retrieval results by SCM using the preference ranking.
(4) CMOS$_{lg}$-PR: CMOS is an online cross-modal retrieval method [24] which learns the asymmetric bilinear similarity by aggregating deep visual features from multiple layers. Among the three layer aggregation mechanisms, we report the results derived by layer gating, and use the retrieval results to re-rank.
(5) Deep-PR: It is expected that deep models trained in an end-to-end fashion can be a strong competitor for various tasks, including text-to-image retrieval. In our study, since the textual queries are combinations of words for the social media datasets, traditional deep learning models for text cannot be directly applied in our situation. Instead, given that the size of the query vocabulary is fixed, we use an FC layer to transform the query vector into a $K$-dimensional representation, and use conv1→fc6 of VGG-19 for visual feature extraction, where the parameters of VGG-19 are pretrained on the ImageNet dataset. Then, the fc6 layer is also connected to the $K$-dimensional representation to ensure that the similarity between images and textual queries can be measured. We train the model with the hard negative triplet loss, as in VSE++ and our model, to guarantee good accuracy.
We implement our model using both the hand-crafted visual features and the extracted deep CNN features, as described above. We test different versions of our model.
(1) OSIR$_s$: A simplified version of our proposed approach where the kernel weight is identical for all the query dimensions in the OMKR learning (OMKR-sim), i.e., we only need to learn $\beta_g$ and $f_g$ shared by all the query dimensions.
(2) OSIR: Our proposed approach, which learns $f_{mg}$ and $\beta_{mg}$, where $m = 1, ..., M$ and $g = 1, ..., G$.
Evaluation criteria. To measure the performance of query-to-image retrieval, we adopt the mean average precision (MAP). For the subjective study, we conduct a human evaluation of the preference aggregation results with three-level scores, measured by normalized discounted cumulative gain (NDCG): (1) 2: preferred and semantically relevant; (2) 1: semantically relevant; and (3) 0: irrelevant.
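The two evaluation measures can be sketched as follows. The paper does not state which variants it uses, so two assumptions are made here: average precision is normalized by the number of relevant items found in the evaluated list, and NDCG uses the exponential gain $2^g - 1$ with a $\log_2$ position discount, with the ideal ordering taken from the same judged list. The function names are illustrative.

```python
import math


def average_precision(relevant_flags):
    """AP of one ranked list; relevant_flags[i] is True if rank i+1 is relevant."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(flag_lists):
    """MAP: the mean of AP over all test queries."""
    return sum(average_precision(f) for f in flag_lists) / len(flag_lists)


def dcg(gains):
    """Discounted cumulative gain with exponential gain and log2 discount."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(gains, k):
    """NDCG@k for graded judgments such as the 0/1/2 scores described above."""
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


ap = average_precision([True, False, True])
perfect = ndcg_at_k([2, 1, 0], k=3)
reversed_order = ndcg_at_k([0, 1, 2], k=3)
```

A list that already places the score-2 images first reaches an NDCG of 1, while the reversed list is penalized because the preferred images appear late.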

Online learning of ranking models
We conduct experiments to study the ranking model training with respect to the following aspects. We evaluate three methods, i.e., PAMIR, OMKR-sim in OSIR$_s$, and OMKR in OSIR, since they are all online ranking models. All the experimental results in this section are reported on NUS-WIDE data.
Training error curves. We record the training error curves for each method in Fig. 3. The training error at the $t$th iteration is calculated by dividing the number of disordered training triplets (i.e., $q_t\,(F(d^+_t) - F(d^-_t)) \le 0$) by the total number of training triplets at the $t$th iteration. From Fig. 3a, we observe that our approach achieves much lower training error after receiving the first 5 thousand training triplets, and the training error continues to decrease more quickly than that of the other two approaches when receiving more triplets.

Fig. 3 The training error rate curves
The passive-aggressive learning procedure used by PAMIR possesses a convergence property similar to that of our online optimization procedure. During the first 1 thousand training iterations, the training error of PAMIR tends to be more unstable than that of the other two, as shown in Fig. 3b. The lower training errors can be explained by the fact that our OMKR has more weak learners with respect to different query dimensions and different kernels. Therefore, the results indicate that OMKR possesses lower model bias and is more likely to converge to the "ideal model."
Number of support vectors. After receiving $T$ training triplets, the number of support vectors (i.e., triplets with non-zero weights) of the online ranking methods determines both the model complexity and generality. We record the ratios of support vectors after receiving $T$ triplets in Table 1. OMKR has the most compact support vector set among all the approaches, which means OMKR ranks the training triplets more correctly. Therefore, fewer training triplets are incorporated as support vectors. Another reason is the shrinking operation in Algorithm 1, which can generally filter about $0.15\,T_b$ support vectors each time.
Number of training data. We evaluate the potential of the three models when increasing the number of training queries and the number of training triplets. The results are shown in Fig. 4. In Fig. 4a, we randomly select 512 training queries to generate different numbers of training triplets, and randomly select 3 thousand test queries to measure the MAP with respect to different numbers of triplets. In Fig. 4b, we fix the number of training triplets generated on each training query at 300, and select different numbers of training queries to train the three online learning models. We evaluate MAP on the same test queries as in Fig. 4a. From the result curves, we observe that the ranking performance can be enhanced by increasing both the number of triplets and the number of different queries. Comparatively, increasing the number of queries tends to produce a higher performance gain, since the ranking models benefit from capturing more patterns in the query inputs. The performance gain of OMKR from increasing the training data is higher than that of the other two models.
Training time. We record the training time of the three methods by using 50 thousand training triplets in Table 2. The training time consumptions of the three methods are mainly determined by the numbers of support vectors and complexity of kernel calculation. Although OMKR has more weak learners with respect to each kernel and each query dimension, its time consumption does not grow very significantly since the model has lower ratio of support vectors. The time efficiency of OMKR can be attributed to its model generality and the support vector shrinking in Algorithm 1. Moreover, it can be observed that the model using deep features consumes more time than using shallow features, because of the fact that it takes more time to calculate kernels using deep features with higher dimension.
The kernel weight learning mechanism. One of the most appealing properties of OMKR is its query word-specific weighted kernel combination. To evaluate how the kernel weight learning scheme works, we calculate the accumulated kernel weights with respect to each feature channel and each query dimension in Fig. 5a and b, respectively. The results are reported on the NUS-WIDE data, while similar observations can be made on the Flickr data. In Fig. 5a, the accumulated weight of the BOW feature is larger than that of the global features. Color histogram (CH) performs the worst: its accumulated weight is smaller than the others. The accumulated weight of edge histogram (EDH) is the second largest, because the texture statistics delivered in each image are informative in identifying visual objects. The result is consistent with the empirical judgment on the feature effectiveness.
In contrast, the accumulated weight distribution with respect to different query dimensions is much more imbalanced, as shown in Fig. 5b. Some query dimensions possess much larger accumulated weights, e.g., swimmers, computer, whales, elk, and earthquake. The reason may be threefold: (1) the query dimensions with higher weights are easy to distinguish, (2) many images carry these query words, which produces a larger number of training triplets, and (3) these queries usually co-occur with other queries and thus borrow more discriminative information from other rankers. The results in Fig. 5b can also be considered as an informative query word selection procedure which identifies the most important query words on a large-scale social media dataset.

Retrieval performance
We perform extensive experiments to evaluate the retrieval performance of all the compared methods. The MAP measurements on the top 500 retrieved results of the different methods are shown in Table 3. We denote the retrieval results by semantic ranking of the original retrieval method (PAMIR, MMNN, CMOS$_{lg}$, Deep, OSIR, etc.) as SR, and the rank aggregation of SR with the re-ranking results of the surrounding text as SR+BOW and SR+GlV, which use the Bag-of-Words and GloVe embedding features of the surrounding text, respectively. Similarly, SR+lc is used for location, SR+tm for time, SR+gp for group, SR+ctg for category, SR+id for user ID, and SR+all for the weighted aggregation using SR and all the preference ranking lists. By appropriately aggregating the semantic ranking and preference ranking results, our approach achieves much better retrieval performance than the other approaches on both datasets. First, the results indicate that different social attributes carry different implications on the true semantics of the social media images. For example, by aggregating SR and location (SR+lc), the retrieval performance of all the compared approaches is improved over SR. By aggregating SR and user ID (SR+id), our approaches consistently obtain improved results, while other approaches may perform worse on either NUS-WIDE or Flickr. The upload time information is less relevant to the semantics, since the results of aggregating SR and time (SR+tm) usually underperform the results on SR.
In general, we observe from the results that, among all the social attributes considered in this paper, the attributes with richer semantics, despite being noisy in some cases, tend to be better at refining the results. For example, among all the social attributes, the social tag (ctg) and surrounding texts (GlV with GloVe feature) tend to be the most effective attributes for enhancing the re-ranking performance. The reason is straightforward: incorporating the affinity structure into the re-ranking model can be seen as introducing more semantic information expressed by different users, so that the re-ranked results can be significantly improved and better reflect the user preference.
On the other hand, introducing social attributes that are less semantically relevant would introduce inappropriate relation information among images. For example, the upload time of different images is not a semantics-related feature, as it only encodes the temporal co-occurrence pattern of different images. In some cases, these adjacencies do reflect certain cues on popularity. For instance, an image tends to be more popular if it is uploaded next to a very popular image, and a group of images uploaded by the same popular user on the social image Website tends to be more popular than images from other users. However, in other cases, these adjacencies reflect nothing, mainly due to their low correlation with the true semantic meaning. Therefore, the performance enhancement brought by using these social attributes tends to be less statistically significant.
Second, the results imply that, by preference ranking and aggregating, the performance of all the semantic-based models can be enhanced by incorporating multiple heterogeneous social attributes. For example, all the approaches perform better on SR+all than on SR on both datasets. Generally, by aggregating more social attributes, the retrieval performance of all the methods on SR+all exceeds that of rank aggregation with a single social attribute, e.g., SR+ctg.
Third, we observe that different social attributes contribute differently to different types of queries. Specifically, we observe that if the query contains words indicating location information, the re-ranked results may be better refined. In contrast, the frequently occurring query words generally do not contain words from the time, user, and group attributes, so that these attributes tend to perform equally for most queries.
Last, the rank aggregation results on the Flickr dataset show that, when processing large-scale social media with weak semantic information such as noisy tags, fusing the semantic relevance delivered in different social attributes boosts the retrieval performance in a more promising manner. This claim is based on the observation that all of the compared approaches perform at least 15% better on SR+all vs. others on the Flickr dataset. Although some attributes may even lead to a performance degradation compared to the original semantic-relevance ranking results, the average aggregated results still tend to be better, due to the robustness of our rank aggregation technique.
Failure case. We provide some discussion of the failure cases. We observe that the semantic ranking accuracy has a direct influence on the final re-ranked results. If the truly relevant images are not ranked within the top 10 positions, then the re-ranking would also fail, or even push the truly relevant images backward for a small number of queries. Further study is required to address this issue and ensure better robustness of the rank aggregation.

Subjective study of preference fitting
For the subjective study, given a specific query, the users are served with the same set of results without any post-processing. In the offline evaluation setting, we are unable to provide any tailored results for the subjective study because we do not have any user preference information for a specific subject. In fact, the key idea of the subjective study is to provide as many ranking choices as possible to users, and see which ranking result they prefer. This may be a little different from the traditional view of recommendation, where the user preference has to be obtained for measuring the user-item similarity. In our retrieval study, if the provided top-ranked results are more diversified while staying semantically relevant, the results may be appreciated by as many users as possible.
To this end, we randomly select 100 queries from both datasets, and ask ten ordinary users to provide weight specifications on the social attributes and to judge whether the returned aggregate ranking results better reflect what they really like. The evaluation results are recorded in terms of NDCG@50 in Table 4, where SG denotes rank aggregation with a single social attribute, and MP denotes rank aggregation with multiple social attributes. Experiments show that our approach better facilitates the diversified preference styles of online users, as it outperforms all the other approaches under different settings. The promising performance can be attributed to the good semantic retrieval performance and to the position-sensitive rank aggregation, which keeps the top-ranked results appropriately placed in the final ranking.

Parameter sensitivity
The weight $\omega$ in rank aggregation. The setting of the weight $\omega$ determines the rank aggregation performance. Existing approaches estimate the weights of multiple ranking results according to their retrieval performance with respect to a certain criterion such as MAP [46]. We adopt a similar tuning procedure through a validation process. Consequently, we set $\omega_0 = 1$ in any type of rank aggregation. The results in Table 3 are based on the following setting: On NUS-WIDE data, $\omega_1 = 0.8$ for SR+txt, $\omega_2 = 0.5$ for SR+lc,
The penalty $C$ of online learning. We empirically set $C\,|D_{tr}| = 1$ for PAMIR, OMKR-sim, and OMKR, since this setting guarantees good model generality.
The kernel coefficients of OMKR. According to Theorem 1, the performance of OMKR mainly depends on the performance of the best learner. We conduct a cross-validation process to tune the kernel coefficients. Details are omitted due to space limit.
The weight $\mu$ of preference modeling. This parameter determines how strongly the semantic ranking results are re-ranked towards the preference consistency. When $\mu$ is small, the preference re-ranking is similar to the semantic ranking, which means that rank aggregation is unnecessary. When $\mu$ is large, the preference re-ranking tends to be cluttered. A reasonable range for $\mu$ is [0.4, 0.6]. In all the experiments, we set $\mu = 0.5$ for a better trade-off between semantic divergence and consistency.

Findings and discussions
Retrieval examples. We provide some examples on NUS-WIDE data in Fig. 6. Given each textual query, we show the top 10 retrieved images with respect to semantic ranking and different preference styles, where each row indicates the corresponding ranking results. The semantically relevant images are marked with red dots. Although all the top ranked results are semantically relevant, their preference ranking tends to be diversified. For example, in the first example with query "animal, flowers, " all the ranking lists appreciate the panda image as the top ranked image. But the results from the 4th to the 10th tend to be diversified. The aggregated ranking results are most appreciated since the top ranked images are more semantically consistent than other ranking strategies.
More efficient retrieval on large-scale data. When retrieving large-scale social media data, it is time-consuming to conduct preference re-ranking and rank aggregation. To address this concern, a simple scheme can be used to reduce the retrieval complexity. Specifically, when processing single-word queries, we quickly select the P images whose semantic projection values on the non-zero query dimension are larger than a predefined threshold, where P is much smaller than the database size N. When processing multi-word queries, we quickly select P_1 images based on the scores of the first non-zero query dimension, and then select P_2 images in the same way from the P_1 selected images, where P_2 ≪ P_1 ≪ N. The top ranked images with high semantic relevance can thus be identified quickly by the much more efficient "find" operations instead of inner product and sorting operations over the whole database.
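The candidate selection scheme above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the (N, D) score matrix layout, the function names select_candidates and rank_candidates, and the per-dimension thresholds are all hypothetical. The sketch filters images dimension by dimension with thresholded "find" operations and only then computes inner products and sorts on the small surviving set.

```python
import numpy as np

def select_candidates(scores, query_dims, thresholds):
    """Cascade of threshold filters over the non-zero query dimensions.

    scores: (N, D) matrix of semantic projection values (assumed layout).
    query_dims: indices of the non-zero query dimensions, in query order.
    thresholds: predefined threshold for each dimension (assumed given).
    Returns the indices of the images that pass every filter.
    """
    candidates = np.arange(scores.shape[0])
    for d in query_dims:
        # "find" step: keep only candidates above the threshold on dimension d.
        keep = scores[candidates, d] > thresholds[d]
        candidates = candidates[keep]
    return candidates

def rank_candidates(scores, query, candidates):
    """Inner product and sort restricted to the selected candidates."""
    relevance = scores[candidates] @ query
    order = np.argsort(-relevance, kind="stable")
    return candidates[order]

# Toy database: N = 4 images, D = 2 query dimensions.
scores = np.array([[0.9, 0.8],
                   [0.2, 0.9],
                   [0.7, 0.1],
                   [0.8, 0.6]])
thresholds = np.array([0.5, 0.5])

single = select_candidates(scores, [0], thresholds)      # single-word query
multi = select_candidates(scores, [0, 1], thresholds)    # multi-word query
ranked = rank_candidates(scores, np.array([1.0, 1.0]), multi)
```

In this toy run the single-word filter keeps P = 3 images, and the multi-word cascade narrows P_1 = 3 down to P_2 = 2 before any sorting is done, so the expensive operations touch only a small fraction of the database.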
The query word patterns. We observe from the retrieval results that the retrieval performance of multi-word queries is generally higher than that of single-word queries. When the user queries are multi-word, the retrieval results tend to be boosted by involving more weak learners from different query dimensions. This phenomenon can be attributed to the tag co-occurrence present in real social media image data. Our approach is capable of capturing such correlations among the complicated patterns of user queries.

Conclusion
In this paper, we proposed OSIR, a solution framework that accommodates diversified preference styles in social media image search by combining heterogeneous information sources. First, we proposed an efficient Online Multiple Kernel Ranking model constructed on multiple query dimensions and complementary feature channels. By optimizing the ranking performance, the semantic consistency between the image ranking and the textual query input is directly maximized without relying on an intermediate semantic annotation procedure. Second, we constructed random walk-based preference modeling by domain-specific similarity calculation on heterogeneous social attributes. By re-ranking the rank output of OMKR based on each of the preference models, we obtained a set of ranking lists encoding different potential aspects of user preference. Last, we proposed an effective and efficient position-sensitive rank aggregation approach to aggregate multiple ranking results based on the user's preference specification. Extensive experiments on two social media datasets have demonstrated the advantages of our approach in both retrieval performance and user experience. In future work, we will investigate how to model online user behaviors more comprehensively in order to better accommodate user preferences.