Minimal residual ordinal loss hashing with an adaptive optimization mechanism

Binary coding techniques are widely used in approximate nearest neighbor (ANN) search tasks. Traditional hashing algorithms treat all binary bits equally, which usually causes ambiguous rankings. To solve this issue, we propose a bitwise weight method dubbed minimal residual ordinal loss hashing (MROLH). Unlike methods built on a two-step mechanism, MROLH simultaneously learns binary codes and bitwise weights through a feedback mechanism. When the algorithm converges, the binary codes and bitwise weights are well adapted to each other. Furthermore, we establish an ordinal relation preserving constraint based on quartic samples to enhance the power of preserving relative similarity. To decrease the training complexity, we utilize a tensor ordinal graph to represent the quartic ordinal relation, and the original objective function is approximated by one based on triplet samples. We also assign different weight values to training samples. During training, the weight of each sample is initialized to the same value, and we iteratively boost the weights of the samples whose relative similarity is not well preserved. As a result, we can minimize the residual ordinal loss. Experimental results on three large-scale ANN search benchmark datasets, i.e., SIFT1M, GIST1M, and Cifar10, show that the proposed MROLH achieves superior ANN search performance in both the Hamming space and the weighted Hamming space over the state-of-the-art approaches.


Introduction
The aim of hashing algorithms [1][2][3][4][5][6] is to learn binary representations of data which preserve their original similarity relationships in the Hamming space. Thus, hashing algorithms can retrieve the nearest neighbors of a query according to Hamming distances. Owing to their advantages in storage and computation, hashing algorithms have recently become popular in various computer vision and artificial intelligence applications, e.g., image retrieval, object detection, multi-task learning, linear classifier training, and active learning.
*Correspondence: zhwang@sdut.edu.cn. School of Computer Science and Technology, Shandong University of Technology, Zibo, 255000, China. Full list of author information is available at the end of the article. (2020) 2020:10
We roughly divide existing hashing algorithms into data-independent and data-dependent ones. Data-independent hashing, such as locality-sensitive hashing (LSH) [7], randomly generates hashing functions, and it typically requires long binary codes or multiple hash tables to achieve satisfying performance. In contrast, data-dependent hashing algorithms, such as BDMFH [8] and ARE [9], utilize machine learning mechanisms to learn similarity preserving binary codes. Bidirectional discrete matrix factorization hashing (BDMFH) [8] alternates two mutually promoted processes: learning binary codes from data and recovering data from the binary codes. To force the learned binary codes to inherit the intrinsic structure of the original data, BDMFH designs an inverse factorization model. The angular reconstructive embeddings (ARE) method [9] learns binary codes by minimizing the reconstruction error between the cosine similarities computed from the original data and those computed from the binary embeddings. Usually, data-dependent hashing can obtain excellent approximate nearest neighbors (ANN) search performance with compact binary codes. Furthermore, according to the similarity preserving restriction, data-dependent hashing can be divided into the absolute similarity preserving hashing [10,11] and the relative similarity preserving hashing [6,12]. The former emphasizes that the Hamming distances of similar data pairs should be small enough, and it is suited to the semantic neighbor search task. The relative similarity preserving hashing demands that the ranking orders of data in different spaces be consistent with each other. 
Thus, the relative similarity preserving hashing can achieve better ANN search performance. Traditional hashing algorithms treat each binary bit equally, which causes ambiguous rankings. For $M$-bit binary codes, there are $C_M^m$ distinct codes that share the same Hamming distance $m$ to a query sample. To further explain this phenomenon, we give a simple example in Fig. 1.
In Fig. 1, $H = \{h_1(x), h_2(x)\}$ represents a set of linear hashing functions, which maps each of $x_1$, $x_2$, and $x_3$ to a 2-bit binary code. If each binary bit is considered equally important, $x_2$ and $x_3$ have the same Hamming distance to $x_1$. As a result, $x_2$ and $x_3$ will be returned simultaneously when retrieving the nearest neighbors of $x_1$ in the Hamming space. However, the similarity degrees of $(x_1, x_2)$ and $(x_1, x_3)$ are different in the Euclidean space. To avoid such an ambiguous situation, bitwise weight methods assign different values to each binary bit. Thus, the similarity degrees of data pairs with the same Hamming distance can be distinguished by their weighted Hamming distances. In Fig. 1, according to the distribution of the query data $x_1$ and the hashing functions, a larger weight value is assigned to $h_2(x)$. As a result, the weighted Hamming distance of $(x_1, x_2)$ is larger than that of $(x_1, x_3)$, and $x_3$ is returned first when retrieving the nearest neighbors of $x_1$. As described above, to further distinguish the ranking orders of the data with the same Hamming distance to a query, we should take the importance of bits into consideration. However, the bitwise weight methods, such as QaRank [13,14], QsRank [15], WhRank [16], and QRank [17], focus only on learning bitwise weights by a two-step mechanism. In this setting, these methods first generate binary codes by an existing hashing method (e.g., LSH [7] or ITQ [10]) and then generate bitwise weights according to the learnt codes. This two-stage scheme separates the learning of binary codes from the learning of bitwise weights, so their performances cannot be iteratively boosted.
Fig. 1 The hashing functions $H = \{h_1(x), h_2(x)\}$ map the data to 2-bit binary codes. Without considering the importance of binary bits, $x_2$ and $x_3$ share the same Hamming distance to $x_1$
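The ambiguity in the Fig. 1 example can be illustrated with a minimal sketch. The 2-bit codes and the weight values below are hypothetical stand-ins chosen to mimic the figure (a larger weight on the second bit); they are not values taken from the paper.

```python
from math import comb


def hamming(a, b):
    """Number of positions at which two equal-length bit codes differ."""
    return sum(x != y for x, y in zip(a, b))


def weighted_hamming(a, b, weights):
    """Sum of the weights of the bits on which the two codes differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


# Hypothetical 2-bit codes: x2 differs from x1 on bit 2, x3 on bit 1.
x1, x2, x3 = (1, 1), (1, 0), (0, 1)
weights = (0.3, 0.7)  # assumed values, with the larger weight on h2

tie = hamming(x1, x2) == hamming(x1, x3)   # plain Hamming cannot rank x2 vs x3
d12 = weighted_hamming(x1, x2, weights)    # pair (x1, x2)
d13 = weighted_hamming(x1, x3, weights)    # pair (x1, x3)

# Number of 32-bit codes at Hamming distance 2 from any fixed query code.
codes_at_distance_2 = comb(32, 2)
```

With these assumed weights, the tie between $x_2$ and $x_3$ is broken and $x_3$ ranks first, which matches the behavior described for Fig. 1.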
In this paper, we propose a novel bitwise weight method dubbed minimal residual ordinal loss hashing (MROLH), whose flowchart is shown in Fig. 2. To enhance the power of preserving relative similarity, we define the ordinal relation preserving objective function based on quartic samples in (a). In (b), we transform the constraint and utilize a tensor ordinal graph to decrease the training time. Unlike most hashing methods, we simultaneously learn the relative similarity preserving binary codes and bitwise weights with a feedback mechanism in steps (c), (e), and (f). During the iterative training process, we update the weights of the data whose relative similarity is not well preserved in steps (d) and (g), which minimizes the residual performance loss. We compare the proposed MROLH against various state-of-the-art hashing methods on three widely used benchmarks, SIFT1M [18], GIST1M [19], and Cifar10 [20]. Quantitative experiments demonstrate that our algorithm achieves the best ANN search performance in both the Hamming space and the weighted Hamming space.
The main contributions of this paper include:
1. Both binary codes and bitwise weights are required to preserve the original relative similarity of training data, and we establish the similarity preserving constraint based on quartic samples to enhance the power of preserving ordinal relation.
2. To decrease training time complexity, we embed the quartic ordinal relationship into a triplet one and utilize a tensor product graph to approximate the ordinal set.
3. During the iterative training process, we jointly learn binary codes and bitwise weights by a feedback mechanism to make them well adaptive to each other, and fix the problem of residual performance loss by boosting the weights of the data whose ordinal relation is not well preserved.
The rest of this paper is organized as follows: In Section 2, we briefly overview the relative similarity preserving hashing and the bitwise weight methods. Section 3 describes the proposed MROLH with three innovation measures. In Section 4, we show and analyze the comparative experiments on three large datasets. Finally, we conclude this paper in Section 5.

Related work
In this paper, we mainly focus on two issues: (a) how to preserve the original ordinal relation in the Hamming space, and (b) how to make binary codes and bitwise weights well adaptive to each other. To solve problem (a), we demand that binary codes and bitwise weights preserve the relative similarity. However, almost all existing relative similarity preserving restrictions are defined based on triplet samples, which yields inferior ANN search performance. Minimal loss hashing [21] defines a hinge-like loss to penalize similar (dissimilar) data pairs with a large (small) Hamming distance, and it optimizes a convex-concave upper bound of the objective function using a perceptron-like learning procedure. Triplet loss hashing [22] and listwise supervision hashing [23] directly demand that the Hamming distance among similar data points be smaller than that among dissimilar data points. Ordinal preserving hashing (OPH) [12] divides all training data into different clusters, and all cluster centers are involved in computing the performance loss. However, OPH demands that the training samples be uniformly distributed. Ordinal constraint hashing (OCH) [6] aims to minimize retrieval loss by preserving the ordinal relations of ranking tuples in the Hamming space. As the number of ranking tuples is quadratic or cubic in the size of the training set, it is difficult to build ranking tuples efficiently on a large-scale dataset. To fix this problem, OCH embeds the data into a space in which the original quartic order relation holds as a triplet order relation.
As Hamming distances are discrete integer values, many data pairs with different binary codes share the same distance value, which makes their relative similarity relationship hard to distinguish. To fix this issue, the bitwise weight methods assign different weight values to each bit. QaRank [13,14] learns bitwise weights by minimizing the intra-class distance while preserving the inter-class relationship computed from the original training samples. The bitwise weights in QsRank [15] are learned according to the probability of mapping training samples to specified codes, and the method is designed for PCA hashing. WhRank [16] takes the distribution of query samples into consideration, which can effectively distinguish the similarity relationship among data pairs with the same binary codes. The bitwise weights in QRank [17] relate to the discriminative ability of hashing functions and the distribution of query data. Most bitwise weight methods adopt a two-step mechanism, which first learns binary codes by a hashing method (such as LSH [7] or ITQ [10]) and then generates bitwise weights according to the learnt binary codes. As a result, the retrieval results obtained by weighted Hamming distances cannot feed back into the procedure of learning binary codes, so the binary codes and bitwise weights are not well adapted to each other, as in problem (b).

Methods
For $x \in \mathbb{R}^d$, we can map it into an $M$-bit binary code $B = \{b_1, \cdots, b_M\}$ by the hash functions $H(x) = \{h_1(x), \cdots, h_M(x)\}$, and the $m$th bit $b_m$ is calculated as $b_m(x) = \mathrm{sgn}(h_m(x))$. In this paper, $h_m(x)$ is a linear function.
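The encoding step above can be sketched as follows. This is a minimal illustration that assumes each linear function has the form $h_m(x) = v_m^T x$ without an offset term; the projection vectors and the convention of mapping zero to +1 are hypothetical choices made for the example.

```python
def sgn(value):
    """Map a real number to a bit in {-1, +1}; zero goes to +1 by convention here."""
    return 1 if value >= 0 else -1


def encode(x, projections):
    """Compute b_m = sgn(h_m(x)) with the assumed form h_m(x) = v_m . x for each bit."""
    code = []
    for v in projections:
        h = sum(vi * xi for vi, xi in zip(v, x))  # linear hash function h_m(x)
        code.append(sgn(h))
    return code


# Hypothetical projection vectors v_1 and v_2 for a 2-bit code in 3 dimensions.
projections = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
code = encode([2.0, 1.0, 3.0], projections)
```

Encoding a query in this way costs one inner product per bit, which is why the online stage stays cheap for linear hash functions.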
Generally, Hamming distances are utilized to perform an ANN search task. However, this usually causes an ambiguous ranking order [15][16][17]. To avoid this situation, we learn the bitwise weights $W(x) = \{w_1(x), \cdots, w_M(x)\}$ of data $x$, where $w_m(x)$ represents the $m$th bit weight function.
In this paper, to ensure the hashing functions H(x) and bitwise weight functions W (x) have an excellent performance, we propose three innovation measures which are described in Sections 3.1, 3.2, and 3.3.

The ordinal relation preserving constraint based on quartic samples
As discussed in many previous works [6,12], both the absolute similarity preserving hashing and the relative similarity preserving hashing based on triplet samples perform poorly in retrieving approximate nearest neighbors. In contrast, we demand that binary codes and bitwise weights satisfy the ordinal relation preserving constraint defined on quartic samples as in Eq. (1), which directly maximizes the number of data points in the set $C$ whose ordinal relation is well preserved.
The elements of $C$ are the quartic samples which satisfy the ordinal relationship defined in the Euclidean space. $I(\cdot)$ is the indicator function: it returns 1 if the condition is satisfied and 0 otherwise.
For the problem defined in Eq. (1), the primary question is how to construct the ordinal relation preserving set $C$. Generally, we can establish the set $C$ by collecting similar data pairs and dissimilar ones. However, it is hard to define the similarity relationship. To fix this problem, we adopt a tensor product graph $G$ to represent the ordinal relationship of quartic samples as in Eq. (2):
$$G = S \otimes DS \quad (2)$$
The definition of graph $S$ is shown in Eq. (3), which utilizes the distance value to indicate the similarity relationship.
$DS$ represents the dissimilarity graph, and the value of $DS(i, j)$ is computed as in Eq. (4).
$\otimes$ represents the Kronecker product of matrices, so that $G(ij, kl) = S(i, j) \cdot DS(k, l)$. As a result, the values in $G$ can represent the similarity relationship of quartic samples as in Eq. (5).
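The relation $G(ij, kl) = S(i, j) \cdot DS(k, l)$ can be checked with a small sketch of the Kronecker product. The entries of `S` and `DS` below are hypothetical, since Eqs. (3) and (4) are not reproduced here. The code uses the standard Kronecker layout, in which the product $S(i, j) \cdot DS(k, l)$ is stored at row $i \cdot n + k$ and column $j \cdot n + l$; how the paper flattens the index pairs is an assumption.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    rows_b, cols_b = len(b), len(b[0])
    out = [[0.0] * (len(a[0]) * cols_b) for _ in range(len(a) * rows_b)]
    for i, row_a in enumerate(a):
        for j, a_ij in enumerate(row_a):
            for k, row_b in enumerate(b):
                for l, b_kl in enumerate(row_b):
                    # Block (i, j) of the result is a_ij times the whole matrix b.
                    out[i * rows_b + k][j * cols_b + l] = a_ij * b_kl
    return out


# Hypothetical similarity and dissimilarity graphs over n = 2 samples.
S = [[1.0, 0.5], [0.5, 1.0]]
DS = [[0.0, 2.0], [2.0, 0.0]]
G = kron(S, DS)  # n^2 x n^2 matrix holding one entry per quartic (i, j, k, l)
```

For $n$ samples, $G$ has $n^4$ entries, which is the source of the construction cost discussed next.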
As described above, the ordinal relation preserving set $C$ can be constructed according to the tensor ordinal graph $G$. However, for massive samples, the construction time complexity is high. So, we further transform the ordinal relation constraint as shown in Eq. (6). Then, a mapping function can be defined as $u_i = Z x_i \in \mathbb{R}^{d_{svd}}$, and the ordinal relation constraint can be written as in Eq. (7).
Finally, the objective function defined in Eq. (8) is utilized to learn binary codes. The set $\hat{C}$ can be easily constructed by selecting the elements of $G$ whose values are smaller than 1.
Similarly, the ordinal relation preserving restriction for bitwise weights is re-defined as in Eq. (9).

Minimal residual loss
In traditional algorithms, the weights of samples remain unchanged during the training process. As a result, each hash function and bitwise weight only tries to minimize the performance loss induced by itself, and the residual loss caused by the preceding ones is totally ignored. To fix this problem, we propose to iteratively boost the weights of the data whose similarity relationship is not well preserved.
Initially, we set the weight of each data point to $1/n$ ($n$ is the number of training samples), and we utilize Eq. (10) to update the weights during the training process.
$\pi_m^r(x_i)$ is the weight of $x_i$ for the $m$th hash function or bitwise weight function during the $r$th training procedure. $T(x_i)$ returns 0 if the similarity relationship between $x_i$ and its nearest neighbors is preserved; otherwise, it returns 1. The definition of $\xi_m^r$ is shown in Eq. (12).
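The reweighting idea can be sketched as follows. Because Eqs. (10) to (12) are not reproduced here, the code substitutes a generic AdaBoost-style multiplicative update as a stand-in; the function name `boost_weights`, the coefficient `alpha`, and the normalization step are assumptions for illustration and may differ from the update rule used in the paper.

```python
import math


def boost_weights(weights, failed, eps=1e-12):
    """Raise the weights of samples with T(x_i) = 1 and renormalize.

    weights: current data weights, assumed to sum to 1.
    failed:  flags T(x_i) in {0, 1}; 1 means the ordinal relation was not preserved.
    Assumes the weighted failure rate is below 0.5, as in standard AdaBoost.
    """
    error = sum(w for w, t in zip(weights, failed) if t)  # weighted failure rate
    error = min(max(error, eps), 1.0 - eps)               # keep the logarithm finite
    alpha = 0.5 * math.log((1.0 - error) / error)         # AdaBoost-style coefficient
    raised = [w * math.exp(alpha if t else -alpha) for w, t in zip(weights, failed)]
    total = sum(raised)
    return [w / total for w in raised]


n = 4
weights = [1.0 / n] * n   # uniform initialization, as stated in the text
failed = [0, 1, 0, 0]     # hypothetical outcome: only the second sample is poorly preserved
updated = boost_weights(weights, failed)
```

After the update, the poorly preserved sample carries more weight, so the next hash function or bitwise weight function is pushed to correct it.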
After introducing the data weights, we redefine the objective functions for learning hash functions and bitwise weight functions as in Eqs. (13) and (14), respectively. $\pi_M^R(x_i)$ is the weight value of sample $x_i$ when the algorithm converges.

Joint optimization
To make binary codes and bitwise weights well adaptive to each other, we propose a joint optimization mechanism, and the objective function is defined as in Eq. (15). During the training process, we iteratively optimize the parameters of hash functions and bitwise weight functions.
In this paper, the sign function is utilized to generate discrete integer values, which makes optimizing the objective function an NP-hard problem. To solve this issue, we adopt $\tanh(\cdot)$ to approximate the $\mathrm{sign}(\cdot)$ function. Then, the binary code is re-defined as $B(x_i) = \tanh(V^T x_i)$. Thus, we can compute the Hamming distance and the weighted Hamming distance by Eqs. (16) and (17), respectively. Here, $M$ is the number of binary bits, and the product between codes is the bitwise product operation.
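A minimal sketch of the relaxation is given below. Since Eqs. (16) and (17) are not reproduced here, the code assumes the standard identities for codes in $\{-1, +1\}^M$: the Hamming distance equals $(M - b_i^T b_j)/2$, and the weighted variant sums $w_m (1 - b_{im} b_{jm})/2$ over the bits. The function names and the sample values are hypothetical.

```python
import math


def relaxed_code(x, V):
    """B(x) = tanh(V^T x); V is stored as one projection vector v_m per bit."""
    return [math.tanh(sum(vi * xi for vi, xi in zip(v, x))) for v in V]


def relaxed_hamming(bi, bj):
    """(M - <bi, bj>) / 2; equals the Hamming distance when entries are exactly +/-1."""
    return 0.5 * (len(bi) - sum(p * q for p, q in zip(bi, bj)))


def relaxed_weighted_hamming(bi, bj, w):
    """Sum over bits of w_m * (1 - bi_m * bj_m) / 2, using the bitwise product of codes."""
    return 0.5 * sum(wm * (1.0 - p * q) for p, q, wm in zip(bi, bj, w))


# Exact +/-1 codes that differ on the last two bits.
b1, b2 = [1, -1, 1], [1, 1, -1]
d_h = relaxed_hamming(b1, b2)
d_w = relaxed_weighted_hamming(b1, b2, [0.2, 0.3, 0.5])

# Large projections push tanh close to +/-1, approximating the sign function.
soft = relaxed_code([1.0, 1.0], [[10.0, 0.0], [0.0, -10.0]])
```

Both distances are differentiable in the projection parameters once tanh replaces the sign function, which is what makes the gradient-based updates below possible.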
When learning the $m$th hash function during the $r$th training procedure, the partial derivative of the objective function is shown in Eq. (21).
For the parameter $v_m$, the partial derivatives of the Hamming distance function and the weighted Hamming distance function can be computed as in Eqs. (22) and (23).
As a result, the parameter $v_m$ can be updated by Eq. (24) during the $r$th training procedure. Similarly, for the parameter $w_m$, the partial derivative of the objective function is shown in Eq. (25).
During the iterative training procedure, we can compute the value of $w_m$ by Eq. (27).
The iterative process for learning the hash functions and bitwise weight functions which preserve the ordinal relation is described in Algorithm 1.

Results and discussion
In this section, we describe the ANN search comparative experiments.

Experimental setting
We conduct the comparative experiments on three large datasets, SIFT1M [18], GIST1M [19], and Cifar10 [20], which are widely used in ANN search experiments. The SIFT1M dataset contains 1 million SIFT descriptors [24] with 128 dimensions, and 100,000 of them are used as training samples. We also randomly select 10,000 features from the SIFT1M dataset as query samples. In the GIST1M dataset, there are 1 million 320-dimensional GIST descriptors [25], and we choose 50,000 and 10,000 data points as training and query samples, respectively. The Cifar10 dataset contains 60,000 GIST features with 320 dimensions, and 50,000 samples are used as the training set. Correspondingly, the number of query samples in the Cifar10 dataset is 10,000.
Algorithm 1 (excerpt)
    ...
    repeat
        Compute $v_m$ according to Eq. (24).
        Compute $w_m$ according to Eq. (27).
        Update the data weights according to Eq. (10).
    until convergence
    for i = 1 : n do
        Assign the weight of $x_i$ to $\pi_{m+1}^0(x_i)$.
    end for
end for
The baseline methods include two kinds of algorithms: binary code methods and bitwise weight methods. Locality-sensitive hashing (LSH) [7], iterative quantization hashing (ITQ) [10], and k-means hashing (KMH) [11] generate absolute similarity preserving binary codes. In contrast, ordinal constraint hashing (OCH) [6] aims to preserve the relative similarity in the Hamming space. QRank [17] and WhRank [16] assign different weights to each binary bit, which can be applied to further boost the ANN search performance of the binary code methods.
We use mAP and recall to evaluate the ANN search performance. The recall criterion alone cannot express the position at which the $i$th positive data point is ranked. To fix this problem, the mAP criterion defined in Eq. (29) is adopted, where $|Q|$ represents the number of query samples, $K_i$ is the number of ground-truth samples of the $i$th query, and $\mathrm{rank}(j)$ is the ranking position of the $j$th true positive sample in the retrieval results.
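The two criteria can be sketched as follows. The code assumes the common definitions: recall at cutoff $k$ is the fraction of ground-truth samples found among the first $k$ results, and the average precision of the $i$th query is $\frac{1}{K_i}\sum_{j} \frac{j}{\mathrm{rank}(j)}$, averaged over the $|Q|$ queries. The function names and the toy rankings are hypothetical, and the exact form of the paper's equations may differ.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Fraction of the ground-truth items that appear in the top-k results."""
    truth = set(ground_truth)
    hits = sum(1 for item in retrieved[:k] if item in truth)
    return hits / len(truth)


def average_precision(retrieved, ground_truth):
    """(1 / K_i) * sum over true positives of j / rank(j), with 1-based ranks."""
    truth = set(ground_truth)
    found, total = 0, 0.0
    for position, item in enumerate(retrieved, start=1):
        if item in truth:
            found += 1                 # this is the j-th true positive
            total += found / position  # j / rank(j)
    return total / len(truth)


def mean_average_precision(all_retrieved, all_truth):
    """Average of the per-query average precision over all |Q| queries."""
    scores = [average_precision(r, t) for r, t in zip(all_retrieved, all_truth)]
    return sum(scores) / len(scores)


# Two hypothetical queries with their ranked results and ground truth.
rankings = [["a", "x", "b", "y"], ["x", "a"]]
truths = [["a", "b"], ["a"]]
r2 = recall_at_k(rankings[0], truths[0], k=2)
m_ap = mean_average_precision(rankings, truths)
```

Unlike recall, the average precision drops when a true positive is pushed further down the ranking, which is the property the text relies on.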

Experimental results
In this section, the data are separately mapped into 32-, 64-, and 128-bit binary codes, and their corresponding bitwise weights are learnt.
The purpose of hashing algorithms is to guarantee that the approximate nearest neighbors retrieved in the Hamming space are identical to those retrieved in the Euclidean space. Therefore, we consider a data pair's Euclidean distance as its true similarity degree, and we separately define the 10 and the 100 samples with the smallest Euclidean distances to a query as its ground truth. We show the experimental results in Tables 1, 2, and 3 and Figs. 3, 4, and 5. In the experimental results, MROLB denotes the retrieval results obtained according to the binary codes alone, and MROLH utilizes the bitwise weights to further improve the ANN search performance of MROLB. The results show that MROLB and MROLH obtain the best ANN search performance in the Hamming space and the weighted Hamming space, respectively. LSH [7] randomly generates hashing functions without a training process, and its performance does not improve noticeably as the number of binary bits increases. ITQ [10], KMH [11], and MROLB utilize a machine learning mechanism to generate compact binary codes which can achieve satisfying ANN search performance. ITQ [10] maps data points to the vertices of a hypercube. However, the vertices in ITQ [10] are fixed, and the encoding results are not adaptive to the data distribution. To fix this problem, KMH [11] learns encoding centers by simultaneously minimizing the quantization loss and the similarity loss. LSH [7], ITQ [10], and KMH [11] belong to the absolute similarity preserving hashing. In contrast, OCH [6] establishes an ordinal constraint to preserve the relative similarity among data points in the Hamming space. For the above hashing methods, the learning procedures of the binary bits are independent of each other, and the residual performance loss accumulated by former bits cannot be eliminated. To solve this problem, we propose to iteratively boost the weights of incorrectly encoded data during the training process. 
Furthermore, we establish an ordinal relation preserving constraint based on quartic samples, which enhances the power of preserving relative similarity. WhRank [16] can distinguish the similarity degrees of data pairs which have the same Hamming distance. Furthermore, the bitwise weights in QRank [17] and the proposed MROLH are sensitive to query data. As a result, for data pairs with the same binary code, their similarity degrees can be distinguished by QRank and MROLH. WhRank [16] and QRank [17] demand that the bitwise weights satisfy the absolute similarity preserving restriction, and they utilize fixed binary codes to learn bitwise weights. Different from WhRank and QRank, we simultaneously learn the binary codes and bitwise weights by minimizing the ordinal relation preserving loss. As a result, MROLH can preserve the relative similarity well in both the Hamming space and the weighted Hamming space, and the ANN search performance can be iteratively boosted by the feedback mechanism.

The efficiency and convergence
An excellent hashing method should encode raw data efficiently online and have a reasonable offline training time complexity [10]. Below, we discuss the time complexity of all compared methods. To encode a query online as an $M$-bit binary code, our algorithm, LSH [7], ITQ [10], OCH [6], and WhRank [16] need to compute the signs of the projections produced by $M$ linear functions, so they share the same time complexity of $O(M)$. In contrast, KMH [11] must compute and compare the distances between a query and $2^M$ centers, and its time complexity is $O(2^M)$. QRank [17] first transforms a query into an anchor representation and computes its similarities to $2^M$ landmarks in $O(r + 2^M)$, and then obtains query-adaptive weights by quadratic programming in polynomial time. Here, $r$ represents the number of anchors.
For the offline training stage, LSH [7] randomly generates $M$ linear hashing functions in constant time. QRank [17] represents $2^M$ landmarks using $r$ anchors in $O(2^M r)$.
The time complexity of WhRank [16] is $O(Mdk)$, where $k$ represents the number of nearest neighbors. ITQ [10] iteratively optimizes a rotation matrix with a linear time complexity. To decrease the training time complexity, OCH [6] and KMH [11] select only $n$ ($n \ll N$) samples with $d$ dimensions from all $N$ data points for their training procedures. In each iteration, KMH [11] computes and compares the distances between $n$ data points and $2^M$ centers, and the time complexity is $O(2^M nd)$. In contrast, the overall training complexity of OCH [6] is $O(tMn^3 d + nN)$, where $t$ is the number of iterations. For our algorithm, the training process includes three stages, and we discuss their time complexity in turn. Firstly, we adopt the k-means algorithm to select $n$ centers from $N$ training samples, which needs to compare the distances between each training data point and all cluster centers. Therefore, the time complexity of the first stage is $O(Nnd)$. Secondly, we utilize a gradient descent algorithm to minimize the performance loss, and the time complexity mainly depends on the number of training groups. Initially, a training group contains quartic items, and the number of groups is $n^4$. We then project the original set onto an approximate ordinal relation set built from triplet elements, which reduces the number of training groups to $n^3$. In addition, to map $d$-dimensional data to an $M$-bit binary code, hash functions with $Md$ parameters are learnt. As a result, the time complexity of the second stage is $O(Mn^3 d)$. Thirdly, to minimize the residual error, we update the weights of $n$ training samples before learning each hashing function, and the time complexity of this stage is $O(Mn)$. As described above, the overall training time complexity of our method is $O(Nnd + Mn^3 d + Mn)$.
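To give a feel for the relative size of the three terms in $O(Nnd + Mn^3 d + Mn)$, the following sketch evaluates them as raw operation counts. The sizes below, in particular the number of selected centers $n = 100$, are hypothetical values chosen for illustration; the paper's actual settings may differ.

```python
def training_cost_terms(N, n, M, d):
    """Return the three terms of O(Nnd + Mn^3 d + Mn) as raw operation counts."""
    return {
        "kmeans_selection": N * n * d,       # stage 1: compare N samples with n centers
        "gradient_descent": M * n ** 3 * d,  # stage 2: n^3 triplet groups, M*d parameters
        "weight_update": M * n,              # stage 3: reweight n samples per hash function
    }


# Hypothetical sizes: 50,000 training samples, 100 centers, 64 bits, 320 dimensions.
terms = training_cost_terms(N=50_000, n=100, M=64, d=320)

# Moving from quartic to triplet groups shrinks the group count by a factor of n.
quartic_groups = 100 ** 4
triplet_groups = 100 ** 3
reduction_factor = quartic_groups // triplet_groups
```

Under these assumed sizes the gradient descent stage dominates, so the reduction from $n^4$ to $n^3$ groups is where most of the training time is saved.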
To validate the above analysis, we separately test the efficiency of online encoding procedure and offline training process in the GIST1M dataset, and the time consumed is shown in Table 4.
Generally, we consider an algorithm to have converged when its objective value remains unchanged or changes only slightly. In this paper, we define the number of triplet elements whose ordinal relation is not well preserved as the objective value. We conduct the convergence experiments on the GIST1M dataset with 50,000 training samples. As shown in Table 5, the objective value decreases as the iteration number increases. However, it changes only slightly after 700 iterations, so we consider the algorithm to have converged.

Conclusion
In this paper, we propose a novel hashing algorithm dubbed minimal residual ordinal loss hashing (MROLH). Different from traditional hashing algorithms, MROLH simultaneously learns binary codes and bitwise weights through a feedback mechanism. When the algorithm converges, the encoding results and bitwise weights are well adapted to each other. We aim to preserve the original relative similarity of data pairs in both the Hamming space and the weighted Hamming space. Furthermore, we establish the relative similarity preserving constraint based on quartic samples to enhance the power of preserving ordinal relation. During the training process, we iteratively boost