Minimal residual ordinal loss hashing with an adaptive optimization mechanism

Abstract

Binary coding techniques are widely used in approximate nearest neighbor (ANN) search tasks. Traditional hashing algorithms treat all binary bits equally, which usually causes ambiguous rankings. To solve this issue, we propose a bitwise weight method dubbed minimal residual ordinal loss hashing (MROLH). Unlike a two-step mechanism, MROLH simultaneously learns binary codes and bitwise weights through a feedback mechanism. When the algorithm converges, the binary codes and bitwise weights are well adapted to each other. Furthermore, we establish the ordinal relation preserving constraint based on quartic samples to enhance the power of preserving relative similarity. To decrease the training complexity, we utilize a tensor ordinal graph to represent the quartic ordinal relation, and the original objective function is approximated by one based on triplet samples. We also assign different weight values to training samples. During the training procedure, the weight of each data point is initialized to the same value, and we iteratively boost the weights of the data points whose relative similarity is not well preserved. As a result, we can minimize the residual ordinal loss. Experimental results on three large-scale ANN search benchmark datasets, i.e., SIFT1M, GIST1M, and Cifar10, show that the proposed method MROLH achieves superior ANN search performance in both the Hamming space and the weighted Hamming space over state-of-the-art approaches.

Introduction

The aim of hashing algorithms [1–6] is to learn binary representations of data that preserve the original similarity relationships in the Hamming space. Thus, hashing algorithms can retrieve the nearest neighbors of a query according to Hamming distances. Owing to their advantages in storage and computation, hashing algorithms have recently become popular in various computer vision and artificial intelligence applications, e.g., image retrieval, object detection, multi-task learning, linear classifier training, and active learning.

We roughly divide existing hashing algorithms into data-independent and data-dependent ones. Data-independent hashing, such as locality-sensitive hashing (LSH) [7], randomly generates hashing functions, and it typically requires long binary codes or multiple hash tables to achieve satisfactory performance. In contrast, data-dependent hashing algorithms, such as BDMFH [8] and ARE [9], utilize machine learning mechanisms to learn similarity preserving binary codes. Bidirectional discrete matrix factorization hashing (BDMFH) [8] alternates two mutually promoting processes: learning binary codes from data and recovering data from the binary codes. To make the learned binary codes inherit the intrinsic structure of the original data, BDMFH designs an inverse factorization model. The angular reconstructive embeddings (ARE) method [9] learns binary codes by minimizing the reconstruction error between the cosine similarities computed from the original data and those computed from the binary embeddings. Usually, data-dependent hashing can obtain excellent approximate nearest neighbor (ANN) search performance with compact binary codes. Furthermore, according to the similarity preserving restriction, data-dependent hashing can be divided into absolute similarity preserving hashing [10, 11] and relative similarity preserving hashing [6, 12]. The former emphasizes that the Hamming distances of similar data pairs should be small, and it is suited to the semantic neighbor search task. Relative similarity preserving hashing demands that the ranking orders of data in different spaces be consistent with each other. Thus, relative similarity preserving hashing can achieve better ANN search performance.

Traditional hashing algorithms treat each binary bit equally, which would cause an ambiguous ranking. For M-bit binary codes, there are \(C_{M}^{m}\) kinds of data sharing the same Hamming distance m to a query sample. To further explain this phenomenon, we give a simple example as in Fig. 1.

Fig. 1 The hashing functions H={h1(x),h2(x)} map the data to 2-bit binary codes. Without considering the importance of binary bits, x2 and x3 share the same Hamming distance to x1

In Fig. 1, H={h1(x),h2(x)} represents a set of linear hashing functions, which map x1, x2, and x3 to 2-bit binary codes. If each binary bit is considered equally important, x2 and x3 have the same Hamming distance to x1. As a result, x2 and x3 will be returned simultaneously when retrieving the nearest neighbors of x1 in the Hamming space. However, the similarity degrees of (x1,x2) and (x1,x3) are different in the Euclidean space. To avoid such an ambiguous situation, bitwise weight methods assign a different weight to each binary bit. Thus, data pairs with the same Hamming distance can be distinguished by their weighted Hamming distances. In Fig. 1, according to the distribution of the query x1 and the hashing functions, a larger weight value is assigned to h2(x). As a result, the weighted Hamming distance of (x1,x2) is larger than that of (x1,x3), and x3 is returned first when retrieving the nearest neighbors of x1. As described above, to further distinguish the ranking orders of data with the same Hamming distance to a query, we should take the importance of bits into consideration. However, bitwise weight methods such as QaRank [13, 14], QsRank [15], WhRank [16], and QRank [17] learn bitwise weights by a two-step mechanism. These methods first generate binary codes with an existing hashing method (e.g., LSH [7] or ITQ [10]) and then generate bitwise weights according to the learnt codes. This two-stage scheme separates the learning of binary codes from the learning of bitwise weights, so their performances cannot be iteratively boosted.
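The tie described for Fig. 1 can be reproduced with a small sketch. The 2-bit codes and the weight values below are hypothetical and chosen only so that x2 differs from x1 in bit h2 and x3 differs from x1 in bit h1; they are not taken from the figure.

```python
import numpy as np

def hamming(a, b):
    """Number of bit positions in which two {0,1} codes differ."""
    return int(np.sum(a != b))

def weighted_hamming(a, b, w):
    """Sum of the bit weights over the positions in which the codes differ."""
    return float(np.sum(w * (a != b)))

# Hypothetical 2-bit codes (bit order: h1, h2).
b1 = np.array([1, 1])      # query x1
b2 = np.array([1, 0])      # x2 differs from x1 in h2
b3 = np.array([0, 1])      # x3 differs from x1 in h1
w = np.array([0.3, 0.7])   # assumed weights, with h2 weighted more heavily

d12, d13 = hamming(b1, b2), hamming(b1, b3)
wd12, wd13 = weighted_hamming(b1, b2, w), weighted_hamming(b1, b3, w)

print("Hamming:", d12, d13)             # the two candidates tie
print("Weighted Hamming:", wd12, wd13)  # the tie is broken in favor of x3
```

With equal bit weights the two candidates cannot be ordered, whereas the weighted distance ranks x3 ahead of x2, matching the behavior described in the text.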

In this paper, we propose a novel bitwise weight method dubbed minimal residual ordinal loss hashing (MROLH), whose flowchart is shown in Fig. 2. To enhance the power of preserving relative similarity, we define the ordinal relation preserving objective function based on quartic samples in (a). In (b), we transform the constraint and utilize a tensor ordinal graph to reduce the training time. Unlike most hashing methods, we simultaneously learn the relative similarity preserving binary codes and bitwise weights with a feedback mechanism in steps (c), (e), and (f). During the iterative training process, we update the weights of the data whose relative similarity is not well preserved in steps (d) and (g), which minimizes the residual performance loss. We compare the proposed MROLH against various state-of-the-art hashing methods on three widely used benchmarks, SIFT1M [18], GIST1M [19], and Cifar10 [20]. Quantitative experiments demonstrate that our algorithm achieves the best ANN search performance in both the Hamming space and the weighted Hamming space.

Fig. 2 The flowchart of the minimal residual ordinal loss hashing

The main contributions of this paper include:

1. Both binary codes and bitwise weights are required to preserve the original relative similarity of training data, and we establish the similarity preserving constraint based on quartic samples to enhance the power of preserving ordinal relation.

2. To decrease training time complexity, we embed the quartic ordinal relationship into a triplet one and utilize a tensor product graph to approximate the ordinal set.

3. During the iterative training process, we jointly learn binary codes and bitwise weights by a feedback mechanism so that they are well adapted to each other, and we reduce the residual performance loss by boosting the weights of the data whose ordinal relation is not well preserved.

The rest of this paper is organized as follows: In Section 2, we briefly overview the relative similarity preserving hashing and the bitwise weight methods. Section 3 describes the proposed MROLH with three innovation measures. In Section 4, we show and analyze the comparative experiments on three large datasets. Finally, we conclude this paper in Section 5.

Related work

In this paper, we mainly focus on two issues: (a) How to preserve the original ordinal relation in the Hamming space and the weighted Hamming space. (b) How to guarantee bitwise weights and binary codes are well adaptive to each other.

To solve problem (a), we demand that binary codes and bitwise weights preserve the relative similarity. However, almost all of the existing relative similarity preserving restrictions are defined based on triplet samples, which leads to inferior ANN search performance. Minimal loss hashing [21] defines a hinge-like loss to penalize similar (dissimilar) data pairs with a large (small) Hamming distance, and it solves this issue by optimizing the convex-concave upper bound of the objective function using a perceptron-like learning procedure. Triplet loss hashing [22] and listwise supervision hashing [23] directly demand that the Hamming distance among similar data points be smaller than that among dissimilar data points. Ordinal preserving hashing (OPH) [12] divides all training data into different clusters, and all cluster centers are involved in computing the performance loss. However, OPH demands that the distribution of training samples be uniform. Ordinal constraint hashing (OCH) [6] aims to minimize retrieval loss by preserving ordinal relations of ranking tuples in the Hamming space. As the number of ranking tuples is quadratic or cubic in the number of training samples, it is difficult to build ranking tuples efficiently for a large-scale dataset. To fix this problem, OCH uses an embedding in which the original quartic order relation holds as a triplet order relation.

As Hamming distances are discrete integer values, many data pairs with different binary codes share the same distance value, which makes their relative similarity relationship hard to distinguish. To fix this issue, bitwise weight methods assign a different weight value to each bit. QaRank [13, 14] learns bitwise weights by minimizing the intra-class distance while preserving the inter-class relationship computed from the original training samples. The bitwise weights in QsRank [15] are learned according to the probability of mapping training samples to specified codes, and the method is designed for PCA hashing. WhRank [16] takes the distribution of query samples into consideration, which can effectively distinguish the similarity relationship among data pairs with the same binary codes. The bitwise weights in QRank [17] relate to the discriminative ability of the hashing functions and the distribution of query data. Most bitwise weight methods adopt a two-step mechanism, which first learns binary codes by a hashing method (such as LSH [7] or ITQ [10]) and then generates bitwise weights according to the learnt binary codes. As a result, the retrieval results obtained by weighted Hamming distances cannot be fed back into the procedure of learning binary codes, so binary codes and bitwise weights are not well adapted to each other, as in problem (b).

Methods

For \(x\in R^{d}\), we can map it into an M-bit binary code \(B=\{b_{1},\ldots,b_{M}\}\) by the hash functions \(H(x)=\{h_{1}(x),\ldots,h_{M}(x)\}\), and the mth bit \(b_{m}\) is calculated as \(b_{m}(x)=\text{sgn}(h_{m}(x))\). In this paper, \(h_{m}(x)\) is a linear function.
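As a minimal sketch of this encoding step, the snippet below assumes that each linear hash function has the bias-free form h_m(x) = v_m^T x and stacks the vectors v_m as the columns of a matrix V. The random V, the toy data, and the convention sgn(0) = +1 are assumptions made only for illustration.

```python
import numpy as np

def encode(X, V):
    """Map the rows of X (n x d) to M-bit codes in {-1, +1} via b_m(x) = sgn(v_m^T x)."""
    H = X @ V                       # h_m(x) = v_m^T x for every sample and every bit
    return np.where(H >= 0, 1, -1)  # sgn(0) is mapped to +1 here (an assumed convention)

rng = np.random.default_rng(0)
d, M = 8, 4
V = rng.standard_normal((d, M))     # hypothetical projection matrix, one column per bit
X = rng.standard_normal((5, d))     # five toy samples

B = encode(X, V)
print(B)
```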

Generally, Hamming distances are utilized to perform an ANN search task. However, this usually causes an ambiguous ranking order [15–17]. To avoid this situation, we learn the bitwise weights \(W(x)=\{w_{1}(x),\ldots,w_{M}(x)\}\) of data x, where \(w_{m}(x)\) represents the mth bit weight function.

In this paper, to ensure the hashing functions H(x) and bitwise weight functions W(x) have an excellent performance, we propose three innovation measures which are described in Sections 3.1, 3.2, and 3.3.

The ordinal relation preserving constraint based on quartic samples

As discussed in many previous works [6, 12], both absolute similarity preserving hashing and relative similarity preserving hashing based on triplet samples perform poorly in retrieving approximate nearest neighbors. In contrast, we demand that binary codes and bitwise weights satisfy the ordinal relation preserving constraint defined on quartic samples as in Eq. (1). It directly maximizes the number of data points in set C whose ordinal relation is well preserved.

$$\begin{array}{@{}rcl@{}} \text{max} \sum_{(x_{i},x_{j},x_{k},x_{l})\in C} I(D_{H}(x_{i},x_{j})\leq D_{H}(x_{i},x_{k})\leq D_{H}(x_{i},x_{l}))\;\, \end{array} $$
(1)

(xi,xj,xk,xl) are the quartic samples which satisfy the ordinal relationship defined in the Euclidean space. I(·) is the indicator function: it returns 1 if the condition is satisfied and 0 otherwise.

For the problem defined in Eq. (1), the primary question is how to construct the ordinal relation preserving set C. Generally, we can establish the set C by collecting similar data pairs and dissimilar ones. However, it is hard to define the similarity relationship. To fix this problem, we adopt a tensor product graph G to represent the ordinal relationship of quartic samples as below:

$$\begin{array}{@{}rcl@{}} G=S\otimes DS \end{array} $$
(2)

The definition of graph S is shown in Eq. (3), which utilizes the distance value to indicate the similarity relationship.

$$\begin{array}{@{}rcl@{}} S(i,j)=\left\{ \begin{array}{rl} 0 &{\kern5pt} {i = j}\\ e^{\frac{-\|x_{i}-x_{j}\|_{2}^{2}}{2\sigma^{2}}} &{\kern5pt} {\text{otherwise}} \end{array} \right. \end{array} $$
(3)

DS represents the dissimilar graph, and the value of DS(i,j) is computed as in Eq. (4).

$$\begin{array}{@{}rcl@{}} DS(i,j)=\frac{1}{S(i,j)} \end{array} $$
(4)

\(\otimes\) represents the Kronecker product of matrices, so \(G(ij,kl)=S(i,j)\cdot DS(k,l)\). As a result, the value in G can represent the similarity relationship of quartic samples as in Eq. (5).

$$\begin{array}{@{}rcl@{}} \left\{ \begin{array}{rl} S(i,j)>S(k,l) &{\kern5pt} G(ij,kl)>1\\ S(i,j)<S(k,l) &{\kern5pt} G(ij,kl)<1 \end{array} \right. \end{array} $$
(5)
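The construction in Eqs. (2) to (4) can be sketched numerically. By these definitions each entry of G is the ratio S(i,j)/S(k,l). The toy points, the value of sigma, and the choice to set DS(i,i) to 0 (rather than the undefined 1/0) are assumptions made only to keep the example finite; the index helper `g` is hypothetical and reflects NumPy's Kronecker layout.

```python
import numpy as np

def similarity_graph(X, sigma=1.0):
    """Eq. (3): Gaussian similarity with a zero diagonal."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    S = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)
    return S

def tensor_ordinal_graph(S):
    """Eqs. (2) and (4): G = S kron DS with DS = 1 / S off the diagonal.

    Assumption: DS(i, i) is set to 0 instead of 1 / 0 so that G stays finite.
    """
    DS = np.zeros_like(S)
    off = ~np.eye(S.shape[0], dtype=bool)
    DS[off] = 1.0 / S[off]
    return np.kron(S, DS)

# Three toy points on a line: x0 is close to x1 and far from x2.
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
S = similarity_graph(X)
G = tensor_ordinal_graph(S)
n = S.shape[0]

def g(i, j, k, l):
    """Entry G(ij, kl) = S(i, j) * DS(k, l) under NumPy's Kronecker layout."""
    return G[i * n + k, j * n + l]

print(g(0, 1, 0, 2))  # ratio S(0,1) / S(0,2): pair (0,1) is the more similar one
print(g(0, 2, 0, 1))  # ratio S(0,2) / S(0,1): the reciprocal
```

For points that are very far apart the Gaussian similarity can underflow to zero, so a practical implementation would need to guard the division; that is outside the scope of this sketch.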

As described above, the ordinal relation preserving set C can be constructed according to the tensor ordinal graph G. However, for massive samples, the construction time complexity is relatively high. So, we further transform the ordinal relation constraint as shown in Eq. (6).

$$\begin{array}{@{}rcl@{}} {\begin{aligned} &\sum_{\forall(x_{i}\neq x_{j},x_{k},x_{l})}I\left(\left(\|x_{i}-x_{j}\|_{2}^{2}-\|x_{i}-x_{k}\|_{2}^{2}\right)\right.\\ &\left.-\left(\|x_{i}-x_{j}\|_{2}^{2}-\|x_{i}-x_{l}\|_{2}^{2}\right)\right) \\ &= \sum_{\forall(x_{i}\neq x_{j},x_{k},x_{l})}I\left(\left(x_{i}^{T}x_{j}-x_{i}^{T}x_{k}\right)^{2}\,-\,\left(x_{i}^{T}x_{j}-x_{i}^{T}x_{l}\right)^{2}\right) \\ &=\sum_{\forall(x_{j},x_{k},x_{l})}I\left((x_{j}-x_{l})^{T}M(x_{j}-x_{l})\,-\,(x_{j}\,-\,x_{k})^{T}M(x_{j}\,-\,x_{k})\right) \end{aligned}} \end{array} $$
(6)

where \(M=\sum _{i}x_{i}x_{i}^{T}\) is a symmetric positive semi-definite matrix. It is therefore convenient to use SVD to decompose M with \(Z\in R^{d_{\text {svd}}\times d}\) such that \(M=Z^{T}\Lambda Z\). Then, a mapping function can be defined as \(u_{i}=Zx_{i}\in R^{d_{\text {svd}}}\), and the ordinal relation constraint can be written as in Eq. (7).

$$\begin{array}{@{}rcl@{}} & & \sum_{\forall(u_{j},u_{k},u_{l})}I\left((u_{j}-u_{l})^{T}\Lambda(u_{j}-u_{l})-(u_{j}-u_{k})^{T}\Lambda(u_{j}-u_{k})\right) \\ &\leq & \sum_{\forall(u_{j},u_{k},u_{l})}I\left(\left\|\Lambda^{\frac{1}{2}} \right\|_{2}^{2}\cdot \left(\|u_{j}-u_{l}\|_{2}^{2}-\|u_{j}-u_{k}\|_{2}^{2}\right)\right) \\ &\propto&\sum_{\forall(u_{j},u_{k},u_{l})}I\left(\|u_{j}-u_{l}\|_{2}^{2}-\|u_{j}-u_{k}\|_{2}^{2}\right) \end{array} $$
(7)
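The algebra behind the embedding step can be checked numerically. The sketch below takes M to be the d x d matrix formed by summing the outer products of the samples (an assumption about the intended definition), and it uses a symmetric eigendecomposition in place of SVD, which gives the same factorization for a symmetric positive semi-definite matrix. No rank truncation is applied, so d_svd = d here. The random data are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.standard_normal((n, d))   # rows are the samples x_i

M = X.T @ X                       # sum_i x_i x_i^T, a d x d PSD matrix
lam, Q = np.linalg.eigh(M)        # M = Q diag(lam) Q^T
Z = Q.T                           # hence M = Z^T diag(lam) Z
U = X @ Z.T                       # rows are u_i = Z x_i

j, k = 0, 1
diff_x = X[j] - X[k]
diff_u = U[j] - U[k]

lhs = np.sum((X @ diff_x) ** 2)   # sum_i (x_i^T x_j - x_i^T x_k)^2
mid = diff_x @ M @ diff_x         # (x_j - x_k)^T M (x_j - x_k)
rhs = diff_u @ (lam * diff_u)     # (u_j - u_k)^T Lambda (u_j - u_k)

print(lhs, mid, rhs)              # the three quantities agree up to rounding
```

The agreement of the three values illustrates why the sum over the reference sample x_i can be absorbed into M, which is what removes one index from the quartic relation.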

Finally, the objective function defined in Eq. (8) is utilized to learn binary codes. The set \(\hat {C}\) can be easily constructed by selecting the elements in G whose values are smaller than 1.

$$\begin{array}{@{}rcl@{}} \text{min} \sum_{(x_{i},x_{j},x_{k})\in \hat{C}}I(D_{H}(x_{i},x_{j})\geq D_{H}(x_{i},x_{k})) \end{array} $$
(8)
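For fixed codes, the quantity minimized in Eq. (8) is simply a count of violated triplets. The sketch below assumes codes in {-1, +1} and a triplet list in which x_j is meant to be closer to x_i than x_k; the toy codes and triplets are hypothetical.

```python
import numpy as np

def hamming_matrix(B):
    """Pairwise Hamming distances for codes in {-1, +1}: D = (M - B B^T) / 2."""
    M = B.shape[1]
    return (M - B @ B.T) // 2

def count_violations(B, triplets):
    """Eq. (8) objective: number of triplets (i, j, k) with D_H(i, j) >= D_H(i, k)."""
    D = hamming_matrix(B)
    return sum(int(D[i, j] >= D[i, k]) for i, j, k in triplets)

B = np.array([[ 1,  1,  1],
              [ 1,  1, -1],
              [-1, -1, -1]])

# Hypothetical triplets (i, j, k).
triplets = [(0, 1, 2), (0, 2, 1)]

violations = count_violations(B, triplets)
print(violations)
```

Here the first triplet is satisfied (distance 1 versus 3) and the second is violated, so the count is 1.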

Similarly, the ordinal relation preserving restriction for bitwise weights is re-defined as in Eq. (9).

$$\begin{array}{@{}rcl@{}} \text{min} \sum_{(x_{i}, x_{j}, x_{k})\in \hat{C}} I(D_{WH}(x_{i},x_{j}) > D_{WH}(x_{i},x_{k})) \end{array} $$
(9)

Minimal residual loss

For traditional algorithms, the weights of samples remain unchanged during the training process. As a result, each hash function and bitwise weight only tries to minimize the performance loss it induces itself, and the residual loss caused by the preceding ones is ignored. To fix this problem, we propose to iteratively boost the weights of the data whose similarity relationship is not well preserved.

Initially, we set the weight of each data point to \(\frac {1}{n}\) (n is the number of training samples), and we utilize Eq. (10) to update the weights during the training process.

$$\begin{array}{@{}rcl@{}} \pi_{m}^{r+1}(x_{i})=\pi_{m}^{r}(x_{i})\cdot \beta^{1-T(x_{i})} \end{array} $$
(10)
$$\begin{array}{@{}rcl@{}} \beta = \frac{\xi_{m}^{r}}{1-\xi_{m}^{r}} \end{array} $$
(11)

\(\pi _{m}^{r}(x_{i})\) is the weight of xi for the mth hash function or bitwise weight function during the rth training procedure. T(xi) returns 0, if the similarity relationship among xi and its nearest neighbors is preserved; otherwise, 1 is returned. The definition of \(\xi _{m}^{r}\) is shown in Eq. (12).

$$\begin{array}{@{}rcl@{}} \xi_{m}^{r} = \sum_{i=1}^{n}\pi_{m}^{r}(x_{i})T(x_{i}) \end{array} $$
(12)
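One round of the update in Eqs. (10) to (12) can be sketched as follows. The final renormalization to unit sum is an assumption borrowed from standard boosting practice and is not stated in the text; the sketch also assumes that the weighted error lies strictly between 0 and 1, since otherwise beta is undefined or zero.

```python
import numpy as np

def boost_weights(pi, T):
    """One update of Eqs. (10)-(12).

    pi : current sample weights
    T  : 0/1 array, T[i] = 1 when the ordinal relation of x_i is NOT preserved
    """
    xi = float(np.sum(pi * T))        # Eq. (12): weighted error
    beta = xi / (1.0 - xi)            # Eq. (11)
    new_pi = pi * beta ** (1 - T)     # Eq. (10): scales only the well-preserved samples
    return new_pi / new_pi.sum()      # assumed renormalization (not in the text)

n = 5
pi = np.full(n, 1.0 / n)              # uniform initialization, 1 / n
T = np.array([0, 0, 0, 1, 0])         # hypothetical: only x_3 is badly preserved

new_pi = boost_weights(pi, T)
print(new_pi)
```

When the weighted error is below 0.5, beta is smaller than 1, so well-preserved samples lose weight and the badly preserved sample gains relative weight after renormalization.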

After introducing the data weights, we separately redefine the objective functions for learning hash functions and bitwise weight functions as in Eqs. (13) and (14). \(\pi _{M}^{R}(x_{i})\) is the weight value of sample xi when the algorithm converges.

$$\begin{array}{@{}rcl@{}} \text{min} \sum_{(x_{i}, x_{j}, x_{k})\in \hat{C}} \pi_{M}^{R}(x_{i})\cdot I(D_{H}(x_{i},x_{j})\geq D_{H}(x_{i},x_{k})) \end{array} $$
(13)
$$\begin{array}{@{}rcl@{}} \text{min} \sum_{(x_{i}, x_{j}, x_{k})\in \hat{C}} \pi_{M}^{R}(x_{i})\cdot I(D_{WH}(x_{i},x_{j}) > D_{WH}(x_{i},x_{k}))\;\, \end{array} $$
(14)

Joint optimization

To make binary codes and bitwise weights well adaptive to each other, we propose a joint optimization mechanism, and the objective function is defined as in Eq. (15). During the training process, we iteratively optimize the parameters of hash functions and bitwise weight functions.

$$\begin{array}{@{}rcl@{}} \Theta &=&\text{min} \sum_{(x_{i}, x_{j}, x_{k})\in \hat{C}}\pi_{M}^{R}(x_{i})[ I(D_{H}(x_{i},x_{j})\geq D_{H}(x_{i},x_{k})) \\ &+& I(D_{WH}(x_{i},x_{j})> D_{WH}(x_{i},x_{k}))] \end{array} $$
(15)

In this paper, the sign function is utilized to generate discrete values, which makes the objective function NP-hard to optimize. To solve this issue, we adopt tanh(·) to approximate the sign(·) function. Then, the binary code is re-defined as \(B(x_{i})=\text{tanh}(V^{T}x_{i})\). Thus, we can separately compute the Hamming distance and the weighted Hamming distance by Eqs. (16) and (17). M is the number of binary bits, and \(\odot\) is the bitwise product operation.

$$\begin{array}{@{}rcl@{}} {\begin{aligned} D_{H}(x_{i},x_{j})&=\frac{1}{2}\left(M-B^{T}(x_{i})\cdot B(x_{j})\right) \\ & = \frac{1}{2}\left(M-\sum_{m=1}^{M}\left[\text{tanh}^{T}\left(v_{m}^{T} x_{i}\right)\cdot \text{tanh}\left(v_{m}^{T} x_{j}\right)\right]\right) \end{aligned}} \end{array} $$
(16)
$$\begin{array}{@{}rcl@{}} {\begin{aligned} D_{WH}(x_{i},x_{j})&=\frac{1}{2}\left(M-\left(W(x_{i})\odot B(x_{i})\right)^{T}\cdot B(x_{j})\right) \\ & = \frac{1}{2}\left(M-\sum_{m=1}^{M}\left[w_{m}(x_{i})\cdot \text{tanh}^{T}\left(v_{m}^{T} x_{i}\right)\right.\right.\\ &\left.\left. \cdot \text{tanh}\left(v_{m}^{T} x_{j}\right)\right]\right) \end{aligned}} \end{array} $$
(17)
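The relaxed distances of Eqs. (16) and (17) can be written in a few lines. The sketch assumes that the projection vectors v_m form the columns of a matrix V and that the bitwise weights of the query x_i are given as a vector; the random inputs are illustrative only.

```python
import numpy as np

def relaxed_codes(X, V):
    """B(x) = tanh(V^T x), a smooth stand-in for sgn(V^T x)."""
    return np.tanh(X @ V)

def d_h(bi, bj):
    """Eq. (16): relaxed Hamming distance."""
    return 0.5 * (bi.size - bi @ bj)

def d_wh(bi, bj, wi):
    """Eq. (17): relaxed weighted Hamming distance, using the weights of the query x_i."""
    return 0.5 * (bi.size - (wi * bi) @ bj)

rng = np.random.default_rng(0)
d, M = 6, 4
V = rng.standard_normal((d, M))         # hypothetical projections
X = rng.standard_normal((2, d))
Bx = relaxed_codes(X, V)
w = rng.uniform(0.5, 1.5, size=M)       # hypothetical bitwise weights of X[0]

print(d_h(Bx[0], Bx[1]), d_wh(Bx[0], Bx[1], w))
```

For exact codes in {-1, +1}, `d_h` reduces to the ordinary Hamming distance, and with all weights equal to 1 the two distances coincide.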

If we define \(\phi (\hat {c})\) and \(\phi (\tilde {c})\) as in Eqs. (18) and (19), the objective function can be rewritten as in Eq. (20).

$$\begin{array}{@{}rcl@{}} \phi(\hat{c})=\frac{1}{1+\text{exp}(D_{H}(x_{i},x_{k})-D_{H}(x_{i},x_{j}))} \end{array} $$
(18)
$$\begin{array}{@{}rcl@{}} \phi(\tilde{c})=\frac{1}{1+\text{exp}(D_{WH}(x_{i},x_{k})-D_{WH}(x_{i},x_{j}))} \end{array} $$
(19)
$$\begin{array}{@{}rcl@{}} \Theta =\text{min} \sum_{(x_{i},x_{j},x_{k})\in \hat{C}} \pi_{M}^{R}(x_{i})\cdot (\phi(\hat{c})+\phi(\tilde{c})) \end{array} $$
(20)
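The sigmoid surrogate of Eqs. (18) to (20) can be sketched as below, assuming that pairwise distance matrices for the Hamming and weighted Hamming distances have already been computed. The toy matrices, triplets, and sample weights are hypothetical.

```python
import numpy as np

def phi(d_ij, d_ik):
    """Eqs. (18)/(19): smooth surrogate for the indicator I(d_ij >= d_ik)."""
    return 1.0 / (1.0 + np.exp(d_ik - d_ij))

def objective(DH, DWH, triplets, pi):
    """Eq. (20): weighted sum of both surrogates over the triplets (i, j, k)."""
    total = 0.0
    for i, j, k in triplets:
        total += pi[i] * (phi(DH[i, j], DH[i, k]) + phi(DWH[i, j], DWH[i, k]))
    return total

# Hypothetical distances between three samples.
DH = np.array([[0.0, 1.0, 3.0],
               [1.0, 0.0, 2.0],
               [3.0, 2.0, 0.0]])
DWH = 0.8 * DH                     # assumed weighted distances
pi = np.full(3, 1.0 / 3.0)         # uniform sample weights

good = objective(DH, DWH, [(0, 1, 2)], pi)  # x_1 is closer to x_0 than x_2
bad = objective(DH, DWH, [(0, 2, 1)], pi)   # the reversed ordering
print(good, bad)
```

The surrogate is close to 0 when the triplet ordering is respected with a margin and close to 1 when it is violated, so the reversed triplet yields the larger loss.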

When learning the mth hash function during the rth training procedure, the partial derivative of the objective function is shown in Eq. (21).

$$\begin{array}{@{}rcl@{}} {\begin{aligned} \frac{\partial \Theta}{\partial v_{m}} &= \sum_{\hat{c}\in \hat{C}}\phi(\hat{c})(1-\phi(\hat{c}))\cdot \pi_{m}^{r}(x_{i})\\&\quad\cdot \left\{\left [\frac{\partial D_{H}(x_{i},x_{k})}{\partial v_{m}}-\frac{\partial D_{H}(x_{i},x_{j})}{\partial v_{m}}\right]\right. \\ &\quad+ \left.\left[\frac{\partial D_{WH}(x_{i},x_{k})}{\partial v_{m}}-\frac{\partial D_{WH}(x_{i},x_{j})}{\partial v_{m}}\right]\right\} \end{aligned}} \end{array} $$
(21)

For the parameter vm, the partial derivatives of the Hamming distance function and the weighted Hamming distance function can be computed as in Eqs. (22) and (23).

$$\begin{array}{@{}rcl@{}} {{}\begin{aligned} \frac{\partial D_{H}(x_{i},x_{j})}{\partial v_{m}} &\,=\,-\frac{1}{2}\left\{x_{i}\cdot \!\left[\left(1\,-\,\text{tanh}^{2}\left(v_{m}^{T} x_{i}\right)\right)\!\cdot\! \text{tanh}\left(v_{m}^{T} x_{j}\right)\!\right]^{T}\right. \\ &\,+\, x_{j} \cdot \left.\left[\left(1-\text{tanh}^{2}\left(v_{m}^{T}x_{j}\right)\right)\cdot \text{tanh}\left(v_{m}^{T}x_{i}\right)\right]^{T} \right\} \end{aligned}} \end{array} $$
(22)
$$\begin{array}{@{}rcl@{}} {\begin{aligned} \frac{\partial D_{WH}(x_{i},x_{j})}{\partial v_{m}} &=-\frac{w_{m}(x_{i})}{2}\left\{x_{i}\cdot \left[\left(1-\text{tanh}^{2}\left(v_{m}^{T} x_{i}\right)\right)\right.\right.\\& \left. \left. \cdot \text{tanh}\left(v_{m}^{T} x_{j}\right)\right]^{T}\right. \\ & + x_{j} \cdot \left.\left[\left(1-\text{tanh}^{2}\left(v_{m}^{T}x_{j}\right)\right)\cdot \text{tanh}\left(v_{m}^{T}x_{i}\right)\right]^{T} \right\} \end{aligned}} \end{array} $$
(23)

As a result, the parameter vm can be updated by Eq. (24) during the rth training procedure.

$$\begin{array}{@{}rcl@{}} v_{m} = v_{m}-\eta \frac{\partial \Theta}{\partial v_{m}} \end{array} $$
(24)

Similarly, for the parameter wm, the partial derivative of the objective function is shown in Eq. (25).

$$\begin{array}{@{}rcl@{}} {{}\begin{aligned} \frac{\partial \Theta}{\partial w_{m}} \,=\, \sum_{\tilde{c}\in \hat{C}}\phi(\tilde{c})(1\,-\,\phi(\tilde{c}))\!\cdot\! \pi_{m}^{r}(x_{i})\!\cdot\! \frac{\partial D_{WH}(x_{i},x_{k},x_{i},x_{j})}{\partial w_{m}} \end{aligned}} \end{array} $$
(25)
$$\begin{array}{@{}rcl@{}} {\begin{aligned} \frac{\partial D_{WH}(x_{i},x_{k},x_{i},x_{j})}{\partial w_{m}} &= -\frac{1}{2}\cdot \text{tanh}(v_{m}^{T}x_{i})\\&\cdot\left(\text{tanh}\left(v_{m}^{T}x_{k}\right)-\text{tanh}\left(v_{m}^{T}x_{j}\right)\right) \end{aligned}} \end{array} $$
(26)

During the iterative training procedure, we can compute the value of wm by Eq. (27).

$$\begin{array}{@{}rcl@{}} w_{m}=w_{m}-\lambda\frac{\partial \Theta}{\partial w_{m}} \end{array} $$
(27)
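The closed form in Eq. (26) can be checked against finite differences. The sketch reads D_WH(x_i,x_k,x_i,x_j) as the difference D_WH(x_i,x_k) - D_WH(x_i,x_j), which is an interpretation of the notation rather than a definition given in the text. The random data, the step size `lam_step`, and the single plain gradient step at the end (which omits the sigmoid and sample-weight factors of Eq. (25)) are assumptions for illustration.

```python
import numpy as np

def d_wh(bi, bj, w):
    """Eq. (17): relaxed weighted Hamming distance."""
    return 0.5 * (bi.size - (w * bi) @ bj)

def grad_w(bi, bj, bk):
    """Eq. (26): gradient of D_WH(x_i, x_k) - D_WH(x_i, x_j) with respect to w."""
    return -0.5 * bi * (bk - bj)

def diff(bi, bj, bk, w):
    """The quantity differentiated in Eq. (26)."""
    return d_wh(bi, bk, w) - d_wh(bi, bj, w)

rng = np.random.default_rng(0)
d, M = 6, 4
V = rng.standard_normal((d, M))
xi, xj, xk = rng.standard_normal((3, d))
bi, bj, bk = (np.tanh(x @ V) for x in (xi, xj, xk))
w = np.ones(M)

analytic = grad_w(bi, bj, bk)

# Central finite differences, one coordinate of w at a time.
eps = 1e-6
numeric = np.zeros(M)
for m in range(M):
    e = np.zeros(M)
    e[m] = eps
    numeric[m] = (diff(bi, bj, bk, w + e) - diff(bi, bj, bk, w - e)) / (2 * eps)

lam_step = 0.1                      # hypothetical learning rate
w_new = w - lam_step * analytic     # one step in the style of Eq. (27)
print(analytic, numeric)
```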

The iterative process for learning the hash functions and bitwise weight functions which can preserve the ordinal relation is described as in Algorithm 1.

Results and discussion

In this section, we describe the ANN search comparative experiments.

Experimental setting

In this paper, we conduct the comparative experiments on three large datasets, SIFT1M [18], GIST1M [19], and Cifar10 [20], which are widely used in ANN search experiments. The SIFT1M dataset contains 1 million SIFT descriptors [24] with 128 dimensions, and 100,000 of them are used as training samples. We also randomly select 10,000 features from the SIFT1M dataset as query samples. In the GIST1M dataset, there are 1 million 320-dimensional GIST descriptors [25], and we choose 50,000 and 10,000 of them as training and query samples, respectively. The Cifar10 dataset contains 60,000 GIST features with 320 dimensions, and 50,000 samples are utilized as the training set. Correspondingly, the number of query samples in the Cifar10 dataset is 10,000.

The baseline methods include two kinds of algorithms: the binary code methods and bitwise weight methods. Locality-sensitive hashing (LSH) [7], iterative quantization hashing (ITQ) [10], and k-means hashing (KMH) [11] can generate the absolute similarity preserving binary codes. In contrast, ordinal constraint hashing (OCH) [6] aims to preserve the relative similarity in the Hamming space. QRank [17] and WhRank [16] assign different weights to each binary bit, which can be applied to further boost the ANN search performance of the binary code methods.

We use mAP and recall as the criteria to evaluate the ANN search performance. As defined in Eq. (28), recall represents the fraction of the positive data that are successfully returned. Npositive is the number of positive data that are retrieved, and Nall is the number of true nearest neighbors.

$$\begin{array}{@{}rcl@{}} \text{recall} = \frac{N_{\text{positive}}}{N_{\text{all}}} \end{array} $$
(28)

The recall criterion cannot express the positions at which the positive data points are ranked. To fix this problem, the criterion of mAP defined in Eq. (29) is adopted. Here, |Q| represents the number of query samples, Ki is the number of ground-truth neighbors of the ith query sample, and rank(j) is the ranking position of the jth true positive sample in the retrieval results.

$$\begin{array}{@{}rcl@{}} mAP = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{K_{i}}\sum_{j=1}^{K_{i}}\frac{j}{\text{rank}(j)} \end{array} $$
(29)
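Both criteria are straightforward to compute from a ranked list. The sketch below follows Eqs. (28) and (29) directly; the function names and the toy ranking are hypothetical, and a true neighbor that never appears in the ranking is assumed to contribute zero to the inner sum.

```python
def recall(retrieved, ground_truth):
    """Eq. (28): fraction of the true neighbors that appear among the retrieved items."""
    gt = set(ground_truth)
    return len(gt.intersection(retrieved)) / len(gt)

def average_precision(ranking, ground_truth):
    """Inner sum of Eq. (29) for one query: mean of j / rank(j) over the true neighbors."""
    gt = set(ground_truth)
    hits, total = 0, 0.0
    for pos, item in enumerate(ranking, start=1):
        if item in gt:
            hits += 1             # this is the j-th true positive ...
            total += hits / pos   # ... found at ranking position rank(j) = pos
    return total / len(gt)

def mean_average_precision(rankings, ground_truths):
    """Eq. (29): average of the per-query values over all queries."""
    aps = [average_precision(r, g) for r, g in zip(rankings, ground_truths)]
    return sum(aps) / len(aps)

ranking = [3, 1, 2, 5, 4]         # hypothetical retrieval order for one query
truth = [1, 2]                    # its true nearest neighbors

r_at_2 = recall(ranking[:2], truth)
ap = average_precision(ranking, truth)
m_ap = mean_average_precision([ranking, [1, 2, 3]], [truth, truth])
print(r_at_2, ap, m_ap)
```

In this toy case the true neighbors sit at positions 2 and 3, so the per-query value is (1/2 + 2/3) / 2, while recall over the top two results is 1/2.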

Experimental results

In this section, the data are separately mapped into 32-, 64-, and 128-bit binary codes, and their corresponding bitwise weights are learnt.

The purpose of hashing algorithms is to guarantee that the approximate nearest neighbor retrieval results obtained in the Hamming space are identical to those in the Euclidean space. Therefore, we consider a data pair's Euclidean distance as its true similarity degree, and we define the 10 and the 100 samples with the smallest Euclidean distances to a query as its ground truth in two separate settings. We show the experimental results in Tables 1, 2, and 3, and Figs. 3, 4, and 5. In the experimental results, MROLB represents the retrieval results obtained according to the binary codes, and MROLH utilizes the bitwise weights to further improve the ANN search performance of MROLB. The experimental results show that MROLB and MROLH obtain the best ANN search performance in the Hamming space and the weighted Hamming space, respectively.

Fig. 3 The recall curves of the comparative experiments in the Cifar10 dataset. The data are separately mapped into 32-, 64-, and 128-bit binary codes in the (1), (2), and (3) rows. The numbers of true nearest neighbors in columns a and b are 10 and 100, respectively

Fig. 4 The recall curves of the comparative experiments in the GIST1M dataset. The data are separately mapped into 32-, 64-, and 128-bit binary codes in the (1), (2), and (3) rows. The numbers of true nearest neighbors in columns a and b are 10 and 100, respectively

Fig. 5 The recall curves of the comparative experiments in the SIFT1M dataset. The data are separately mapped into 32-, 64-, and 128-bit binary codes in the (1), (2), and (3) rows. The numbers of true nearest neighbors in columns a and b are 10 and 100, respectively

Table 1 The mAP(%) values of the comparative experimental results in the Cifar10 dataset
Table 2 The mAP(%) values of the comparative experimental results in the GIST1M dataset
Table 3 The mAP(%) values of the comparative experimental results in the SIFT1M dataset

LSH [7] randomly generates hashing functions without a training process, and its performance does not improve noticeably as the number of binary bits increases. ITQ [10], KMH [11], and MROLB utilize a machine learning mechanism to generate compact binary codes which can achieve satisfactory ANN search performance. ITQ [10] maps data points to the vertices of a hypercube. However, the vertices in ITQ [10] are fixed, and the encoding results do not adapt to the data distribution. To fix this problem, KMH [11] learns encoding centers by simultaneously minimizing the quantization loss and the similarity loss. LSH [7], ITQ [10], and KMH [11] belong to absolute similarity preserving hashing. In contrast, OCH [6] establishes an ordinal constraint to preserve the relative similarity among data points in the Hamming space. For the above hashing methods, the learning procedures of the binary bits are independent of each other, and the residual performance loss accumulated by former bits cannot be eliminated. To solve this problem, we propose to iteratively boost the weights of incorrectly encoded data during the training process. Furthermore, we establish an ordinal relation preserving constraint based on quartic samples, which enhances the power of preserving relative similarity.

WhRank [16] can distinguish the similarity degree among the data pairs which have the same Hamming distance. Furthermore, the bitwise weights in QRank [17] and the proposed MROLH are sensitive to query data. As a result, for data pairs with the same binary code, their similarity degree can be distinguished by QRank and MROLH. WhRank [16] and QRank [17] demand that the bitwise weights satisfy the absolute similarity preserving restriction, and they utilize fixed binary codes to learn bitwise weights. Unlike WhRank and QRank, we simultaneously learn the binary codes and bitwise weights by minimizing the ordinal relation preserving loss. As a result, MROLH can preserve the relative similarity well in both the Hamming space and the weighted Hamming space, and the ANN search performance can be iteratively boosted by the feedback mechanism.

The efficiency and convergence

An excellent hashing method should encode raw data efficiently online and have a reasonable offline training time complexity [10]. Below, we separately discuss the time complexity of all compared methods.

For online encoding of a query as an M-bit binary code, our algorithm, LSH [7], ITQ [10], OCH [6], and WhRank [16] need to compute the signs of the projections given by M linear functions, and they have the same time complexity of O(M). In contrast, KMH [11] must compute and compare the distances between a query and 2M centers, and its time complexity is O(2M). QRank [17] first transforms a query into its anchor representation and computes its similarities to 2M landmarks in O(r+2M), and then obtains query-adaptive weights by quadratic programming in polynomial time. Here, r represents the number of anchors.

For the offline training stage, LSH [7] randomly generates M linear hashing functions in constant time. QRank [17] represents 2M landmarks using r anchors in O(2Mr). The time complexity of WhRank [16] is O(Mdk), where k represents the number of nearest neighbors. ITQ [10] iteratively optimizes a rotation matrix with linear time complexity. To decrease training time complexity, OCH [6] and KMH [11] select only n (\(n\ll N\)) samples with d dimensions from all N data for their training procedures. In each iteration, KMH [11] computes and compares the distances between n data points and 2M centers, and the time complexity is O(2Mnd). In contrast, the overall training complexity of OCH [6] is \(O(tMn^{3}d+nN)\), where t is the number of iterations. For our algorithm, the training process includes three stages, and we discuss their time complexities in turn. Firstly, we adopt the k-means algorithm to select n centers from N training samples, which requires comparing the distances between each training sample and all cluster centers. Therefore, the time complexity of the first stage is O(Nnd). Secondly, we utilize a gradient descent algorithm to minimize the performance loss, and the time complexity mainly depends on the number of training groups. Initially, a training group contains quartic items, and the number of groups is \(n^{4}\). We project the original set to an approximate ordinal relation set established on triplet elements, and the number of training groups reduces to \(n^{3}\). In addition, to map d-dimensional data to M-bit binary codes, hash functions with Md parameters are learnt. As a result, the time complexity of the second stage is \(O(Mn^{3}d)\). Thirdly, to minimize the residual error, we update the weights of n training samples before learning each hashing function, and the time complexity of this stage is O(Mn). As described above, the overall training time complexity of our method is \(O(Nnd+Mn^{3}d+Mn)\).

To validate the above analysis, we separately measure the efficiency of the online encoding procedure and the offline training process on the GIST1M dataset; the time consumed is shown in Table 4.

Table 4 Comparison of time consumed (seconds) on the GIST1M dataset. We use 50,000 GIST descriptors to train the hashing functions, and map 1 million GIST descriptors to 128-bit binary codes during the online encoding procedure

Generally, we consider an algorithm to have converged when its objective value remains unchanged or changes only slightly. In this paper, we define the objective value Θ as the number of triplet elements whose ordinal relation is not well preserved. We conduct the convergence experiments on the GIST1M dataset with 50,000 training samples. As shown in Table 5, the objective value decreases as the number of iterations increases. However, it changes only slightly after 700 iterations, at which point we consider the algorithm to have converged.
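The objective value Θ and the stopping rule can be sketched as follows. This is a minimal illustration under stated assumptions: a triplet (q, i, j) is counted as violated when i is closer to q than j in the original space but its Hamming distance to q is not smaller, and convergence is declared when the relative change of Θ falls below a hypothetical tolerance `rel_tol`. Neither the exact violation rule nor the tolerance is specified in the text, and all function names are illustrative.

```python
from itertools import permutations


def hamming(a, b):
    """Number of differing bits between two equal-length codes."""
    return sum(x != y for x, y in zip(a, b))


def sq_euclidean(a, b):
    """Squared Euclidean distance in the original feature space."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def count_violated_triplets(points, codes):
    """Count triplets (q, i, j) whose ordinal relation is not preserved.

    Assumed rule: dist(q, i) < dist(q, j) but hamming(q, i) >= hamming(q, j).
    Enumerates all ordered triplets, so the cost grows as n^3.
    """
    theta = 0
    for q, i, j in permutations(range(len(points)), 3):
        if sq_euclidean(points[q], points[i]) < sq_euclidean(points[q], points[j]):
            if hamming(codes[q], codes[i]) >= hamming(codes[q], codes[j]):
                theta += 1
    return theta


def has_converged(history, rel_tol=0.01):
    """True when the last two objective values differ by at most rel_tol (relative)."""
    if len(history) < 2:
        return False
    prev, curr = history[-2], history[-1]
    if prev == 0:
        return curr == 0
    return abs(prev - curr) / prev <= rel_tol


# Hypothetical toy data: three 1-D points and two candidate 2- and 3-bit encodings.
points = [(0.0,), (1.0,), (3.0,)]
theta_tied = count_violated_triplets(points, [[0, 0], [0, 1], [1, 1]])
theta_clean = count_violated_triplets(points, [[0, 0, 0], [0, 0, 1], [1, 1, 1]])
print(theta_tied, theta_clean, has_converged([900, 500, 498]))
```

In the first encoding, the middle point is equally far in Hamming distance from both neighbors, so one triplet is counted as violated; the second encoding separates them and yields Θ = 0.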

Table 5 The objective value (×10^3) decreases as the number of iterations increases, and it converges when the number of iterations reaches 700

Conclusion

In this paper, we propose a novel hashing algorithm dubbed minimal residual ordinal loss hashing (MROLH). Unlike traditional hashing algorithms, MROLH simultaneously learns binary codes and bitwise weights through a feedback mechanism. When the algorithm converges, the encoding results and bitwise weights are well adapted to each other. We aim to preserve the original relative similarity of data pairs in both the Hamming space and the weighted Hamming space. Furthermore, we establish the relative similarity preserving constraint based on quartic samples to enhance the power of preserving ordinal relations. During the training process, we iteratively boost the weight of the data whose relative similarity is not preserved, so that the residual performance loss can be minimized in the later training procedure. Extensive experiments on three benchmark datasets demonstrate that the proposed MROLH is superior to many existing state-of-the-art approaches. In future work, we will investigate how to decrease the probability of an ambiguous ranking occurring at the top positions of the retrieval results.

Availability of data and materials

SIFT1M: [18]

GIST1M: [19]

Cifar10: [20], http://www.cs.toronto.edu/~kriz/cifar.html

Please contact the author for data requests.

Abbreviations

ITQ: Iterative quantization hash

KMH: k-means hash

LSH: Locality-sensitive hash

MROLH: Minimal residual ordinal loss hash

QRank: Query-sensitive ranking method

WhRank: Ranking based on weighted Hamming distance

References

  1. X. Luo, P. Zhang, Z. Huang, L. Nie, X. Xu, Discrete hashing with multiple supervision. IEEE Trans. Image Process. 28(6), 2962–2975 (2019).

  2. M. Hu, Y. Yang, F. Shen, N. Xie, R. Hong, H.T. Shen, Collective reconstructive embeddings for cross-modal hashing. IEEE Trans. Image Process. 28(6), 2770–2784 (2019).

  3. Y. Cui, J. Jiang, Z. Lai, Z. Hu, W. Wong, Supervised discrete discriminant hashing for image retrieval. Pattern Recog. 78, 79–90 (2018).

  4. C. Yue, B. Liu, M. Long, J. Wang, in Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Hashgan: deep learning to hash with pair conditional wasserstein gan (IEEE, Salt Lake City, 2018), pp. 1287–1296.

  5. C. Li, C. Deng, N. Li, W. Liu, X. Gao, D. Tao, in Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Self-supervised adversarial hashing networks for cross-modal retrieval (IEEE, Salt Lake City, 2018), pp. 4242–4251.

  6. H. Liu, R. Ji, J. Wang, C. Shen, Ordinal constraint binary coding for approximate nearest neighbor search. IEEE Trans. Pattern. Anal. Mach. Intell. 41(4), 941–955 (2019).

  7. M. Datar, N. Immorlica, P. Indyk, V.S. Mirrokni, in Proceedings of Twentieth Annual Symposium on Computational Geometry. Locality-sensitive hashing scheme based on p-stable distributions (ACM, Brooklyn, 2004), pp. 253–262.

  8. S. He, B. Wang, Z. Wang, Y. Yang, F. Shen, Z. Huang, H.T. Shen, Bidirectional discrete matrix factorization hashing for image search. IEEE Trans. Cybern., 1–12 (2019). https://ieeexplore.ieee.org/document/8863122.

  9. M. Hu, Y. Yang, F. Shen, N. Xie, H.T. Shen, Hashing with angular reconstructive embeddings. IEEE Trans. Image Process. 27(2), 545–555 (2018).

  10. Y. Gong, S. Lazebnik, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Iterative quantization: a procrustean approach to learning binary codes (IEEE, Colorado Springs, 2011), pp. 817–824.

  11. K. He, F. Wen, J. Sun, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. K-means hashing: an affinity-preserving quantization method for learning binary compact codes (IEEE, Portland, 2013), pp. 2938–2945.

  12. J. Wang, J. Wang, N. Yu, S. Li, in Proceedings of the 21st ACM International Conference on Multimedia. Order preserving hashing for approximate nearest neighbor search (ACM, Barcelona, 2013), pp. 133–142.

  13. Y.G. Jiang, J. Wang, S.F. Chang, in Proceedings of the 1st ACM International Conference on Multimedia Retrieval. Lost in binarization: query-adaptive ranking for similar image search with compact codes (ACM, Trento, 2011), pp. 16–1168.

  14. Y.G. Jiang, J. Wang, X. Xue, S.F. Chang, Query-adaptive image search with hash codes. IEEE Trans. Multimedia 15(2), 442–453 (2013).

  15. H.Y. Shum, L. Zhang, X. Zhang, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Qsrank: query-sensitive hash code ranking for efficient neighbor search (IEEE, Providence, 2012), pp. 2058–2065.

  16. L. Zhang, Y. Zhang, J. Tang, K. Lu, Q. Tian, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Binary code ranking with weighted hamming distance (IEEE, Portland, 2013), pp. 1586–1593.

  17. T. Ji, X. Liu, C. Deng, L. Huang, B. Lang, in Proceedings of the 22nd ACM International Conference on Multimedia. Query-adaptive hash code ranking for fast nearest neighbor search (ACM, Orlando, 2014), pp. 1005–1008.

  18. H. Jegou, M. Douze, C. Schmid, Product quantization for nearest neighbor search. IEEE Trans. Pattern. Anal. Mach. Intell. 33(1), 117–128 (2011).

  19. J. Wang, S. Kumar, S.F. Chang, Semi-supervised hashing for large-scale search. IEEE Trans. Pattern. Anal. Mach. Intell. 33(12), 2393–2406 (2012).

  20. A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech Rep (2009).

  21. M. Norouzi, D.J. Fleet, in Proceedings of the 28th International Conference on Machine Learning. Minimal loss hashing for compact binary codes (ACM, Bellevue, 2011), pp. 353–360.

  22. M. Norouzi, D.M. Blei, R. Salakhutdinov, in Proceedings of the Advances in Neural Information Processing Systems, Harrahs and Harveys, Lake Tahoe, USA. Hamming distance metric learning (Curran Associates Inc., Lake Tahoe, 2012), pp. 1070–1078.

  23. J. Wang, W. Liu, A.X. Sun, Y.G. Jiang, in Proceedings of the IEEE International Conference on Computer Vision. Learning hash codes with listwise supervision (IEEE, Sydney, 2013), pp. 3032–3039.

  24. D.G. Lowe, Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004).

  25. A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vis. 42(3), 145–175 (2001).

Acknowledgements

The authors would like to thank the editor and anonymous reviewers for their helpful comments and valuable suggestions.

Funding

This work is funded by the Natural Science Foundation of Shandong Province of China (Grant No. ZR2018PF005), and the National Natural Science Foundation of China (Grant No. 61841602).

Author information

Contributions

All authors took part in the discussion of the work described in this paper. ZW, LZ, and PL conceived and designed the experiments. ZW performed the experiments. ZW, LZ, and FS analyzed the data. ZW, LZ, and FS wrote the paper. All authors read and approved the final version of the manuscript.

Corresponding author

Correspondence to Zhen Wang.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Wang, Z., Sun, F., Zhang, L. et al. Minimal residual ordinal loss hashing with an adaptive optimization mechanism. J Image Video Proc. 2020, 10 (2020). https://doi.org/10.1186/s13640-020-00497-4

Keywords

  • Binary codes
  • Bitwise weights
  • Ordinal relation preserving
  • Joint optimization
  • Minimal residual loss