BoVW model based on adaptive local and global visual words modeling and log-based relevance feedback for semantic retrieval of the images

The accuracy of a content-based image retrieval (CBIR) system depends on how effectively it understands the visual contents of images. One of the most prominent issues affecting the performance of a CBIR system is the semantic gap, i.e., the discrepancy between the low-level patterns of an image and the high-level abstractions perceived by humans. A robust visual representation of the image combined with relevance feedback (RF) can bridge this gap by extracting distinctive local and global features from the image and by incorporating valuable information stored as feedback. To handle this issue, this article presents a novel adaptive complementary visual words integration method for a robust representation of the salient objects of the image using local and global features based on the bag-of-visual-words (BoVW) model. To analyze the performance of the proposed method, three integration methods based on the BoVW model are proposed in this article: (a) integration of complementary features before clustering (non-adaptive complementary feature integration), (b) integration of non-adaptive complementary features after clustering (non-adaptive complementary visual words integration), and (c) integration of adaptively weighted complementary features after clustering based on self-paced learning (the proposed adaptive complementary visual words integration). The performance of the proposed method is further enhanced by incorporating a log-based RF (LRF) method. Qualitative and quantitative analysis carried out on four image datasets shows that the proposed adaptive complementary visual words integration method outperforms non-adaptive complementary feature integration, non-adaptive complementary visual words integration, and state-of-the-art CBIR methods in terms of the performance evaluation metrics.


Introduction
Due to a staggering increase in globalization, communication, and advancement in technology, the world has become a global village in its true sense. Digital image libraries are expanding exponentially because of the proliferation of social media and other information-sharing mediums. Extracting meaningful information from such a huge repository requires techniques that can perform retrieval effectively and at minimum computational cost. Traditional text-based approaches retrieve images based on information that is annotated manually, which has now become impractical for such huge image repositories [1]. Another reason for opting for content-based image retrieval (CBIR) is the language dependency of textual annotations. CBIR has been a rapidly progressing area since 1990, and it retrieves images having similar contents/features, i.e., colors, shapes, and textures. It is categorized into two stages: (1) feature extraction and (2) feature matching. The purpose of the first stage is to get a feature vector that can effectively represent the visual contents of images. Features are categorized as global or local features. Global features encapsulate the characteristics of an entire image as a single vector. Even though they are robust and computationally efficient, they may overlook the spatial relationships between pixels and local details [2]. In contrast, local features preserve the local characteristics of an image, as they are extracted from patches of an image, and are considered scale- and rotation-invariant. Research has been done in the recent past to explore CBIR [3][4][5][6][7] and its applicability in different fields such as artificial intelligence (AI), human-computer interaction (HCI), and medical imaging. With the advent of deep learning approaches, research interest has now shifted towards deep features that can be learned by algorithms on their own. 
The ability of artificial neural networks to classify images either through supervised or unsupervised learning explored by Krizhevsky et al. [8] has taken the research inclination to a new dimension with their breakthrough results. Several feature descriptors are being developed for CBIR, but a selection of appropriate image representation is still challenging due to different issues such as illumination changes, viewing angles, and variation in image scale. As shown in Fig. 1, visual similarity between semantically different objects is also an intriguing issue that results in misclassification of an object, which affects the overall performance of the CBIR system. Another barrier for the retrieval system is an accurate feature matching. Most CBIR systems use similarity measures, whose performance highly depends on the selected feature descriptor and distance measure being used [10][11][12]. The research concern of today is to lessen the semantic gap concerning the images' low-level visual features and user high-level semantics to improve the accuracy of image retrieval systems. This study presents an innovative method for CBIR to advance the performance of image retrieval. The proposed methodology is categorized into two sections, namely training and testing sections. In the training section, complementary features are extracted using BFGF-HOG and GSURF feature descriptors from the images of the training group. To get optimized feature vectors, latent semantic analysis (LSA) followed by an adaptive feature weighting (AFW) method based on self-paced learning (SPL) is applied to each feature vector. Afterward, a visual vocabulary is constructed by applying adaptive fuzzy k-means (AFKM) clustering on each optimized feature vector, which represents the contents of the images in a more compact form. 
These two visual vocabularies are concatenated to get a resultant visual vocabulary that contains complementary features of both descriptors, which is termed as an adaptive complementary visual word integration in the proposed method of CBIR. In the next step, a histogram is formed using visual words of each image from the resultant complementary visual vocabulary. These histograms along with training labels are used as an input to quadratic kernel-based support vector machine (QSVM) for classification. In the testing section, the aforementioned steps are carried out on a query image taken from the testing group of images, which outputs a histogram-based visual representation of a query image. Afterward, a relevance score is computed between images residing in datasets and query image by applying Euclidean distance. For further improving the performance of image retrieval, the proposed method also uses a log-based relevance feedback mechanism.
The major contributions of this study are as follows:
a. An innovative image representation method by integrating adaptive local and global visual words along with log-based relevance feedback based on the BoVW model.
b. Non-adaptive complementary visual words integration for the principal objects of the images based on the BoVW model.
c. Non-adaptive complementary feature integration for the principal objects of the image based on the BoVW model.
The remaining sections of this paper are structured as follows: Section 2 provides a detailed review of the relevant CBIR methods. Section 3 presents a detailed methodology of the proposed method. Section 4 provides detail of the experimental parameters for performance evaluations of the proposed method along with experimental results and discussion. Section 5 presents the conclusion and future directions of the research work.

Literature review
Numerous techniques have been developed to efficiently and effectively retrieve images from repositories that hold an immense and diverse collection of images from users around the globe. This work has opened up a field that enables computers to understand or learn, so that they can compete with the human brain; in short, it works towards AI and goes as deep as imitating the working of neurons. CBIR has gained immense recognition in the recent past and motivates researchers to devise new techniques that recognize objects or areas under consideration with the highest possible accuracy.
Singh et al. [2] presented a novel low-dimensional color texture descriptor named local binary pattern for color images (LBPC). A plane is used for the 3-dimensional RGB color space. The LBP of color pixels is selected across a circularly symmetric neighborhood lying within radius 2. Pixels having values above the plane are labeled 1 and those below the plane are labeled 0. A combination of the hue component of the HSI color space with LBP, i.e., LBPH, and its fusion with the color histogram (CH) are also analyzed to improve the discriminative capability of the descriptor. To further reduce the dimension of the proposed descriptor, a uniform pattern with 59 bins is also calculated. In terms of performance, a fusion of LBPC, LBPH, and CH achieves better retrieval accuracy when the intra-class variation is highest. Meanwhile, uniform patterns of the proposed descriptor achieve somewhat similar retrieval accuracy at a lower computational cost. LBPs for multi-channel color images are mostly calculated individually for each channel, which results in loss of cross-channel information and higher computational cost. Misale et al. [10] presented an efficient CBIR system based on local tetra pattern (LTrP) features and the bag-of-words (BoW) model. Initially, interest points are detected through SURF, and features are extracted locally through LTrP. Dataset images are split in a 33:33:34 ratio for training, validation, and testing, respectively. In the testing phase, a trained neural network is employed to classify images according to semantic categories. The proposed approach shows better retrieval accuracy and reduced computational expense. A novel feature descriptor called the multi-trend binary code descriptor (MTBCD) is proposed by Yu et al. [13], which addresses some of the common issues faced by local feature descriptors in CBIR, such as the change in pixel patterns, the semantic gap, and the lack of spatial information. 
The MTBCD descriptor works on the intensity component of the HSV model and identifies a change in trend among pixels along with four symmetrical directions (0°, 45°, 90°, 135°). The change in trend is classified as parallel, if the values of pixels within an assigned radius are in increasing or decreasing order, and as non-parallel, if values are equal or greater/smaller than the center pixels. To preserve the spatial relation among pixels, a co-occurrence matrix is also constructed. Experimental analysis depicts robustness of this framework against competitive methods.
Mistry et al. [14] designed and developed a robust CBIR system by integrating various spatial and frequency-based features. This method uses color moments, the auto-correlogram, and the HSV histogram as spatial features, and stationary and Gabor wavelet transforms as frequency domain features. Apart from these, the approach also combines features extracted through the color and edge directivity descriptor (CEDD) and the binarized statistical image features (BSIF) descriptor. Feature vectors of 6-D and 64-D are extracted in the case of color moments and the color auto-correlogram, respectively. For the CEDD-BSIF feature set, 144-D CEDD and 256-D BSIF feature vectors are generated. Frequency domain features lead to better accuracy than spatial domain features when city block and Euclidean distances are utilized for measuring similarity, while CEDD and BSIF features achieve the highest precision among all. However, this method is computationally expensive because of the high-dimensional feature vector. An innovative technique based on spatial histograms (spatiograms) is presented by Zeng et al. [15] to address issues faced by generalized histograms in CBIR, i.e., loss of spatial information, high dimensionality, and the semantic gap. It quantizes the color space by using a Gaussian mixture model (GMM) learned through the expectation maximization-Bayesian information criterion (EM-BIC) algorithm, which automatically identifies the number of Gaussians (color bins) and associates pixels with multiple bins based on probability. Spatiograms are computed and incorporated with the GMM. For determining the distance between spatiograms, a new measure based on Jensen-Shannon (JS) divergence is also proposed in this method. The experimental analysis highlights the robustness of the method for image retrieval. Roy et al. [16] presented a novel and highly discriminative rotation-invariant texture descriptor named local directional zigzag pattern (LDZP). 
The proposed framework first reduces the noise of textured images by generating a local directional edge map (LDEM) through the Kirsch compass mask along 6 directions from 0° to 150° with a 30° interval. Zigzag patterns and corresponding uniform histograms are extracted from each LDEM and concatenated to obtain rotation invariance. In terms of performance, LDZP efficiently encodes recurrent changes in local texture patterns and, because of its zigzag sampling structure, has better texture classification accuracy than LBP, which suffers from unreliable texture information because of its circular sampling structure. Amato et al. [17] investigated the application of aggregation methods to binary local features and presented a CBIR method based on Fisher kernels, Bernoulli mixture models, and CNN. The method is two times faster in extracting binary features as compared to the traditional SIFT method and can be used as an alternative to direct matching in CBIR. The information obtained from images may be insufficient to build a feature vector, so Li et al. [18] suggested a re-ranking mechanism called discriminative multi-view interactive image re-ranking (DMINTIR) that integrates relevance feedback with complementary features. The feature set is encoded by utilizing neural code, VLAD+, and triangulation embedding. The proposed mechanism re-orders the images based on updated scores obtained through a learned weight vector. To maximize precision, a new similarity learning method named maximum top precision similarity (MTPS) for the CBIR system is proposed [19]. The precision achieved after initial retrievals can be maximized by tuning the parameters of the similarity function. For that, the similarity function is designed as a linear function and modeled with a hinge loss; the squared Frobenius norm for each query is minimized to prevent overfitting. The experimental evaluation highlighted a shorter running time. Similarity measures have been evaluated in detail in [20]. 
The study concluded by suggesting a new matching measure by integrating relevance feedback and sequential forward selector.
Retrieving images based on regions usually results in the repetitive matching of similar regions and loss of spatial information. To overcome this issue, Meng et al. [21] presented a novel method for extracting and matching regions. Firstly, segments are identified and merged using statistical region merging and affinity propagation (SRM-AP). Instead of incorporating local descriptors, the method utilizes a CNN-based feature extraction method named as regional convolution mapping feature (RCMF) to preserve the spatial layout of the key objects of the image. Layer 5 of VGGNet19 is used as a feature layer, which outputs a 256-dimensional feature vector. For effective image representation, a number of regions and their locations are also incorporated with the RCMF method. Images are matched based on integrated category matching (ICM), which utilizes centroids rather than area or center-based methods. The method exhibits superior performance against benchmark methods but suffers from higher dimensionality of the feature vector. Another retrieval method based on the region is presented by Song et al. [22]. In this method of CBIR, the foreground and background parts of the HSV color space image are segmented by applying the Otsu algorithm. For extracting color, the hue component is quantized into 3 bins and the saturation component is quantized into 2 bins. The intensity component (V) of HSV space is utilized to generate diagonal texture structure descriptor (DTSD), which efficiently describes the edges and preserves spatial resolution and finer details of an image. The DTSD treats an image as a 4 × 4 grid and computes the difference between the center and neighboring pixels. Afterward, diagonal pixels are multiplied and evaluated based on a threshold. The resultant matrix is weighted, and values are accumulated to represent diagonal texture structure. The histograms of three components of both regions combinedly form a feature vector. 
In terms of performance, this method surpassed many competitive methods. A hybrid method for region-based image retrieval is presented by Ahmed et al. [23], which integrates local and global features for effective image representation. In this method, interest points of the image are assembled using the connected stable regions method and described using the histogram of oriented gradients. For extracting texture, uniform local binary patterns are used. The resultant higher-dimensional features are transformed into compact vectors by applying the principal component analysis (PCA) method. Experimental analysis shows improved accuracy as compared with competitive CBIR methods. Other than the semantic gap, two major setbacks for CBIR are edge-based object identification, which uses only edges to differentiate objects having visually similar content, and the spatial invariance problem, which arises because of the varied spatial position of objects within images. Pradhan et al. [24] addressed these problems by incorporating a color edge map for extracting color and shape features simultaneously and a novel image block re-ordering method based on texture direction. Initially, foreground and background regions are extracted through saliency maps. Edges from the foreground part are first extracted through a combined edge map (Canny edge, fuzzy edge) and later through a color edge map by accumulating the pixels into 9 groups based on orientations. For texture, the Y component of the YCbCr color space is divided into 24 non-overlapping blocks and rearranged using the principal texture direction, which is based on the largest eigenvalue of the intensity covariance matrix. In terms of performance, this rearrangement scheme resulted in better retrieval accuracy because the objects within images became more comparable to each other irrespective of their position. A compact summary of the competitive CBIR methods is presented in Table 1.

Methodology
In this section, the methodology of the proposed method is presented in detail. The proposed method adopts the BoVW model, which has been one of the most dominant and frequently used methods for classifying and retrieving images. The BoVW model (as shown in Fig. 2) works in the following steps: (1) features are extracted from the images; (2) the extracted features are quantized into clusters by applying clustering algorithms, and each cluster head is then termed a visual word, which accumulates into a visual vocabulary or codebook; (3) for each image, a signature is formed by representing visual words in terms of a histogram; (4) histograms are normalized to retain fine details; and (5) these signatures are then fed into the classifier for training purposes. Despite exhibiting remarkable performance in several image retrieval applications [36][37][38], the BoVW model still has certain limitations that need to be addressed, i.e., lack of spatial information, extraction of redundant and insignificant features (background regions), and, most importantly, the lack of an effective and efficient feature representation and feature weighting method, as some features are of greater importance than others. The proposed method of image retrieval addresses the aforementioned issues of the BoVW model to improve the performance of image retrieval. Each module of the training and testing sections of the proposed method is discussed in the following sections, and its complete framework is shown in Fig. 3.

The training section of the methodology
This section presents the detail of the different modules of the proposed method, which are complementary feature extraction, adaptive feature weighting, clustering, histogram formation, and image classification. The detail of these modules is presented in the following subsequent sections.

Feature extraction using BFGF-HOG descriptor
This step comprises extracting features from each image by using the BFGF-HOG descriptor, which is a variant of the HOG descriptor. The HOG descriptor [39] has been used widely in machine vision tasks for detecting objects within images, humans, etc. It is a window-based descriptor and works by capturing the edge directions or local intensity gradients. A window is focused on interest points and partitioned into n × n cells.
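For each pixel in a cell, the gradient direction θ(x, y) and magnitude M(x, y) are mathematically calculated as follows:
$$\theta(x, y) = \tan^{-1}\left(\frac{G_y(x, y)}{G_x(x, y)}\right)$$
$$M(x, y) = \sqrt{G_x(x, y)^2 + G_y(x, y)^2}$$
where $G_x$ and $G_y$ denote the horizontal and vertical intensity gradients at pixel (x, y). The computed gradient directions for each pixel are then quantized into a 9-bin histogram of 45°, and the corresponding magnitudes are accumulated. The contrast of the resultant histogram is normalized to achieve illumination invariance.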
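The per-cell computation described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function name `cell_orientation_histogram`, the central-difference gradients, the unsigned 0° to 180° orientation range with equal-width bins, and the L2 normalization are all assumptions made here.

```python
import math


def cell_orientation_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell.

    `cell` is a 2-D list of intensities. Central differences, the 0-180 degree
    range, and L2 normalization are assumptions of this sketch.
    """
    rows, cols = len(cell), len(cell[0])
    bin_width = 180.0 / n_bins
    hist = [0.0] * n_bins
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # horizontal gradient G_x
            gy = cell[y + 1][x] - cell[y - 1][x]  # vertical gradient G_y
            magnitude = math.hypot(gx, gy)        # M(x, y)
            theta = math.degrees(math.atan2(gy, gx)) % 180.0  # theta(x, y)
            hist[int(theta // bin_width) % n_bins] += magnitude
    # Contrast normalization, which gives some robustness to illumination.
    norm = math.sqrt(sum(h * h for h in hist)) + 1e-12
    return [h / norm for h in hist]
```

For a cell whose intensity increases only from left to right, all of the gradient energy falls into the first bin; for a top-to-bottom ramp it falls into the bin that contains 90°.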
Given an image I(x, y), a non-iterative bilateral filter (BF), which efficiently preserves edges, is applied. The bilateral filter is an alternative to low-pass filters, which reduce noise but also blur edges. To overcome this, BF computes weighted averages like low-pass filters but utilizes geometric closeness (spatial) as well as photometric similarity between a center pixel c and its neighboring pixels k to calculate the weights. Mathematically, it is expressed as follows:
$$BF[I](c) = \frac{1}{N} \sum_{k} g(\lVert k - c \rVert)\, s(\lvert I(k) - I(c) \rvert)\, I(k)$$
where N is a normalization constant, $g(\lVert k - c \rVert)$ weights the geometric closeness, and $s(\lvert I(k) - I(c) \rvert)$ weights the similarity between the center pixel and its neighbors.
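A minimal sketch of such an edge-preserving weighted average is given below. It assumes Gaussian kernels for both the spatial and the photometric term; the function name `bilateral_filter` and the parameters `radius`, `sigma_s`, and `sigma_r` are illustrative choices that do not come from the paper.

```python
import math


def bilateral_filter(image, radius=2, sigma_s=2.0, sigma_r=25.0):
    """Brute-force bilateral filter on a 2-D list of intensities.

    Both weighting functions are assumed to be Gaussians: sigma_s controls
    geometric closeness, sigma_r controls photometric similarity.
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for cy in range(rows):
        for cx in range(cols):
            total, norm = 0.0, 0.0
            for ky in range(max(0, cy - radius), min(rows, cy + radius + 1)):
                for kx in range(max(0, cx - radius), min(cols, cx + radius + 1)):
                    dist_sq = (ky - cy) ** 2 + (kx - cx) ** 2
                    spatial = math.exp(-dist_sq / (2.0 * sigma_s ** 2))
                    diff = image[ky][kx] - image[cy][cx]
                    photometric = math.exp(-(diff ** 2) / (2.0 * sigma_r ** 2))
                    weight = spatial * photometric
                    total += weight * image[ky][kx]
                    norm += weight  # accumulates the normalization constant N
            out[cy][cx] = total / norm
    return out
```

Because pixels across a strong intensity step receive a near-zero photometric weight, a sharp edge survives the smoothing almost unchanged, which is the property the descriptor relies on.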
After that, the feature vector of the BF-based GF-HOG feature descriptor is computed, which represents image structure as a dense gradient field (GF) interpolated from neighboring sparse edge pixels. Beginning with the binary Canny edge map $I_e$, edge orientations and magnitudes are calculated. Pixels having smaller magnitudes are discarded to obtain a set of sparse orientation edge pixels $S = \{\theta(x, y) \mid M(x, y) > t\}$ against a certain threshold t. The gradient field $G \in \mathbb{R}^2$ is a dense orientation field interpolated from the sparse set S. The smoothness of the dense gradient field is enforced by solving the Poisson equation with Dirichlet boundary conditions. The Poisson solver approximates ΔG = 0 by using a 3 × 3 Laplacian window, which results in a linear equation (Eq. (5)) with Dirichlet boundary conditions (Eq. (6)).
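To make the interpolation step concrete, the following sketch fills the unknown pixels of a small grid so that each one equals the mean of its 4-connected neighbours (a discrete form of ΔG = 0), while the known sparse pixels stay fixed (the Dirichlet condition). It is an assumption-laden illustration rather than the paper's solver: it uses simple Gauss-Seidel sweeps, averages values directly without handling angle wrap-around, and the names `interpolate_dense_field`, `sparse`, and `known` are hypothetical.

```python
def interpolate_dense_field(sparse, known, n_iter=500):
    """Interpolate a dense field from sparse fixed values.

    sparse: 2-D list with values at known pixels (other entries are ignored).
    known:  2-D list of booleans marking the fixed (Dirichlet) pixels.
    """
    rows, cols = len(sparse), len(sparse[0])
    field = [[sparse[y][x] if known[y][x] else 0.0 for x in range(cols)]
             for y in range(rows)]
    for _ in range(n_iter):
        for y in range(rows):
            for x in range(cols):
                if known[y][x]:
                    continue  # fixed edge pixel: value is never overwritten
                neighbours = [field[ny][nx]
                              for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                              if 0 <= ny < rows and 0 <= nx < cols]
                # Discrete Laplace equation: value equals the neighbour mean.
                field[y][x] = sum(neighbours) / len(neighbours)
    return field
```

With the left column fixed at 0 and the right column fixed at 4 on a five-column grid, the interior converges to a linear ramp, which is the smooth field one would expect between two edges.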
After detecting keypoints by applying a Hessian detector on each image, a histogram of gradients (detailed earlier) is calculated over the dense gradient field G, and the range of orientations is quantized into m bins. The resultant vector is $mn^2$-dimensional for the entire window. The resultant feature vector of the BFGF-HOG descriptor is 64 × J dimensional, where J represents the number of interest points, which are selected automatically by the descriptor depending upon the contents of the image. It is mathematically expressed as follows:
$$F_a = \{a_{1d}, a_{2d}, a_{3d}, \ldots, a_{nd}\} \quad (7)$$
where $a_{1d}$ to $a_{nd}$ are the image descriptors of the BFGF-HOG feature vector.

Feature extraction using Gauge SURF descriptor
This step comprises extracting features by applying the Gauge SURF (GSURF) descriptor to each image. To locally adapt the blur within a region and to retain fine details or edges, the GSURF [40] feature descriptor utilizes gauge coordinates. Instead of using first-order derivatives, GSURF detects keypoints from multiscale images using the determinant of the Hessian matrix. The Hessian matrix is the result of convolving an integral image with second-order partial derivatives of a Gaussian to obtain a maximum gradient. Given an image I(x, y), the Hessian matrix H(z, σ) at point z(x, y) and scale parameter σ is mathematically defined as follows:
$$H(z, \sigma) = \begin{bmatrix} L_{xx}(z, \sigma) & L_{xy}(z, \sigma) \\ L_{xy}(z, \sigma) & L_{yy}(z, \sigma) \end{bmatrix}$$
where $L_{xx}$ is the convolution of the second-order Gaussian derivative with image I at point z and is calculated as follows:
$$L_{xx}(z, \sigma) = I(z) * \frac{\partial^2 g(\sigma)}{\partial x^2}$$
and similarly $L_{yy}(z, \sigma) = I(z) * \frac{\partial^2 g(\sigma)}{\partial y^2}$ and $L_{xy}(z, \sigma) = I(z) * \frac{\partial^2 g(\sigma)}{\partial x\, \partial y}$. The motivation behind using gauge coordinates is their ability to describe each pixel in an image by its 2D local structure. Even if an image is rotated, the structure will remain the same. Gauge coordinates comprise a gradient vector $\vec{w}$ and its perpendicular vector $\vec{v}$, which are mathematically defined as follows:
$$\vec{w} = \frac{1}{\sqrt{L_x^2 + L_y^2}} \left(L_x, L_y\right), \qquad \vec{v} = \frac{1}{\sqrt{L_x^2 + L_y^2}} \left(L_y, -L_x\right)$$
where L denotes the convolution of image I with a Gaussian kernel having σ as scale parameter, i.e., L(x, y, σ) = I(x, y) * g(x, y, σ).
Derivatives of any scale and order can be obtained using these coordinates. Second-order derivatives in these coordinates are of special interest; they are obtained from the responses in the horizontal and vertical directions, i.e., $L_x$, $L_y$, $L_{xx}$, $L_{yy}$, $L_{xy}$, which are calculated over a 20 × 20 region. The 20 × 20 window is further subdivided into 4 × 4 sub-blocks without any overlap, and a Haar wavelet of size 2σ is calculated. After fixing the gauge coordinates for each of these pixels, the gauge invariants $|L_{ww}|$, $|L_{vv}|$ are computed. The parameters of the GSURF descriptor are mathematically defined as follows:
$$L_{ww} = \frac{L_x^2 L_{xx} + 2 L_x L_{xy} L_y + L_y^2 L_{yy}}{L_x^2 + L_y^2}$$
$$L_{vv} = \frac{L_y^2 L_{xx} - 2 L_x L_{xy} L_y + L_x^2 L_{yy}}{L_x^2 + L_y^2}$$
The resultant feature descriptor for each sub-region is a four-dimensional vector $V_d = (\sum L_{ww}, \sum L_{vv}, \sum |L_{ww}|, \sum |L_{vv}|)$. The resultant feature vector is 64 × J dimensional, where J represents the number of interest points, which are chosen automatically by the descriptor depending upon the contents of the image. Mathematically, it can be expressed as follows:
$$F_b = \{b_{1d}, b_{2d}, b_{3d}, \ldots, b_{nd}\} \quad (13)$$
where $b_{1d}$ to $b_{nd}$ are the feature descriptors of the GSURF descriptor.
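As a small illustration of how the gauge invariants follow from the Cartesian derivative responses, the sketch below evaluates the standard second-order gauge derivatives for one sample point and accumulates the four sums for one sub-region. The function names, the tuple layout of the samples, and the choice to return zeros where the gradient vanishes are assumptions of this sketch.

```python
def gauge_second_derivatives(lx, ly, lxx, lxy, lyy, eps=1e-12):
    """Return (L_ww, L_vv) from first- and second-order Cartesian derivatives.

    Where the gradient magnitude is (near) zero the gauge frame is undefined;
    returning zeros in that case is an assumption of this sketch.
    """
    grad_sq = lx * lx + ly * ly
    if grad_sq < eps:
        return 0.0, 0.0
    lww = (lx * lx * lxx + 2.0 * lx * lxy * ly + ly * ly * lyy) / grad_sq
    lvv = (ly * ly * lxx - 2.0 * lx * lxy * ly + lx * lx * lyy) / grad_sq
    return lww, lvv


def subregion_descriptor(samples):
    """Accumulate (sum L_ww, sum L_vv, sum |L_ww|, sum |L_vv|) over one sub-region.

    `samples` is an iterable of (lx, ly, lxx, lxy, lyy) tuples.
    """
    s_ww = s_vv = s_abs_ww = s_abs_vv = 0.0
    for lx, ly, lxx, lxy, lyy in samples:
        lww, lvv = gauge_second_derivatives(lx, ly, lxx, lxy, lyy)
        s_ww += lww
        s_vv += lvv
        s_abs_ww += abs(lww)
        s_abs_vv += abs(lvv)
    return s_ww, s_vv, s_abs_ww, s_abs_vv
```

A quick sanity check on these formulas is that L_ww + L_vv always equals the Laplacian L_xx + L_yy, which does not depend on the orientation of the coordinate frame. Sixteen such four-dimensional sub-region vectors give the 64 dimensions per interest point mentioned above.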
To detect objects within images, their location and the spatial orientation of their edges are of high significance. Using the HOG descriptor to extract such information results in poor performance because of the difficulty in selecting an appropriate window size, as the window captures either too much or too little of the local edge structure. Similarly, the standard SURF descriptor utilizes the Gaussian scale space, which incorporates blurring as a pre-processing step to remove noise. However, this step also removes structural details such as edges. Therefore, a fusion of adaptive complementary visual words obtained through the bilateral filter (BF)-based gradient field HOG [25] and gauge SURF descriptors is proposed in this article to overcome these issues. In the next two steps, features from both descriptors are weighted for optimal feature selection, which can reduce training time (computational cost) and improve the performance of the proposed method.

Latent semantic analysis as a dimension reduction mechanism
The feature vectors extracted in the previous steps exhibit high dimensionality, which makes it difficult to construct a compact feature representation of the image, as there exist redundancy and multiple correlations among certain feature points. To get robust and discriminative features, a latent semantic analysis (LSA) method is applied to each feature vector to easily perceive and preserve data while reducing storage and computational cost. Deerwester et al. [41] applied this method to document retrieval systems; it is based on a singular value decomposition (SVD) mechanism. The proposed method uses LSA to construct a term-context matrix A of dimension r × q for each extracted feature vector, which highlights the hidden relationship among semantically similar images. In the case of the proposed method of CBIR, each column of A represents a resultant feature vector (i.e., $F_a$, defined in Eq. (7), in the case of the BFGF-HOG descriptor, and $F_b$, defined in Eq. (13), in the case of the GSURF descriptor), and each entry relates the r-th term to the q-th context. The key step of LSA is SVD, which decomposes the high-dimensional term-context matrix A into three matrices U, Z, and V of smaller dimensions d, represented mathematically as follows:
$$A = U Z V^{T}$$
where U and V are orthogonal matrices and Z is a diagonal matrix. The columns of U and V contain the orthonormal eigenvectors of $AA^{T}$ and $A^{T}A$, respectively, while the diagonal matrix contains the singular values, which are the square roots of the eigenvalues of $AA^{T}$ or $A^{T}A$. The values of the diagonal matrix Z are sorted in descending order, so the significant information can be retained by keeping the higher values while eliminating the lower values/noise. For dimension d, the reduced matrix can then be represented as follows:
$$A_d = U_d Z_d V_d^{T}$$
In the next step, reduced features from both descriptors are weighted for optimal feature selection.
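The truncation step can be illustrated with NumPy's SVD routine. This is a minimal sketch under the assumption that the term-context matrix is available as a dense array; the function name `lsa_reduce` is hypothetical, and the choice of d is left to the caller because the value used in the paper is not stated in this section.

```python
import numpy as np


def lsa_reduce(A, d):
    """Rank-d approximation of A obtained by keeping the d largest singular values."""
    # numpy returns the singular values already sorted in descending order.
    U, z, Vt = np.linalg.svd(A, full_matrices=False)
    U_d, z_d, Vt_d = U[:, :d], z[:d], Vt[:d, :]
    A_d = U_d @ np.diag(z_d) @ Vt_d  # A_d = U_d Z_d V_d^T
    return A_d, z_d
```

For a matrix that is exactly rank one, keeping a single singular value reproduces the matrix without loss; for real descriptor matrices the discarded small singular values are the part treated as noise.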

Adaptive feature weighting based on self-paced learning
In computer vision-based applications, some features of the image are more significant than others. The proposed method applies an adaptive feature weighting method to each reduced-size feature vector to classify features as significant or insignificant based on the self-paced learning (SPL) method [42]. SPL dynamically picks features and learns in an easy-to-hard fashion. Given a matrix of extracted LSA features $X = [X_1, X_2, X_3, \ldots, X_n]$ (where X is obtained from the reduced matrix $A_d$) and y as the corresponding class label, the objective function of SPL is defined in Eq. (16), where t and λ(t) denote the representation coefficient and regularization parameter, respectively. A weight variable w is added to Eq. (16) to assign a higher or lower weight to each feature categorized as easy or hard. Equation (16) is thereby transformed into a weighted form in which γ is the learning parameter that controls the selection of learning samples, $X_i$ is the vector of the i-th training feature, and $y_i$ is the i-th feature of a test sample. The value of l is higher for the initial learning sample, which yields smaller losses, and decreases gradually when hard samples are selected. The process continues until all the samples are selected. The features are selected by thresholding the loss $f_i = (y_i - X_i \alpha)^2$. In the next step, the feature vectors produced by adaptive feature weighting are clustered separately using an adaptive fuzzy k-means clustering algorithm, whose details are provided in the following section. The framework of the first competitive method, the non-adaptive complementary features integration method, is shown in Fig. 4. In the case of the non-adaptive complementary visual words integration method (the second competitive method), the framework is the same as shown in Fig. 3, except that it does not use adaptive feature weighting (AFW), so that its image retrieval performance can be analyzed without it.
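The easy-to-hard selection can be illustrated with the hard-threshold rule that is common in the self-paced learning literature: a sample receives weight 1 when its loss is below the current threshold and weight 0 otherwise, and the threshold is relaxed until every sample is selected. Whether the paper uses exactly this rule is not recoverable from the text, so the sketch below is an assumption; for brevity it also keeps the losses fixed, whereas a full SPL loop would refit the model between rounds. The names `spl_weights`, `self_paced_schedule`, and `growth` are hypothetical.

```python
def spl_weights(losses, gamma):
    """Hard-threshold weights: 1 for easy samples (loss < gamma), else 0."""
    return [1 if f < gamma else 0 for f in losses]


def self_paced_schedule(losses, gamma=0.1, growth=2.0, max_rounds=50):
    """Relax the threshold round by round until all samples are selected.

    Losses are held fixed here (an assumption); in practice they would be
    recomputed as f_i = (y_i - X_i alpha)^2 after refitting alpha each round.
    """
    schedule = []
    for _ in range(max_rounds):
        weights = spl_weights(losses, gamma)
        schedule.append((gamma, weights))
        if all(weights):
            break
        gamma *= growth  # admit harder samples in the next round
    return schedule
```

With losses of 0.05, 0.3, and 1.0, the first round selects only the easiest sample, and the hardest one enters once the threshold has grown past 1.0.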

Adaptive fuzzy k-means clustering for complementary visual vocabulary formation
In this step, the visual vocabulary is built by applying adaptive fuzzy k-means (AFKM) clustering on the optimized adaptive features of the BFGF-HOG and GSURF descriptors over all of the training images. AFKM clustering is an improved version of the k-means clustering algorithm, which is one of the most frequently used unsupervised, nondeterministic, and iterative clustering algorithms. However, the initialization of the cluster centers, the choice of the number of clusters, and the sensitivity to noise and outliers are some of the shortcomings of the standard k-means algorithm. To overcome these issues, the proposed method of CBIR uses the AFKM clustering algorithm [43]. It is a combination of the moving k-means (MKM) [44] and fuzzy c-means (FCM) [45] clustering algorithms. MKM clustering contributes the assignment of data to its closest center, and FCM allows data to belong to two or more clusters. For a point x and cluster center c, the objective function of AFKM clustering is calculated as follows:
$$J = \sum_{j} \sum_{i} E_{ij}^{m} \lVert x_i - c_j \rVert^2 \quad (19)$$
where $E_{ij}^{m}$ represents the fuzzy membership function and m represents the fuzziness exponent. The degree of membership in a specific cluster is inversely related to the distance to its center. The new position of each centroid is calculated as follows:
$$c_j = \frac{\sum_{i} E_{ij}^{m} x_i}{\sum_{i} E_{ij}^{m}}$$
In AFKM clustering, the concept of belongingness is introduced to improve clustering. The belongingness estimates the relationship between a cluster center and its members, and a degree of belonging is calculated for each cluster. The proposed method of CBIR minimizes the AFKM objective function defined in Eq. (19). In AFKM, clustering is performed iteratively until the centers have converged and all data have been considered. The cluster heads of the formed clusters are then termed visual words, which are grouped to form a visual vocabulary. 
The proposed method of image retrieval formulates two visual vocabularies: $W_A = \{a_1, a_2, a_3, \cdots, a_i\}$, where $a_1$ to $a_i$ represent the visual words of the BFGF-HOG feature vector, and $W_B = \{b_1, b_2, b_3, \cdots, b_j\}$, where $b_1$ to $b_j$ represent the visual words of the GSURF feature vector. Both visual vocabularies are then concatenated vertically to form a combined visual vocabulary, denoted by $W_{AB} = W_A + W_B = \{W_A; W_B\}$, of size $i + j$ visual words, so that complementary features are obtained by integrating the visual words of both descriptors.

Image representation as a histogram
In this phase, the salient objects of an image are transformed into a histogram, which is formed using the fused visual words from the complementary visual vocabulary. Let $T$ denote the total number of visual words in the complementary visual vocabulary (termed $W_{AB}$ in the previous step), and let $D_j$ denote the set of descriptors that are mapped to the $j$th visual word $ab_j$. The cardinality of $D_j$ is then the $j$th bin of the histogram. The obtained histograms are then forwarded to a classifier for learning a model that can classify images semantically.
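The vocabulary concatenation and the histogram construction can be sketched as follows. This is a minimal example under stated assumptions: descriptors are assigned to their nearest visual word by Euclidean distance, both vocabularies share the same dimensionality so that they can be stacked vertically, and the toy values and the function name `bovw_histogram` are hypothetical.

```python
import numpy as np

def bovw_histogram(descriptors, vocabulary):
    """Count, for every visual word, the descriptors mapped to it."""
    # Squared Euclidean distance from every descriptor to every visual word.
    diff = descriptors[:, None, :] - vocabulary[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)       # index of the closest visual word
    # Bin j holds the number of descriptors assigned to visual word j.
    return np.bincount(nearest, minlength=len(vocabulary))

# Toy vocabularies (hypothetical values) standing in for W_A and W_B.
W_A = np.array([[0.0, 0.0], [1.0, 1.0]])
W_B = np.array([[5.0, 5.0]])
W_AB = np.vstack([W_A, W_B])          # combined vocabulary of i + j = 3 words

descriptors = np.array([[0.1, 0.0], [0.9, 1.2], [4.8, 5.1], [5.2, 4.9]])
hist = bovw_histogram(descriptors, W_AB)
```

Here `hist` has one bin per visual word of the combined vocabulary, and its entries sum to the number of descriptors extracted from the image.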

Image classification
In this step, the proposed method uses a quadratic kernel-based SVM (QSVM) to perform image classification. The histograms of the training images, along with the labels of each class, act as inputs to the QSVM. Image classification is regarded as one of the vital steps for improving the retrieval efficiency and accuracy of any CBIR system. The SVM [46] is one of the most frequently used classifiers and has been applied in various computer vision-based applications because of its outstanding generalization ability. Given a linear training set $\{(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)\}$, where $y_i \in \{1, -1\}$ for $i = 1, 2, \ldots, n$ are the corresponding class labels, SVM classifies linear data as defined in Eq. (23). It defines decision boundaries, known as hyperplanes, by focusing on the data points that lie at the edges of the class distributions, which are known as support vectors. Mathematically, it is defined as:
$$w^{T} x + b = 0 \tag{23}$$
where $w$, $x$, and $b$ represent the weight vector, a sample point of the training set, and the bias, respectively. For the hyperplane to be optimal, SVM (i) maximizes the margin between the support vectors and (ii) reduces misclassification by introducing a slack variable $\xi$, as defined in Eq. (24):
$$\min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^{2} + R \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \left( w^{T} x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0 \tag{24}$$
where $\xi$ accounts for the samples misclassified by the corresponding hyperplane and $R$ represents the trade-off between margin maximization and misclassification error. For higher values of $R$, error reduction predominates, and for lower values of $R$, margin maximization is emphasized. For non-linear data points, the traditional SVM algorithm fails to converge, which consumes more processing time and also degrades image retrieval accuracy. As a solution, SVM uses a kernel function $k$ to map the data points into a new feature space, known as the kernel space. The transformed equation for the hyperplane is then represented as:
$$\sum_{i=1}^{n} \alpha_i y_i \, k(x_m, x_i) + b = 0$$
where $k(x_m, x_i) = \phi(x_m) \cdot \phi(x_i)$ is a kernel function that uses a non-linear mapping $\phi$, which maps the data points to the kernel space, and $\alpha_i$ is the Lagrange multiplier. Among several available kernels, the proposed method uses the polynomial kernel of degree 2, also known as the quadratic kernel. It has low running costs compared to RBF, sigmoid, and other higher-order kernels, and it also yields robust image classification performance.

Performance testing section of the methodology
As mentioned earlier, a query image is selected from the test group of images and undergoes all the steps mentioned in the training section. The similarity between the query image and the dataset images is computed using the Euclidean distance. The retrieval accuracy of the proposed method is further improved by incorporating log-based RF. The details of these two modules are presented in the following sections.
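Returning to the quadratic kernel used by the QSVM classifier, the relation $k(x_m, x_i) = \phi(x_m) \cdot \phi(x_i)$ can be checked numerically. The sketch below assumes the common inhomogeneous form $(x^{T} z + c)^2$ with $c = 1$; the constant and the explicit two-dimensional feature map `phi` are assumptions made for illustration and are not taken from the source.

```python
import numpy as np

def quadratic_kernel(x, z, c=1.0):
    """Polynomial kernel of degree 2: (x . z + c) ** 2."""
    return (np.dot(x, z) + c) ** 2

def phi(x, c=1.0):
    """Explicit feature map for 2-D inputs whose dot product
    reproduces quadratic_kernel."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1 * x1, x2 * x2, s * x1 * x2,
                     s * np.sqrt(c) * x1, s * np.sqrt(c) * x2, c])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
k_implicit = quadratic_kernel(x, z)          # evaluated in the input space
k_explicit = float(np.dot(phi(x), phi(z)))   # evaluated in the kernel space
```

Both values agree, which is the point of the kernel trick: the classifier never needs to build the six-dimensional vectors explicitly. In scikit-learn, for instance, a classifier of this kind can be requested with `SVC(kernel="poly", degree=2)`, whereas the experiments reported here were run in MATLAB.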

Retrieval of the images based on the similarity measure
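Given a query image $q$, a set of similar images is retrieved by computing a relevance score between the query image and the images in the dataset, denoted as $I_{DB}$. For this purpose, the Euclidean distance is used as the measure of the relevance score. Mathematically, it is defined as follows:
$$d(q, I_{DB}) = \sqrt{\sum_{k} \left( q_k - I_{DB,k} \right)^{2}}$$
where $q_k$ and $I_{DB,k}$ denote the $k$th components of the feature representations of the query image and of a dataset image, respectively.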
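The ranking step can be sketched as follows, assuming that every image is represented by its visual-word histogram and that a smaller Euclidean distance means higher relevance. The function names `euclidean` and `retrieve`, the image identifiers, and the histogram values are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def retrieve(query_hist, database, top_k=3):
    """Return the top_k (image_id, distance) pairs, closest first."""
    scored = [(image_id, euclidean(query_hist, hist))
              for image_id, hist in database.items()]
    return sorted(scored, key=lambda pair: pair[1])[:top_k]

# Toy database (hypothetical histograms over three visual words).
database = {"img_a": [3, 0, 1], "img_b": [0, 4, 0], "img_c": [2, 1, 1]}
ranking = retrieve([3, 0, 0], database, top_k=2)
```

For this toy query, `img_a` is returned first because its histogram differs from the query in a single bin, followed by `img_c`.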

Log-based relevance feedback
To improve the performance of image retrieval, the proposed method uses the log-based relevance feedback (LRF) method for CBIR. It integrates user feedback with low-level features to further improve the learning process of a CBIR system. Traditional relevance feedback (RF) methods for CBIR require several iterations to return satisfactory results, which is time-consuming and tedious from a user's perspective. In [47], an active learning approach is proposed that requires a user to label extra images retrieved by the system as most informative. CBIR methods based on RF have been studied extensively, and one based on RF logs is presented in [48]. The proposed method uses the LRF method, which starts with a query image (represented by $q$) and its corresponding retrieved images (represented by $N$), which are marked by a user as relevant or irrelevant. The user judgment is then saved in a history log, and a relevance matrix $R$ is created from all log sessions. For relevant, irrelevant, and non-judged images in the log sessions, a cell in $R$ is marked as $+1$, $-1$, and $0$, respectively. The LRF method aims to find a function $f_q$ that maps images to a relevance degree between 0 and 1.
Because the LRF method uses both low-level features (i.e., BFGF-HOG and GSURF features) and log sessions, the overall function $f_q$ is defined in terms of $f_R$ and $f_x$, the relevance functions based on the log data and on the low-level features of the images, respectively. To find the relevance between two images $I_i$ and $I_j$, the correlation between the log data $l_i$ and $l_j$ of these images is calculated. The $f_R$ for a log session $k$ is then calculated using the correlation function $corF_{k,i}$ and the sets $L^{+}$ and $L^{-}$ of relevant and irrelevant images, respectively.
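The construction of the relevance matrix from the feedback log can be sketched as below. The matrix layout (rows as log sessions, columns as images) follows the description above, while the correlation function is only an illustrative assumption: it sums the products of the judgments of two images over all sessions and ignores sessions in which both were marked irrelevant. It is not the formula of the LRF method, and the names `build_relevance_matrix` and `log_correlation` are hypothetical.

```python
def build_relevance_matrix(sessions, image_ids):
    """Rows are log sessions, columns are images; entries are +1 (relevant),
    -1 (irrelevant), or 0 (not judged)."""
    index = {image_id: col for col, image_id in enumerate(image_ids)}
    R = [[0] * len(image_ids) for _ in sessions]
    for row, session in enumerate(sessions):
        for image_id in session.get("relevant", []):
            R[row][index[image_id]] = 1
        for image_id in session.get("irrelevant", []):
            R[row][index[image_id]] = -1
    return R

def log_correlation(R, i, j):
    """Assumed correlation between the log columns of images i and j."""
    total = 0
    for row in R:
        if row[i] == -1 and row[j] == -1:
            continue  # both irrelevant: no evidence that the images are alike
        total += row[i] * row[j]
    return total

# Toy log (hypothetical): two feedback sessions over three images.
image_ids = ["a", "b", "c"]
sessions = [
    {"relevant": ["a", "b"], "irrelevant": ["c"]},
    {"relevant": ["a"], "irrelevant": ["b"]},
]
R = build_relevance_matrix(sessions, image_ids)
corr_ab = log_correlation(R, 0, 1)
corr_ac = log_correlation(R, 0, 2)
```

In this toy log, images `a` and `b` agree in one session and disagree in the other, so their correlation cancels out, whereas `a` and `c` were judged oppositely in the only session that covers both.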

Evaluation metrics, results of the experiments, and discussions
This section describes the chosen datasets, evaluation metrics, experimental results, and discussions of the proposed method. The experimental results are reported by performing each experiment five times to ensure consistent performance. The details of the evaluation metrics are presented in the following sections.

Evaluation metrics
To assess the performance of our proposed method, the evaluation metrics that we have used are described in detail in the subsequent sections.

Precision
The accuracy of a CBIR system in retrieving relevant images (images that belong to the same semantic class of the dataset) according to the visual contents of a query image is evaluated by precision ($P$), which is the ratio of the images retrieved as relevant to the total number of retrieved images. Mathematically, it is defined as follows:
$$P = \frac{\text{number of relevant images retrieved}}{\text{total number of images retrieved}}$$

Average precision
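Average precision ($P_{avg}$) computes the average of the precision scores ($P$) of all relevant retrieved images. Mathematically, it is described as:
$$P_{avg} = \frac{1}{n} \sum_{j=1}^{n} P(j)$$
where $P(j)$ represents the precision value of the $j$th iteration and $n$ is the number of precision values that are averaged.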

Mean average precision
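The mean average precision (mAP) computes the average of the $P_{avg}$ values. Mathematically, it is expressed as follows:
$$mAP = \frac{1}{k} \sum_{i=1}^{k} P_{avg}(i)$$
where $k$ represents the number of query images and $P_{avg}(i)$ is the average precision of the $i$th query.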

Recall
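The ratio of the images retrieved as relevant to the number of relevant images available in the dataset is known as recall ($R$). It is defined as follows:
$$R = \frac{\text{number of relevant images retrieved}}{\text{total number of relevant images in the dataset}}$$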

F-measure
The overall success of an image retrieval system and its efficiency can also be assessed using the F-measure, which combines precision and recall as given in the equation below:
$$F = \frac{2 \cdot P \cdot R}{P + R}$$
where $P$ and $R$ denote precision and recall, respectively.
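The evaluation metrics of this section can be computed with a few small functions. The sketch below assumes the common ranked-list form of average precision (the mean of the precision values at the ranks where a relevant image appears) and the harmonic-mean form of the F-measure; the function names and the counts used in the example are hypothetical.

```python
def precision(n_relevant_retrieved, n_retrieved):
    return n_relevant_retrieved / n_retrieved

def recall(n_relevant_retrieved, n_relevant_in_dataset):
    return n_relevant_retrieved / n_relevant_in_dataset

def f_measure(p, r):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def average_precision(ranked_relevance):
    """Mean of the precision values at each rank where a relevant image appears."""
    hits, precisions = 0, []
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_query_relevance):
    """Average of the per-query average precision values."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps)

# Hypothetical query: 20 images retrieved, 16 of them relevant,
# 100 relevant images available in the dataset.
p = precision(16, 20)
r = recall(16, 100)
f = f_measure(p, r)
ap = average_precision([True, False, True])
m_ap = mean_average_precision([[True, False, True], [False, True]])
```

Precision and recall pull in opposite directions as more images are retrieved, which is why the F-measure and mAP are reported alongside them.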

Datasets, experimental parameters, results, and discussions
The performance of the proposed method and its competitor methods is assessed on four standard CBIR image datasets: Corel 1000, Corel 1500, Scene 15, and Holidays. Table 2 presents the details of the experimental parameters that are used to analyze the performance of the proposed method.

Comparative performance analysis on the Corel 1000 image dataset
The Corel 1000 [49] image dataset comprises a total of 1000 images, which are divided among 10 semantic categories, each having 100 images with a resolution of 256 × 384 pixels or 384 × 256 pixels. The categories of the images included in this dataset are buses, flowers, buildings, mountains, dinosaurs, human beings, food, landscape, elephants, and horses. Figure 5 presents sample images taken from each semantic category of the Corel 1000 image dataset. The experimental results of the non-adaptive complementary feature integration (first competitor method), the non-adaptive complementary visual words integration (second competitor method), and the proposed adaptive complementary visual words integration method using different sizes of the visual vocabulary are presented in Figs. 6, 10, 14, and 16. From the experimental results presented in these figures, it can be deduced that the proposed system based on adaptive complementary visual word integration produces more robust performance than its competitor CBIR methods on all the specified datasets. The proposed method performs best with a visual vocabulary of 600 visual words, on which it achieves an mAP of 89.91% for the Corel 1000 dataset. Tables 3, 4, 5, and 6 present the performance comparison of the proposed method with state-of-the-art image retrieval methods.
It can be concluded from the experimental results that the proposed method gives promising results compared to its competitor CBIR methods for the following reasons: (a) it uses a complementary visual feature representation for the salient contents of the images; (b) it uses an adaptive feature weighting method based on self-paced learning to select optimized features for each image; (c) it uses a complementary visual vocabulary of twice the size to represent the salient contents of each image; (d) it uses a quadratic kernel-based SVM (QSVM) to achieve robust image classification results, which ultimately improve the similarity measure process in the proposed CBIR method; and (e) it uses the log-based relevance feedback (LRF) mechanism for CBIR, which integrates user feedback with low-level complementary features to further improve the learning process of a CBIR system. Figures 7 and 8 show the results of the image retrieval according to the salient objects of the query images. The query images (first row) of Figs. 7 and 8 are taken from the "Dinosaurs" and "Horses" categories of the Corel 1000 dataset, respectively. Furthermore, Fig. 7 shows the result of LRF-0 image retrieval, where the integer value after LRF denotes the feedback iteration. The images shown in Fig. 8 are the result of the image retrieval after applying LRF-1 and are semantically more relevant to the query image than the LRF-0 retrieval result.

Comparative performance analysis on the Corel 1500 image dataset
By varying the size of the visual vocabulary, the mAP performance of the proposed method and its comparison with the competitor methods are presented in Fig. 10. After analyzing the experimental details, it can be deduced that the proposed method outperforms its competitor methods on the Corel 1500 image dataset. The best mAP performance of the proposed method, 83.99%, is obtained with a visual vocabulary of 1000 visual words. Table 4 presents the performance comparison of the proposed method against competitive methods in terms of the CBIR performance evaluation metrics. Based on the experimental details shown in Table 4, it can also be concluded that the proposed method outperforms its comparative methods due to the factors mentioned in Section 4.2.1. The results of the image retrieval using the proposed method according to the salient objects of the query images of the Corel 1500 image dataset are shown in Figs. 11 and 12 for the semantic categories "Sunset" and "Postcard," respectively.

Comparative performance analysis on the Scene 15 image dataset
The Scene 15 dataset [51] comprises 4485 gray-scale images, divided into 15 scene categories. This dataset contains images of indoor as well as outdoor scenes. There are 200 to 400 images in each semantic class of this dataset, and the resolution of each image is 300 × 250 pixels. Figure 13 shows sample images from each semantic class of the Scene 15 image dataset. Figure 14 shows the performance comparison of the proposed method with its competitor methods in terms of mAP for different sizes of the visual vocabulary. On the Scene 15 image dataset, the best mAP performance of the proposed method against its competitor CBIR methods, 83.11%, is attained with a visual vocabulary of 1000 visual words. To further analyze the robustness of the proposed method, it is compared with state-of-the-art CBIR methods in terms of standard performance evaluation metrics, whose details are presented in Table 5 for the Scene 15 image dataset. Several factors, such as the robust complementary image representation, the efficient and effective adaptive feature weighting of visual words, and the use of twice as many visual words for the key objects of the image, result in the robust performance of the proposed method as compared to its competitor CBIR methods.
Table 5 Comparative analysis of competitive methods with the proposed method on the Scene 15 dataset. Columns: Performance parameters, Proposed method, MTBCD [13], Optimized TPTSSR [26], MO-BoF [27], Hybrid [35], EODH-color SIFT [28], BMM-FV CNN [17], Modified VLAD [29], Att. features+Fisher vectors [30], Fisher kernel-GMM [31].

Comparative performance analysis on the Holidays image dataset
The Holidays image dataset [52] contains 1491 images, of which 500 are query images and the remaining 991 are the corresponding relevant images. Sample images from this dataset are shown in Fig. 15.
The experimental details and a comparative analysis of the effect of varying the size of the visual vocabulary on the mAP performance of the proposed method and its competitor methods are presented in Fig. 16 for the Holidays image dataset. The proposed method produces its best mAP performance of 72.85% with a visual vocabulary of 800 visual words. The second competitor method, non-adaptive complementary visual words integration, produces its best mAP performance of 62.53% with a visual vocabulary of 600 visual words. Similarly, the best mAP performance produced by the first competitor method, non-adaptive complementary features integration, is 57.14%, which is also attained with a visual vocabulary of 600 visual words on the Holidays image dataset. The performance comparison of the proposed method with state-of-the-art CBIR methods is presented in Table 6, which shows that the proposed method produces robust performance as compared to recent CBIR methods in terms of performance evaluation metrics.

Required hardware/software resources and computational cost
The computational cost of the proposed method is measured on a desktop PC with the following hardware and software specifications: Intel(R) Core(TM) i3 CPU (2.1 GHz, 2310M series), 8 GB of RAM, 120 GB SSD, Windows 7 Professional (64-bit), and MATLAB (2015b, 64-bit). The computational cost of the proposed method based on adaptive complementary visual words integration and its comparison with other competitive CBIR methods are presented in Table 7 for the Corel 1000 image dataset.
Table 7 Computational time (in seconds) of the proposed method as compared to competitive CBIR methods. Columns: Proposed method, ATR+SOFT method [9], EODH method [28], Spatial L2 method [32], RSHD method [33], WATH method [34].
In this article, we explored the effect of adaptive feature weighting and adaptive fuzzy k-means clustering on the robust representation of the principal objects of images by integrating complementary visual words of local and global features based on the BoVW methodology. Latent semantic analysis is applied to the adaptive feature weighting to reduce the computational complexity of the proposed method, which is slightly increased by the integration of the complementary visual words. The classification accuracy of the proposed method is improved using a quadratic kernel-based SVM, which ultimately improves the similarity measure process of the CBIR. The log-based relevance feedback mechanism is also introduced in the proposed method to further improve the performance of the CBIR. The proposed adaptive complementary visual words integration method is compared with a non-adaptive complementary feature integration method and a non-adaptive complementary visual words integration method that use the same local and global features, as well as with state-of-the-art CBIR methods. It can be concluded that the integration of adaptive complementary visual words significantly improves the performance of the CBIR as compared with the non-adaptive complementary features integration and non-adaptive complementary visual words integration methods, owing to the assignment of twice as many visual words to the salient objects of the images. In future work, given the rapidly increasing volume of image and video databases, the performance of the proposed method can be analyzed using normalized discriminative deep learning-based compressed-domain methods such as JPEG-2000 and HEVC to improve the accuracy and efficiency of content-based video and image retrieval systems.