This section first describes the SPM model used to represent images. It then introduces the proposed feature fusion-based LMKL algorithm and its optimization problem in detail. Finally, it discusses the optimization strategy used to solve that problem.

### Image representation by SPM model

Introducing the bag-of-words (BoW) model for computing image features significantly improved the performance of image classification systems [30]. Pyramid matching is a BoW-based model that approximates the similarity between two images [31]. In this model, a pyramid of grids is placed on the feature space at different resolutions. At each resolution level, the corresponding histogram of the image is computed. A weighted sum of the histograms is then formed such that finer resolutions receive higher weights. Finally, the intersection kernel is applied to the weighted histograms of two images to approximate their correspondence. The main shortcoming of pyramid matching is that it discards the spatial information of images, which plays an important role in the performance of image classification systems. Lazebnik et al. proposed the spatial pyramid matching (SPM) approach to address this problem [1]. Extending BoW, the SPM method divides the original image into sub-regions in a pyramid manner and computes a histogram of features in each sub-region separately. The final representation of the image is the concatenation of the extracted histograms.
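The construction of an SPM representation can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: it assumes local features have already been quantized into visual words, that their positions are normalized to the unit square, and it omits any per-level weighting. The function name `spm_histogram` and its parameters are hypothetical.

```python
import numpy as np

def spm_histogram(points, words, vocab_size, levels=2):
    """Concatenate per-cell visual-word histograms over a spatial pyramid.

    points: (n, 2) feature locations, assumed scaled to [0, 1) x [0, 1).
    words:  (n,) visual-word index of each feature, in [0, vocab_size).
    levels: grids of 1x1, 2x2, ..., 2^levels x 2^levels cells are used.
    """
    points = np.asarray(points, dtype=float)
    words = np.asarray(words, dtype=int)
    blocks = []
    for level in range(levels + 1):
        cells = 2 ** level
        # Grid cell of every feature at this resolution.
        cx = np.minimum((points[:, 0] * cells).astype(int), cells - 1)
        cy = np.minimum((points[:, 1] * cells).astype(int), cells - 1)
        cell_id = cy * cells + cx
        # One histogram of visual words per cell.
        hist = np.zeros((cells * cells, vocab_size))
        np.add.at(hist, (cell_id, words), 1.0)
        blocks.append(hist.ravel())
    return np.concatenate(blocks)

# Three features, a vocabulary of two words, and a two-level pyramid (1x1 and 2x2).
example = spm_histogram(
    points=[[0.1, 0.1], [0.9, 0.9], [0.6, 0.2]],
    words=[0, 1, 0],
    vocab_size=2,
    levels=1,
)
```

The resulting vector has one block of `vocab_size` bins per grid cell, so its length grows with the number of pyramid levels.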

### Preliminaries and formulation of feature fusion based LMKL

Consider the classification task as \( D={\left\{\left({x}_i,{y}_i\right)\right\}}_{i=1}^N \) where *N* is the number of samples, \( {x}_i \) denotes the *i*th sample and \( {y}_i\in \left\{\pm 1\right\} \) is the corresponding label for binary classification. In the MKL framework, multiple kernels are combined as follows:

$$ K\left({x}_i,{x}_j\right)=\sum \limits_{k=1}^m{\pi}_k{K}_k\left({x}_i,{x}_j\right) $$

(1)

where *m* is the number of kernels and \( {\pi}_k \) is the weight of the *k*th kernel.
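As a small illustration of (1), the sketch below combines two precomputed Gram matrices with fixed weights. The toy data, the choice of a linear and an RBF kernel, and the weights 0.3 and 0.7 are assumptions made only for this example; `combine_kernels` is a hypothetical helper name.

```python
import numpy as np

def combine_kernels(kernels, weights):
    """Return sum_k weights[k] * kernels[k] for Gram matrices of shape (m, N, N)."""
    kernels = np.asarray(kernels, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Contract the kernel index k, leaving an (N, N) matrix.
    return np.tensordot(weights, kernels, axes=1)

# Toy data: 4 samples with 3 features each (illustrative values only).
X = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
linear = X @ X.T
sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
rbf = np.exp(-0.5 * sq_dist)

K = combine_kernels([linear, rbf], [0.3, 0.7])
```

Every pair of samples is weighted identically here, which is the property the localized formulation relaxes.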

The discriminant function \( f\left({x}_i\right) \) for a test sample \( {x}_i \) in the standard MKL framework is formulated as follows:

$$ f\left({x}_i\right)=\sum \limits_{k=1}^m{\pi}_k\left\langle {w}_k^T,{\varphi}_k\left({x}_i\right)\right\rangle +b $$

(2)

where \( {\varphi}_k\left({x}_i\right) \) represents the *k*th mapping function, and \( {w}_k \) and *b* are the SVM parameters.

The standard MKL framework assigns fixed weights to kernels over the entire input space. As discussed in section 1.2, because of the large intraclass variance and the interclass relationships in complicated spaces such as an image feature space, a single fixed set of kernel weights is not suitable. For example, in some cases a kernel based on color information is more informative than a texture-based kernel. Therefore, a more accurate classifier can be achieved if a kernel is assigned different weights in different regions of the space.

Gönen and Alpaydin proposed a localized MKL (LMKL) framework in which the weights of kernels are calculated distinctly for each training sample [19]. The localized version of \( K\left({x}_i,{x}_j\right) \) is as follows:

$$ K\left({x}_i,{x}_j\right)=\sum \limits_{k=1}^m{\pi}_k\left({x}_i\right){\pi}_k\left({x}_j\right){K}_k\left({x}_i,{x}_j\right) $$

(3)

where \( {\pi}_k\left({x}_i\right) \) is the weight of the *k*th kernel corresponding to \( {x}_i \).

In the original LMKL framework, Gönen and Alpaydin assumed that all kernels are computed from a single feature. In the proposed algorithm, the kernels are computed from multiple features. As discussed in section 1.1, using multiple features instead of a single one results in a more accurate classifier in an image classification task. The kernel value between two images \( {x}_i \) and \( {x}_j \) is computed as follows:

$$ K\left({x}_i,{x}_j\right)=\sum \limits_{k=1}^m{\pi}_k\left({x}_i^k\right){\pi}_k\left({x}_j^k\right){K}_k\left({x}_i^k,{x}_j^k\right) $$

(4)

where \( {x}_i^k \) is the representation of training sample \( {x}_i \) corresponding to the *k*th feature.
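The localized combination in (4) can be sketched numerically as below. The two random feature matrices stand in for two image descriptors and are purely illustrative, as are the linear per-feature kernels and the random gate values; `localized_kernel` is a hypothetical helper name.

```python
import numpy as np

def localized_kernel(kernels, gates):
    """Combine per-feature Gram matrices with sample-dependent weights.

    kernels: (m, N, N), kernels[k, i, j] = K_k(x_i^k, x_j^k)
    gates:   (m, N),    gates[k, i]      = pi_k(x_i^k)
    """
    kernels = np.asarray(kernels, dtype=float)
    gates = np.asarray(gates, dtype=float)
    # For each k, scale K_k by pi_k(x_i) * pi_k(x_j), then sum over k.
    return np.einsum("ki,kj,kij->ij", gates, gates, kernels)

rng = np.random.default_rng(0)
feature_a = rng.random((4, 3))   # hypothetical first descriptor, 4 images
feature_b = rng.random((4, 5))   # hypothetical second descriptor, same images
kernels = np.stack([feature_a @ feature_a.T, feature_b @ feature_b.T])
gates = rng.random((2, 4))       # illustrative gate values

K_loc = localized_kernel(kernels, gates)
```

Setting every gate value to one recovers the plain sum of the per-feature kernels.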

The combined kernel of (4) changes the standard kernel-based margin maximization problem of SVM into a non-convex optimization problem. Instead of solving this difficult optimization problem directly, Gönen and Alpaydin estimated the kernel weights by using a gating function.

A gating function models the effectiveness of the *k*th kernel in the classification of sample \( {x}_i \). There are several ways to define the gating function. The sigmoid function in (5) is a good choice and was used by Gönen and Alpaydin [19]:

$$ {\pi}_k\left({x}_i^k\right)=1/\left(1+\exp \left(-\left\langle {v}_k,{x}_i^k\right\rangle -{v}_{k0}\right)\right) $$

(5)

where \( {v}_k \) and \( {v}_{k0} \) are the parameters of the gating function. As stated before, \( {x}_i^k \) is the representation of training sample \( {x}_i \) corresponding to the *k*th feature, which takes the form of an SPM histogram. Comparing SPM histograms by their inner product is not accurate enough; the \( {\chi}^2 \) kernel is a better choice for histogram comparison. Therefore, we modified the gating function of (5) by using the \( {\chi}^2 \) kernel instead of the inner product. The \( {\chi}^2 \) kernel-based gating function is as follows:

$$ {\pi}_k\left({x}^k\right)=1/\left(1+\exp \left(-{\chi}^2\left({v}_k,{x}^k\right)-{v}_{k0}\right)\right) $$

(6)

where the \( {\chi}^2 \) kernel is defined as:

$$ {\chi}^2\left({v}_k(i),x(i)\right)=2{v}_k(i)x(i)/\left({v}_k(i)+x(i)\right),\kern1em i=1,\dots, DG $$

(7)

and *DG* is the dimension of the feature space.

Because of the efficiency of the \( {\chi}^2 \) kernel in computing the similarity of SPM histograms, we use the following gating function as well:

$$ {\pi}_k\left({x}^k\right)={\chi}^2\left({v}_k,{x}^k\right)+{v}_{k0} $$

(8)
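The two gating functions in (6) and (8) can be sketched as follows. The sketch assumes that the per-bin terms of (7) are summed over the histogram bins to give a scalar similarity, and it adds a small constant `eps` to guard against empty bins; both are assumptions of this example rather than statements from the text. All function names are hypothetical.

```python
import numpy as np

def chi2_similarity(v, x, eps=1e-12):
    """Sum over bins of 2 * v(i) * x(i) / (v(i) + x(i)); eps avoids 0/0 on empty bins."""
    v = np.asarray(v, dtype=float)
    x = np.asarray(x, dtype=float)
    return np.sum(2.0 * v * x / (v + x + eps), axis=-1)

def sigmoid_chi2_gate(v, v0, x):
    """Gating function (6): sigmoid of the chi-square similarity plus bias."""
    return 1.0 / (1.0 + np.exp(-chi2_similarity(v, x) - v0))

def linear_chi2_gate(v, v0, x):
    """Gating function (8): chi-square similarity plus bias, no sigmoid."""
    return chi2_similarity(v, x) + v0

# Two normalized histograms with two bins each (illustrative values only).
v_k = np.array([0.5, 0.5])
x_k = np.array([0.5, 0.5])
g_sigmoid = sigmoid_chi2_gate(v_k, 0.0, x_k)
g_linear = linear_chi2_gate(v_k, 0.1, x_k)
```

Passing an `(N, DG)` array for `x` returns one gate value per sample, because the sum runs over the last axis.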

### Optimization strategy

Plugging the local kernel weights into the standard MKL formulation yields the following optimization problem:

$$ {\displaystyle \begin{array}{c}{\min}_{\left\{{w}_k\right\},b,\left\{{\xi}_i\right\},\left\{{v}_k\right\},\left\{{v}_{k0}\right\}}\frac{1}{2}\sum \limits_{k=1}^m{\left\Vert {w}_k\right\Vert}^2+C\sum \limits_{i=1}^N{\xi}_i\\ {}\mathrm{subject}\ \mathrm{to}\kern1em {y}_i\left(\sum \limits_{k=1}^m{\pi}_k\left({x}_i^k\right)\left\langle {w}_k,{\varPhi}_k\left({x}_i^k\right)\right\rangle +b\right)\ge 1-{\xi}_i,\kern1em {\xi}_i\ge 0,\kern1em i=1,\dots, N\end{array}} $$

(9)

where *C* is the regularization parameter and the \( {\xi}_i \) are the slack variables.

Since standard MKL is a convex optimization problem, it can be solved by common optimization methods. Combining nonlinear gating functions with the standard MKL problem turns it into a nonlinear, non-convex problem. This problem can be solved using alternating optimization, an iterative two-step approach. In step one, some parameters are held fixed and the others are computed by solving the resulting optimization problem. In step two, the parameters computed in step one are held fixed and the remaining parameters are computed by solving the new optimization problem. The algorithm iterates until convergence. We considered two termination criteria: reaching a maximum number of iterations, and the change in the objective function falling below a predefined threshold.
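The two-step loop and its two termination criteria can be sketched generically as below. This is only a skeleton under stated assumptions: in the LMKL setting, step one would be the SVM solve and step two the gating-parameter update, but here both are replaced by closed-form minimizers of a toy quadratic so that the example stays self-contained. The names `alternate_optimize`, `step_one`, `step_two` and the toy objective are hypothetical.

```python
def alternate_optimize(step_one, step_two, objective, theta, max_iter=50, tol=1e-6):
    """Alternate between two parameter blocks until the objective stabilizes.

    step_one(theta) -> params          solve for the first block, theta fixed
    step_two(params, theta) -> theta   update the second block, params fixed
    objective(params, theta) -> float
    """
    previous = float("inf")
    params = None
    for iteration in range(1, max_iter + 1):      # criterion 1: iteration cap
        params = step_one(theta)
        theta = step_two(params, theta)
        current = objective(params, theta)
        if abs(previous - current) < tol:         # criterion 2: small objective change
            break
        previous = current
    return params, theta, iteration

# Toy stand-in problem: minimize (a - 1)^2 + (b - 2)^2 + a * b over a and b.
def objective(a, b):
    return (a - 1.0) ** 2 + (b - 2.0) ** 2 + a * b

def step_one(b):
    return 1.0 - b / 2.0      # exact minimizer over a with b fixed

def step_two(a, b):
    return 2.0 - a / 2.0      # exact minimizer over b with a fixed

a_opt, b_opt, n_iter = alternate_optimize(step_one, step_two, objective, theta=0.0)
```

The toy problem is convex, so the loop reaches its global minimizer; for the non-convex LMKL problem the same loop can only be expected to reach a local solution.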

*Step one: Learning SVM parameters.*

In this step, problem (9) is minimized with respect to \( {w}_k \), \( {\xi}_i \) and *b*, while \( {v}_k \) and \( {v}_{k0} \) are fixed. To remove the constraints, the Lagrangian of problem (9) is formed:

$$ L\left(\left\{{w}_k\right\},b,\left\{{\xi}_i\right\},\left\{{\lambda}_i\right\},\left\{{\eta}_i\right\}\right)=\frac{1}{2}\sum \limits_{k=1}^m{\left\Vert {w}_k\right\Vert}^2+\sum \limits_{i=1}^N\left(C-{\lambda}_i-{\eta}_i\right){\xi}_i+\sum \limits_{i=1}^N{\lambda}_i-\sum \limits_{i=1}^N{\lambda}_i{y}_i\left(\sum \limits_{k=1}^m{\pi}_k\left({x}_i^k\right)\left\langle {w}_k,{\varPhi}_k\left({x}_i^k\right)\right\rangle +b\right) $$

(10)

where \( {\lambda}_i \) and \( {\eta}_i \) are the Lagrange multipliers.

Setting the derivatives of (10) with respect to \( {w}_k \), *b* and \( {\xi}_i \) to zero gives:

$$ {\displaystyle \begin{array}{c}\partial L/\partial {w}_k=0\Rightarrow {w}_k-\sum \limits_{i=1}^N{\lambda}_i{y}_i{\pi}_k\left({x}_i^k\right){\varPhi}_k\left({x}_i^k\right)=0\\ {}\partial L/\partial b=0\Rightarrow \sum \limits_{i=1}^N{\lambda}_i{y}_i=0\ \\ {}\partial L/\partial {\xi}_i=0\Rightarrow C-{\lambda}_i-{\eta}_i=0\end{array}} $$

(11)

Substituting (11) into (10), the dual problem is obtained:

$$ {\displaystyle \begin{array}{l}J={\max}_{\left\{{\lambda}_i\right\}}\sum \limits_{i=1}^N{\lambda}_i-\frac{1}{2}\sum \limits_{i=1}^N\sum \limits_{j=1}^N{\lambda}_i{\lambda}_j{y}_i{y}_j\sum \limits_{k=1}^m{\pi}_k\left({x}_i^k\right){\pi}_k\left({x}_j^k\right){K}_k\left({x}_i^k,{x}_j^k\right)\\ {}\mathrm{such}\ \mathrm{that}\kern0.5em \sum \limits_{i=1}^N{\lambda}_i{y}_i=0,\kern0.5em 0\le {\lambda}_i\le C\end{array}} $$

(12)

where \( {K}_k\left({x}_i^k,{x}_j^k\right)=\left\langle {\varPhi}_k\left({x}_i^k\right),{\varPhi}_k\left({x}_j^k\right)\right\rangle \).

If the localized weighted sum of kernels \( {\sum}_{k=1}^m{\pi}_k\left({x}_i^k\right){\pi}_k\left({x}_j^k\right){K}_k\left({x}_i^k,{x}_j^k\right) \) is shown to be a positive semidefinite kernel, then (12) can be solved as a standard canonical SVM problem.

To prove that the localized weighted sum of kernels is positive semidefinite, we use the definition of a quasi-conformal transformation. For a positive function *c*(*x*), a quasi-conformal transformation of *K*(*x*, *y*) is defined as follows:

$$ \tilde{K}\left(x,y\right)=c(x)c(y)K\left(x,y\right) $$

(13)

The gating functions in (6) and (8) used in our experiments always produce positive values; therefore, each term \( {\pi}_k\left({x}_i^k\right){\pi}_k\left({x}_j^k\right){K}_k\left({x}_i^k,{x}_j^k\right) \) in (4) is a quasi-conformal transformation of \( {K}_k\left({x}_i^k,{x}_j^k\right) \). Positive semidefinite kernels are closed under quasi-conformal transformation [32], so each such term is a positive semidefinite kernel. Moreover, a sum of positive semidefinite kernels is itself positive semidefinite. Thus, \( {\sum}_{k=1}^m{\pi}_k\left({x}_i^k\right){\pi}_k\left({x}_j^k\right){K}_k\left({x}_i^k,{x}_j^k\right) \) is a positive semidefinite kernel as well, and (12) is a canonical SVM problem that can be solved by common approaches.
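The argument above can be checked numerically on a small example. The sketch below builds three Gram matrices from random features (so each is positive semidefinite by construction), applies strictly positive random gate values, and inspects the smallest eigenvalue of the localized sum. The data, sizes and tolerance are illustrative assumptions, and a numerical check of this kind does not replace the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, num_kernels = 6, 3

# Each base kernel is a Gram matrix F F^T, hence positive semidefinite.
base_kernels = []
for _ in range(num_kernels):
    features = rng.random((num_samples, 4))
    base_kernels.append(features @ features.T)
base_kernels = np.stack(base_kernels)

# Strictly positive gate values pi_k(x_i^k).
gate_values = rng.random((num_kernels, num_samples)) + 0.1

# Localized sum: sum_k pi_k(x_i) pi_k(x_j) K_k(x_i, x_j).
K_localized = np.einsum("ki,kj,kij->ij", gate_values, gate_values, base_kernels)

# eigvalsh is appropriate because the matrix is symmetric.
min_eigenvalue = np.linalg.eigvalsh(K_localized).min()
is_psd = min_eigenvalue >= -1e-10   # small tolerance for round-off
```

In matrix form each term is \( D_k K_k D_k \) with \( D_k \) the diagonal matrix of gate values, which makes the preservation of positive semidefiniteness easy to see.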

*Step Two: Learning locality function parameters.*

To determine the parameters of the gating functions, we use gradient descent: the derivatives of the dual objective (12) are calculated with respect to \( {v}_k \) and \( {v}_{k0} \) while \( \left\{{w}_k\right\} \), *b* and \( {\xi}_i \) are fixed. The step size of each iteration is determined by a line search method. Taking the derivatives of (12) with respect to \( {v}_k \) and \( {v}_{k0} \), we obtain:

$$ {\displaystyle \begin{array}{l}\partial J/\partial {v}_k=-\frac{1}{2}\sum \limits_{i=1}^N\sum \limits_{j=1}^N{\lambda}_i{\lambda}_j{y}_i{y}_j{K}_k\left({x}_i^k,{x}_j^k\right)\left({\pi}_k^{\prime}\left({x}_i^k\right){\pi}_k\left({x}_j^k\right)+{\pi}_k\left({x}_i^k\right){\pi}_k^{\prime}\left({x}_j^k\right)\right)\\ {}\partial J/\partial {v}_{k0}=-\frac{1}{2}\sum \limits_{i=1}^N\sum \limits_{j=1}^N{\lambda}_i{\lambda}_j{y}_i{y}_j{\pi}_k\left({x}_i^k\right){\pi}_k\left({x}_j^k\right){K}_k\left({x}_i^k,{x}_j^k\right)\left(2-{\pi}_k\left({x}_i^k\right)-{\pi}_k\left({x}_j^k\right)\right)\end{array}} $$

(14)

where \( {\pi}_k^{\prime }(x) \) is defined by (15) for the \( {\chi}^2 \) gating function of (8),

$$ {\left\{2\left(x(i)\left(x(i)+{v}_k(i)\right)-x(i){v}_k(i)\right)/{\left(x(i)+{v}_k(i)\right)}^2\right\}}_{i=1}^{DG} $$

(15)

and \( {\pi}_k^{\prime }(x) \) is defined by (16) for the \( {\chi}^2 \) kernel-based sigmoid function of (6),

$$ A\exp \left(-{\chi}^2\left({v}_k,{x}^k\right)-{v}_{k0}\right)/{\left(1+\exp \left(-{\chi}^2\left({v}_k,{x}^k\right)-{v}_{k0}\right)\right)}^2 $$

(16)

where *A* is equal to (15).
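The gradient with respect to \( {v}_k \) can be sanity-checked against finite differences. The sketch below does this for the linear \( {\chi}^2 \) gate of (8) and a single kernel, since only the *k*th term of the dual objective depends on \( {v}_k \). It assumes the per-bin terms of (7) are summed over bins, uses random stand-in values for the multipliers and labels, and treats the kernel matrix as symmetric; all function names are hypothetical.

```python
import numpy as np

def gate(v, v0, X):
    """Linear chi-square gate (8), one value per row of X."""
    return np.sum(2.0 * v * X / (v + X), axis=1) + v0

def gate_grad_v(v, X):
    """Per-bin derivative (15), simplified to 2 x^2 / (x + v)^2; shape (N, DG)."""
    return 2.0 * X ** 2 / (X + v) ** 2

def dual_term(v, v0, X, K, a):
    """Part of J that depends on v: -1/2 sum_ij a_i a_j pi_i pi_j K_ij, a = lambda * y."""
    p = gate(v, v0, X)
    return -0.5 * (a * p) @ K @ (a * p)

def dual_grad_v(v, v0, X, K, a):
    """Analytic gradient of dual_term with respect to v (K assumed symmetric)."""
    p = gate(v, v0, X)
    return -gate_grad_v(v, X).T @ (a * (K @ (a * p)))

rng = np.random.default_rng(0)
X = rng.random((5, 4)) + 0.1            # stand-in histograms, strictly positive
v = rng.random(4) + 0.1
v0 = 0.2
K = X @ X.T                             # symmetric stand-in kernel matrix
a = rng.random(5) * np.array([1.0, -1.0, 1.0, -1.0, 1.0])   # lambda_i * y_i

analytic = dual_grad_v(v, v0, X, K, a)

# Central finite differences, one coordinate of v at a time.
h = 1e-6
numeric = np.zeros_like(v)
for d in range(v.size):
    step = np.zeros_like(v)
    step[d] = h
    numeric[d] = (dual_term(v + step, v0, X, K, a) - dual_term(v - step, v0, X, K, a)) / (2 * h)
```

A check of this kind is useful before plugging the analytic gradient into a line search, because an error in sign or scaling shows up immediately as a mismatch.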

The block diagram of the optimization strategy to find the parameters of the training model is depicted in Fig. 4.