### The cascade-forest model

The cascade-forest is an ensemble learning approach built from basic learners. To obtain an ensemble with excellent performance, each individual learner (also called a basic learner) should be "good and different." According to the error-ambiguity decomposition:

$$ \text{Err}=\overline{\text{Err}}-I, $$

(1)

where Err denotes the ensemble error, \(\overline{\text{Err}}\) denotes the mean error of the individuals, and *I* denotes the mean diversity of the individuals. Zhou and Feng [27] opened a door towards an alternative to deep neural networks (DNNs). The cascade-forest is one of the major parts of [27], and it provides a representational power comparable to that of DNNs. Layer-by-layer processing, feature transformation, and sufficient model complexity are the three most critical ideas behind the cascade-forest model, as shown in Fig. 2. Suppose there are two classes to be predicted and four different ensemble algorithms; in Fig. 2, black and blue denote random forests and completely random forests, respectively. Let \(F_{IN}\in \mathbb{R}^{m\times 1}\) be the input feature vector, where *m* is the dimension of the input features. The features produced by the first cascade layer are concatenated with *F*_{IN} to form \(F_{1}\in \mathbb{R}^{(d+m)\times 1}\):

$$ F_{1}=H_{\text{CASF}_{1}}(F_{IN})\oplus F_{IN}, $$

(2)

where \(H_{\text{CASF}_{1}}(\cdot)\) denotes the first cascade operation, ⊕ denotes the concatenation operation, and *d* represents the dimension of the features output by a cascade layer. \(F_{1}\in \mathbb{R}^{(d+m)\times 1}\) is then used as the input to the second cascade layer:

$$ F_{2}=H_{\text{CASF}_{2}}(F_{1}) \oplus F_{IN}, $$

(3)

where \(H_{\text{CASF}_{2}}(\cdot)\) denotes the second cascade operation. Supposing there are *N* layers, the output features *F*_{N} are obtained by:

$$ \begin{aligned} F_{N}&=H_{\text{CASF}_{N}}(F_{N-1})\\ &=H_{\text{CASF}_{N}}(H_{\text{CASF}_{N-1}}(\cdots H_{\text{CASF}_{1}}(F_{IN})\oplus F_{IN}\cdots)\oplus F_{IN}), \end{aligned} $$

(4)

where \(H_{\text{CASF}_{N}}(\cdot)\) denotes the *N*-th cascade operation. Finally, the prediction is obtained by

$$ \text{Prediction}=\text{Max}(\text{Ave}(F_{N})), $$

(5)

where Ave(·) denotes the averaging operation and Max(·) denotes the maximum operation, i.e., the class with the largest averaged value is selected. The final prediction is therefore either one or zero.
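As a rough illustration of Eqs. (2)–(5), the forward pass can be sketched in plain Python. The four per-layer forests are stubbed as toy probability scorers (`softmax2`, `make_layer`, and the weights below are hypothetical stand-ins, not part of the original method); in practice each would be a trained random forest, completely random forest, XGBoost model, or logistic regression.

```python
import math

def softmax2(a, b):
    """Toy 2-class distribution from two scores (stand-in for one forest)."""
    ea, eb = math.exp(a), math.exp(b)
    return [ea / (ea + eb), eb / (ea + eb)]

def make_layer(weights):
    """One cascade layer: four stub learners, each emitting two class probabilities."""
    def layer(features):
        out = []
        for w in weights:                      # one scalar weight per stub learner
            s = sum(f * w for f in features)
            out.extend(softmax2(s, -s))
        return out                             # d = 4 learners x 2 classes = 8 values
    return layer

def cascade_predict(f_in, layers):
    """Eqs. (2)-(5): concatenate each layer's output with F_IN, then average and arg-max."""
    f = f_in
    for layer in layers[:-1]:
        f = layer(f) + f_in                    # F_k = H_CASF_k(F_{k-1}) concat F_IN
    f_n = layers[-1](f)                        # Eq. (4): output of the last layer
    avg = [sum(f_n[c::2]) / 4 for c in range(2)]   # Ave(F_N) over the four learners
    return max(range(2), key=lambda c: avg[c])     # Max(.): predicted class, 0 or 1
```

Here every layer happens to share the same stub weights; a real cascade-forest trains each layer's forests on the augmented features and decides when to stop adding layers via validation performance.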

With its cascade structure, the cascade-forest processes data layer by layer, which allows it to perform representation learning. Second, the cascade-forest autonomously controls the number of cascade layers, so the model can adjust its complexity to the amount of data; even with small data, the cascade-forest model performs well. More importantly, by concatenating features, the cascade-forest transforms the features while retaining the original features for further processing. In a nutshell, the model can be regarded as an "ensemble of ensembles."

As shown in Fig. 2, each level of the model is an ensemble of distinctive classification algorithms. In this paper, we apply four different classification algorithms. Each algorithm generates an estimate of the class distribution: each base classifier computes the proportion of the different classes among the training samples it predicts, and the class vector is then obtained by averaging over all base classifiers of the same classification algorithm. Extreme gradient boosting (XGBoost) [28] is an ensemble of classification and regression trees (CART); it is based on boosting and makes joint decisions with multiple associated decision trees. Boosting trains the base learners by re-weighting and re-sampling. XGBoost does not optimize the entire model directly; it optimizes the model in stages: the first tree is optimized, then the second tree, and so on until the last tree.
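The stage-wise idea can be illustrated with a minimal boosting loop in plain Python. This is a generic squared-error boosting sketch with depth-1 stumps, not XGBoost's actual regularized objective or tree learner; `fit_stump` and `boost` are hypothetical helper names.

```python
def fit_stump(xs, residuals):
    """Best 1-D threshold stump for fitting residuals under squared error."""
    best = None
    for thr in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lv = sum(left) / len(left) if left else 0.0
        rv = sum(right) / len(right) if right else 0.0
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lv, rv)
    return best[1], best[2], best[3]

def boost(xs, ys, n_trees, lr):
    """Stage-wise fitting: each new stump fits the residuals left by the previous ones."""
    stumps, pred = [], [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, pred)]
        thr, lv, rv = fit_stump(xs, residuals)   # optimize only the current tree
        stumps.append((thr, lv, rv))
        pred = [p + lr * (lv if x <= thr else rv) for x, p in zip(xs, pred)]
    return stumps, pred
```

Each iteration leaves all earlier stumps fixed and optimizes only the newest one, which is the essence of the stage-wise strategy described above.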

In addition, we utilize a completely random forest and a random forest [27]. A random forest randomly selects *n* features from the input features as candidates and then selects the best one for splitting by computing the *Gini* value. In contrast, a completely random forest randomly selects only one input feature for splitting.
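The difference between the two split rules can be sketched as follows. This is a toy single-split illustration with hypothetical helper names; real forests recurse, bag the training data, and enumerate thresholds more carefully.

```python
import random

def gini(labels):
    """Gini impurity of a set of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def split_gini(xs, labels, feat, thr):
    """Weighted Gini impurity after splitting on feature `feat` at threshold `thr`."""
    left = [l for x, l in zip(xs, labels) if x[feat] <= thr]
    right = [l for x, l in zip(xs, labels) if x[feat] > thr]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def choose_split(xs, labels, n_candidates, completely_random=False):
    """Random forest: best Gini split among n candidate features.
    Completely random forest: one random feature with a random threshold."""
    dims = list(range(len(xs[0])))
    if completely_random:
        feat = random.choice(dims)                 # no impurity computation at all
        thr = random.choice([x[feat] for x in xs])
        return feat, thr
    best = None
    for feat in random.sample(dims, n_candidates):  # n candidate features
        for thr in sorted({x[feat] for x in xs}):
            score = split_gini(xs, labels, feat, thr)
            if best is None or score < best[0]:
                best = (score, feat, thr)
    return best[1], best[2]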

Furthermore, in the classification task there are a negative class (zero) and a positive class (one). Logistic regression is a typical two-class classification model, and we employ it to increase the diversity of the ensemble. The objective function of logistic regression is:

$$ \begin{aligned} \text{Loss}(\Theta)=&-\frac{1}{k}\sum_{i=1}^{k}\left(q^{(i)} \log h_{\Theta}(x^{(i)}) + (1-q^{(i)})\log(1-h_{\Theta}(x^{(i)}))\right)\\ &+ \frac{\lambda}{2k}\sum_{j=1}^{m}\Theta_{j}^{2}. \end{aligned} $$

(6)

In Eq. (6),

$$ h_{\Theta}(x^{(i)})=\frac{1}{1+e^{-\Theta^{T}x^{(i)}}}, $$

(7)

where *k* is the number of input samples, \(q^{(i)}\in\{0,1\}\) denotes the label of sample *i*, *m* is the dimension of the input features, *h*_{Θ}(*x*^{(i)}) is the sigmoid function, \(\frac{\lambda}{2k}\sum_{j=1}^{m}\Theta_{j}^{2}\) is the regularization term of the loss function, *λ* is a hyper-parameter, *x*^{(i)} is the input feature vector, and \(\Theta\in\mathbb{R}^{m\times 1}\) is the vector of model parameters to be optimized.
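Eqs. (6)–(7) translate directly into code. The sketch below only evaluates the regularized loss for given parameters (pure Python, no optimizer); the function names are illustrative.

```python
import math

def sigmoid(z):
    """Eq. (7): the sigmoid function."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(theta, X, q, lam):
    """Eq. (6): L2-regularized logistic loss.
    theta and every row of X are length-m lists; q holds 0/1 labels."""
    k = len(X)
    data = 0.0
    for x, label in zip(X, q):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))  # h_Theta(x^(i))
        data += label * math.log(h) + (1 - label) * math.log(1 - h)
    reg = lam / (2 * k) * sum(t * t for t in theta)          # lambda/(2k) * ||Theta||^2
    return -data / k + reg
```

For example, with all-zero parameters every sample gets probability 0.5, so the unregularized loss is log 2, which is a quick sanity check when implementing the model.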

In summary, the cascade-forest combines four different types of algorithms to enhance the diversity discussed above, and combining four distinctive algorithms achieves excellent performance. The outstanding classification performance of the cascade-forest has been confirmed in [27].

### Guided filter

Due to the neighborhood processing in spatial focus measures, the boundaries between the focused and non-focused regions are usually inaccurate. In the spatial domain in particular, this problem results in undesirable artifacts around the transition boundary. Similar to [25] and [17], we make use of the GF [29, 30] to refine the initial decision map. The GF has an excellent edge-preserving property and can be expressed as follows:

$$ Q_{i}=a_{k}I_{i}+b_{k},\quad \forall i \in w_{k}, $$

(8)

where *Q* is the output image, *I* is the guidance image, *a*_{k} and *b*_{k} are the constant coefficients of the linear function when the window is centered at *k*, and *w*_{k} is a local window of size (2*w*+1)×(2*w*+1). Supposing that *P* is the input image before filtering, then *Q*_{i}=*P*_{i}−*N*_{i}, where *N*_{i} represents the noise. The filtering result is obtained by minimizing the following cost:

$$ E(a_{k},b_{k})=\sum_{i\in w_{k}}\left((a_{k}I_{i}+b_{k}-P_{i})^{2}+\varepsilon a_{k}^{2}\right). $$

(9)

Then, results can be expressed as:

$$ a_{k}=\frac{\frac{1}{\left | w \right |}\sum_{i\in w_{k}}I_{i}P_{i}-\mu_{k} \bar{P_{k}}}{\sigma_{k}^{2}+\varepsilon} $$

(10)

and

$$ b_{k}=\bar{P_{k}}-a_{k}\mu_{k}. $$

(11)

In this expression, *μ*_{k} and \(\sigma_{k}^{2}\) are the mean and variance of the guidance image *I* in the local window *w*_{k}, respectively, \(\bar{P_{k}}\) is the mean of *P* in the local window, and |*w*| is the number of pixels in *w*_{k}. The initial decision map *I* is used as the guidance image for filtering, yielding the final decision map.
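A naive version of Eqs. (8)–(11) can be written directly. For brevity the sketch below is one-dimensional and unoptimized (the real GF uses 2-D box filters and runs in O(N) time [29, 30]); the function name is illustrative.

```python
def guided_filter_1d(I, P, w, eps):
    """Naive 1-D guided filter following Eqs. (9)-(11):
    I is the guidance signal, P the input, w the half-window size, eps the regularizer."""
    n = len(I)
    a, b = [0.0] * n, [0.0] * n
    for k in range(n):                                     # per-window coefficients
        idx = range(max(0, k - w), min(n, k + w + 1))
        m = len(idx)
        mu = sum(I[i] for i in idx) / m                    # mean of I in the window
        var = sum(I[i] ** 2 for i in idx) / m - mu ** 2    # variance of I
        p_bar = sum(P[i] for i in idx) / m                 # mean of P
        cross = sum(I[i] * P[i] for i in idx) / m
        a[k] = (cross - mu * p_bar) / (var + eps)          # Eq. (10)
        b[k] = p_bar - a[k] * mu                           # Eq. (11)
    Q = []
    for i in range(n):                                     # average over covering windows,
        ks = range(max(0, i - w), min(n, i + w + 1))       # then apply Eq. (8)
        a_bar = sum(a[k] for k in ks) / len(ks)
        b_bar = sum(b[k] for k in ks) / len(ks)
        Q.append(a_bar * I[i] + b_bar)
    return Q
```

A quick sanity check: when *P* equals *I* and ε = 0, every window gives *a*_{k} = 1 and *b*_{k} = 0, so the filter returns its input unchanged.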

### Cascade-forest for image fusion

Multi-focus image fusion synthesizes source images of the same scene captured with different focal settings. We therefore regard the source images as consisting of many different image patches. To obtain a high-quality fused result, each patch of the source images must be carefully examined: determining whether each patch is clear or blurred yields a focus map, and this determination can be treated as a classification problem. As [25] elaborated, feature extraction corresponds to the activity-level measurement, while classification plays the role of the fusion rule. The classification task produces a focus map, which is crucial for the subsequent image fusion [21]. Since "clear" and "blurred" are relative notions, the source images are decomposed into patches of a specific size, and four features that represent clarity are extracted from each patch; these features are discussed in the next section. They effectively distinguish clear patches from blurred ones, which helps train the model. We obtain the final prediction through the layer-by-layer processing of the cascade-forest, which enhances representation learning. For the final prediction, a class-label vector, accuracy is extremely critical. More importantly, the cascade-forest acquires more accurate label vectors than other traditional methods, which makes it more competitive, and the cascade-forest-based method can generate higher-quality fused images.
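The patch-wise decision described above can be sketched as follows. The single variance feature and the threshold classifier are hypothetical stand-ins for the four clarity features and the trained cascade-forest; images are plain lists of rows, and the image size is assumed divisible by the patch size.

```python
def patch_variance(img, r, c, size):
    """Variance of a size x size patch with top-left corner (r, c) - a toy clarity feature."""
    vals = [img[r + i][c + j] for i in range(size) for j in range(size)]
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

def focus_map(img_a, img_b, size, classify):
    """Per-patch decision map: classify(features) -> 1 means the patch of img_a is focused.
    Assumes both images have the same shape, divisible by `size`."""
    h, w = len(img_a), len(img_a[0])
    decision = []
    for r in range(0, h, size):
        row = []
        for c in range(0, w, size):
            fa = patch_variance(img_a, r, c, size)   # clarity feature from source A
            fb = patch_variance(img_b, r, c, size)   # clarity feature from source B
            row.append(classify([fa, fb]))           # stand-in for the cascade-forest
        decision.append(row)
    return decision
```

In the full method, this raw decision map would then be refined with the guided filter before the focused patches are composed into the fused image.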