
SIP-FS: a novel feature selection for data representation


Multiple features are widely used to characterize real-world datasets. It is desirable to select leading features with stability and interpretability from a set of distinct features for a comprehensive data description. However, most existing feature selection methods focus on the predictability (e.g., prediction accuracy) of the selected results and neglect stability. To obtain a compact data representation, a novel feature selection method (SIP-FS) is proposed to improve stability and interpretability without sacrificing predictability. Instead of mutual information, generalized correlation is adopted in minimal redundancy maximal relevance to measure the relation between different feature types. Several feature types (each containing a certain number of features) can then be selected and evaluated quantitatively to determine which types contribute to a specific class, thereby enhancing the interpretability of the features. Moreover, stability is introduced into the criterion of SIP-FS to obtain consistent ranking results. We conduct experiments on three publicly available datasets using a one-versus-all strategy to select class-specific features. The experiments illustrate that SIP-FS achieves significant improvements in stability and interpretability with desirable prediction accuracy and shows advantages over several state-of-the-art approaches.

1 Introduction

Nowadays, massive amounts of image data are available in our daily life, including web images and remote sensing images. Numerous features have been proposed to characterize an image, such as global features (color, GIST, shape, and texture) and local features (shape context and histograms of oriented gradients). For texture alone, up to 30 feature types exist, such as local binary pattern (LBP) [1] and Gabor textures [2]. For color, there are also several types, such as color histogram and color correlogram. Generally, images are described by multiple features that are complementary to each other; thus, selecting an effective feature subset from a set of distinct features is a great challenge for data representation [3].

To handle this challenge, feature selection [4–8] and subspace learning [9, 10] have been developed to obtain suitable feature representations. Feature selection is commonly used as a preprocessing step for classification, so most feature selection algorithms are designed only for better predictability, such as high prediction accuracy. Although many feature selection methods take both feature relevance and redundancy into account for predictability [11], they neglect stability [12]. If a feature selection method has poor stability, the selected feature subsets change significantly when the training data vary. Therefore, using only predictability to evaluate feature selection methods may lead to inconsistent ranking results for data representation.

On the other hand, each feature type describes an image from a single cue and has its own specific properties and domain-specific meaning. Unlike a scalar feature, feature types, which can be scalars, vectors, or matrices, are highly diverse in dimension and expression. However, existing methods simply ensemble the selection of each feature type [13] or concatenate all feature types into a single vector [14]. These methods ignore the relation between different feature types. Moreover, they often select a common feature subset for all classes, although this subset might not be optimal for each class. Following ref. [14], a one-versus-all strategy is employed to select class-specific features. Feature selection selects a subset of the original features rather than learning a low-dimensional subspace, thereby maintaining the physical meaning of the features, which is beneficial for understanding the data [4]. Therefore, how to select a set of feature types and evaluate the contribution of these types to a specific class is critical for enhancing the interpretability of features.

To address the above-mentioned issues, a novel feature selection method is proposed to improve stability and interpretability without sacrificing predictability, which is the so-called SIP-FS. The main contributions of this paper are as follows. First, generalized correlation rather than mutual information is employed in minimal redundancy maximal relevance to determine what feature types contribute to a specific class, thereby enhancing the interpretability of features. Second, stability constraint is adopted in SIP-FS to select consistent results of ranking in the case of data variation.

The remainder of this paper is organized as follows. Section 2 presents the related work on feature selection, including predictability, interpretability, and stability. Section 3 illustrates the proposed methodology and other feature selection methods using different criteria based on predictability, stability, and interpretability. SIP-FS is presented in Section 4. Section 5 discusses the effects of parameters and performance comparisons of different methods. Finally, Section 6 concludes this paper.

2 Related work of feature selection

2.1 Predictability

As an important technique for handling high-dimensional data, feature selection plays an important role in pattern recognition and machine learning. It can be divided into four categories: filter, wrapper, embedded, and hybrid methods [4]. In this study, we focus on filter methods based on different evaluation measures, such as distance criterion (Relief and its variants ReliefF and IRelief [15]), separability criterion (Fisher Score [16]), correlation coefficient [17], consistency [18], and mutual information [11]. More details can be found in ref. [19]. In general, the one-versus-all strategy is increasingly used in feature selection methods to select class-specific features for a certain class rather than a common feature subset for all classes [14].

2.2 Interpretability

Most existing feature selection methods focus on predictability (e.g., prediction accuracy) without considering the correlation between different feature types, weakening the interpretability of selected results. However, different feature types exhibit various information, including statistical characteristics and domain-specific meanings. Given a set of distinct feature types, it remains unclear what feature types contribute to a specific class.

Haur et al. analyze the influence of feature selection methods on the functional interpretability of the signatures [20]. Li et al. utilize association rule mining algorithms to improve the interpretability of the selected result without degrading prediction accuracy [21]. However, these feature selection methods give little consideration to the correlation between two feature types. For different feature types, learning a shared subspace for all classes is a popular strategy to reduce the dimensionality. Although subspace-based methods are suitable for high-dimensional data, they learn a linear or non-linear embedding transformation rather than selecting relevant and significant features from the original feature types.

Thus, feature selection is increasingly applied to obtain compact data representation. For example, Wang et al. [22] and Somol et al. [23] proposed to select the most discriminative feature types based on the relationships between different feature types. However, both methods are sparse feature selections rather than filter methods.

2.3 Stability

Feature selection methods can produce inconsistent results with similar prediction accuracies when the data vary. However, a good feature selection method should be robust to data variation. Therefore, it is necessary to develop a stability measure for the results of different feature selection methods. To evaluate stability, numerous stability measures have been proposed. For example, Somol et al. [24] proposed a series of stability measures, such as feature-focused versus subset-focused measures, selection-registering versus selection-exclusion-registering measures, and subset-size-biased versus subset-size-unbiased measures. At present, a wide variety of stability measures based on physical properties are defined for the comparison of feature subsets, including Hamming distance [25], Tanimoto distance [26], Average Tanimoto index [27], Ochiai coefficient [28], and other stability measures for subsets with different sizes [24]. For example, Spearman’s correlation [26] is used to measure the stability of two weighting vectors, where the top-ranked features are assigned higher weights.

Many factors greatly affect the stability of feature selection, such as the number of samples and the criteria and complexity of feature selection. Although stability measures are widely used for evaluating the selected results, they are seldom incorporated into feature selection methods. To improve stability, numerous stable feature selection methods have been developed to deal with different sources of instability. These methods can be divided into four categories: (1) ensemble methods [29–31], (2) sample weighting [32], (3) feature grouping [33], and (4) sample injection [34]. In general, ensemble feature selection is the most popular of these. An ensemble feature selection method consists of two steps: (1) creating a set of component feature selectors and (2) aggregating the results of the component feature selectors into an ensemble output.

However, ensemble feature selection methods combine the selected results according to prediction accuracy, which may result in an imbalance between stability and predictability. By contrast, the proposed SIP-FS adopts a stability measure as an additional constraint in the selection criterion to balance predictability and stability. To the best of our knowledge, stability and interpretability are seldom explored simultaneously in existing feature selection methods.

3 Methodology

This section presents feature selections and their corresponding results using different criteria based on predictability, stability, and interpretability, as shown in Fig. 1. Suppose a feature set F of m features is extracted using l different feature types for each image, denoted by \(F=\left [f_{1},f_{2},\ldots,f_{m}\right ]\). If a given feature type \(G^{(i)}\) has \(m_{i}\) dimensions, denoted by \(G^{(i)} = \left [f_{1}^{(i)}, f_{2}^{(i)},\ldots,f_{m_{i}}^{(i)} \right ], \sum _{i=1}^{l}{m_{i}}=m\), then F can be written as \(F^{G}=\left [ G^{(1)},G^{(2)},\ldots,G^{(l)}\right ]=\left [ f_{1}^{(1)},f_{2}^{(1)},\ldots,f_{m_{1}}^{(1)},f_{1}^{(2)},f_{2}^{(2)},\ldots,f_{m_{2}}^{(2)},\ldots,f_{1}^{(l)},f_{2}^{(l)},\ldots,f_{m_{l}}^{(l)}\right ]\). As shown in Fig. 1a, \(G^{(i)}\) represents the i-th feature type with a specific color (green, yellow, red, etc.); moreover, \(G^{(i)}\) has its own specific property and dimensionality.

Fig. 1 Feature selection criteria based on predictability, stability, and interpretability. a l different types of feature, corresponding to l different colors. b–d Three criteria with different combinations. e SIP-FS

For predictability, numerous filter models have been developed in feature selection. For example, Min-Redundancy and Max-Relevance (mRMR) [11], as a popular filter model, adopts the following criterion:

$$ f_{opt}=\arg\max(D-R) $$

where \(f_{opt}\) denotes the optimal selected feature, and D and R represent feature-class relevance and feature-feature redundancy, respectively. In particular, D and R are computed by:

$$ \max D(F,c), D =\frac{1}{|F|}\sum_{f_{i}\in F}I(f_{i};c) $$
$$ \min R(F), R = \frac{1}{|F|^{2}}\sum_{f_{i},f_{j}\in F}I\left(f_{i};f_{j}\right) $$

where |F| represents the dimensionality of the feature set, \(I(f_{i};c)\) represents the mutual information between an individual feature \(f_{i}\) in feature set F and class c, and \(I(f_{i};f_{j})\) represents the mutual information between two individual features \(f_{i}\) and \(f_{j}\) in feature set F. From Eqs. (2) and (3), D and R in (1) are computed as the mean values of all feature-class relevance and feature-feature redundancy in the feature set F, respectively. In practice, the selection of the feature set can be achieved by near-optimal incremental search methods:

$$ \bar{f_{m}}=\arg\max_{f_{i}\in F-F^{\prime}}\left[I\left(f_{i},c\right)-\frac{1}{m-1}\sum_{f_{j}\in F^{\prime}}I\left(f_{i},f_{j}\right)\right] $$

where \(F^{\prime}\) represents the (m−1)-dimensional feature subset that has already been selected from F. Equation (4) aims to select the m-th feature from the candidate feature subset \(F-F^{\prime}\) and implements a trade-off between high class relevance and low feature redundancy. As shown in Fig. 1b, the features selected from the same feature type are scattered in terms of ranking, which hinders the quantitative evaluation of multiple features and results in a lack of interpretability. In addition, the selected results may change greatly due to data fluctuation.
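The incremental search of Eq. (4) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: it assumes discrete-valued features so that mutual information can be estimated from joint counts, and the function names `mutual_information` and `mrmr_select` are hypothetical.

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete 1-D arrays,
    estimated from their joint frequency table."""
    x_vals, x_idx = np.unique(np.ravel(x), return_inverse=True)
    y_vals, y_idx = np.unique(np.ravel(y), return_inverse=True)
    joint = np.zeros((x_vals.size, y_vals.size))
    np.add.at(joint, (x_idx, y_idx), 1.0)      # joint counts
    joint /= joint.sum()                       # joint probabilities
    px = joint.sum(axis=1, keepdims=True)      # marginal of x
    py = joint.sum(axis=0, keepdims=True)      # marginal of y
    nz = joint > 0                             # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def mrmr_select(X, c, n_select):
    """Greedy search in the spirit of Eq. (4): at each step, add the
    candidate with the largest relevance minus mean redundancy."""
    n_features = X.shape[1]
    relevance = np.array([mutual_information(X[:, i], c) for i in range(n_features)])
    selected = [int(np.argmax(relevance))]     # first pick: most relevant feature
    while len(selected) < n_select:
        best_score, best_i = -np.inf, None
        for i in range(n_features):
            if i in selected:
                continue
            redundancy = np.mean([mutual_information(X[:, i], X[:, j]) for j in selected])
            score = relevance[i] - redundancy
            if score > best_score:
                best_score, best_i = score, i
        selected.append(best_i)
    return selected
```

With two identical features and one complementary feature, the redundancy term keeps the search from selecting both copies, which is the trade-off Eq. (4) is meant to implement.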

In addition to predictability, stability is another important measure in feature selection. Various stability evaluation indexes are only used to evaluate feature selection method rather than improve the stability of the method itself [24]. To the best of our knowledge, stability is seldom considered in feature selection criteria. Therefore, stability constraint is employed in this study to obtain robust selection results:

$$ f_{opt} = \arg\max(D-R+k{\times}S) $$

where S represents an existing stability evaluation index and k is a parameter that balances the prediction factor (D−R) and the stability factor S. The stability evaluation index can then be computed by:

$$ S(f,F)=\frac{1}{i-1}\sum_{j=1}^{i-1}S\left(F_{f},F_{j}\right) $$
$$ S\left(F_{f},F_{j}\right)=\frac{|F_{f}\cap F_{j}|}{|F_{f}\cup F_{j}|} $$

where \(F_{f}\) is the union of the already selected features and the candidate feature f in the current selection, \(F_{j}\) (j=1,2,…,i−1) represents a previously selected feature subset, and \(|F_{f}\cap F_{j}|\) and \(|F_{f}\cup F_{j}|\) represent the sizes of the intersection and union of feature sets \(F_{f}\) and \(F_{j}\), respectively. Unlike Eq. (1), both predictability and stability are used in the feature selection criterion of Eq. (5). As shown in Fig. 1c, the stability constraint helps obtain consistent ranking results.
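As a small worked example of Eqs. (6) and (7), the following Python sketch computes the stability term for a candidate feature. The function names are hypothetical, and returning 0 when no earlier subsets exist is an assumption made here for illustration; the text does not specify that case.

```python
def jaccard(a, b):
    """Eq. (7): size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def stability_term(candidate, selected_so_far, previous_subsets):
    """Eq. (6): mean Jaccard index between F_f (the current selection plus
    the candidate) and each previously selected subset F_j."""
    f_f = set(selected_so_far) | {candidate}
    if not previous_subsets:
        return 0.0  # assumption: no history yet, so no stability information
    return sum(jaccard(f_f, f_j) for f_j in previous_subsets) / len(previous_subsets)
```

For example, with earlier subsets {1, 2, 3} and {2, 3, 4} and a current selection {2}, candidate 3 gives a stability term of 2/3, whereas an unseen candidate 9 gives 1/4, so the criterion of Eq. (5) favors the candidate that earlier runs also selected.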

Similar to predictability and stability, interpretability is essential for feature selection [20]. However, mutual information fails to measure the correlation between different types of features, because multivariate densities are hard to estimate accurately. Consequently, both Eqs. (1) and (5) fail to select interpretable results. Instead of mutual information, the generalized correlation coefficient (GCC) is adopted to measure D and R in Eqs. (1) to (5) while preserving predictability. Given v−1 types of feature \(\bar {F}_{v-1}^G=\bar {G}^{(1)} \cup \bar {G}^{(2)}\cup \ldots \cup \bar {G}^{(v-1)}\) selected from the entire feature set of l types \({F}_{v-1}^G={G}^{(1)}\cup {G}^{(2)}\cup \ldots \cup {G}^{(v-1)}\), where \(\bar {G}^{(x)}\) denotes the x-th selected feature type (x=1,2,…,v−1), selecting the v-th type \(\bar {G}^{(v)}\) from the set \(\left \{F^{G} - F_{v-1}^{G}\right \}\) is based on the following condition:

$$ {\begin{aligned} \bar{G}^{(v)}\,=\,\arg\max_{G^{(j)}\in{F^{G}-F^{G'}_{v-1}}}\left[\rho\left(G^{(j)},c\right) \,-\,\frac{1}{v-1}\sum_{\bar{G}^{(i)}\in \bar{F}^{G}_{v-1}}\rho\left(G^{(j)}, \bar{G}^{(i)}\right)+k{\times}S\right] \end{aligned}} $$

where ρ represents the generalized correlation coefficient between different feature types, \(\bar {G}^{(i)}\) denotes the i-th selected feature type, and \(G^{(j)}\) denotes a certain feature type from the candidate feature set \(F^G-F^{G^{\prime }}_{v-1}\). The generalized correlation coefficient reduces to Pearson’s correlation coefficient when the dimensionality of both \(\bar {G}^{(i)}\) and \(G^{(j)}\) is 1.

In the case of only using GCC in Eq. (8) when k=0, the corresponding feature selection takes predictability and interpretability into account, as shown in Fig. 1d. The selected features of the same feature type are close to each other while the corresponding ranking may greatly change due to data fluctuation. If k≠0 in Eq. (8), it means that the feature selection simultaneously takes predictability, stability, and interpretability into account, which is the so-called SIP-FS method in this paper, as shown in Fig. 1e. From an interpretative point of view, features selected by SIP-FS method are meaningful class-specific features [35] with the use of one-versus-all strategy.
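The type-level selection of Eq. (8) with k=0 can be sketched as follows. The text does not give the formula of the generalized correlation coefficient, so this sketch swaps in the largest canonical correlation as a stand-in for ρ; it is a different, well-known measure that likewise reduces to the absolute Pearson correlation when both arguments are one-dimensional. The names `rho` and `select_feature_types` are hypothetical.

```python
import numpy as np

def _orthonormal_basis(M):
    """Orthonormal basis of the centered column space of M (samples in rows)."""
    M = np.asarray(M, dtype=float).reshape(len(M), -1)
    M = M - M.mean(axis=0)
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    if s[0] <= 1e-12:                  # constant columns carry no correlation
        return U[:, :0]
    return U[:, s > 1e-10 * s[0]]      # drop numerically null directions

def rho(A, B):
    """Stand-in for GCC (an assumption): largest canonical correlation."""
    Ua, Ub = _orthonormal_basis(A), _orthonormal_basis(B)
    if Ua.shape[1] == 0 or Ub.shape[1] == 0:
        return 0.0
    top = np.linalg.svd(Ua.T @ Ub, compute_uv=False)[0]
    return float(min(top, 1.0))

def select_feature_types(types, c, n_types):
    """Greedy selection of whole feature types: relevance to the class
    minus mean redundancy with the types already chosen (Eq. (8), k = 0)."""
    remaining, chosen = list(types), []
    while remaining and len(chosen) < n_types:
        def score(name):
            relevance = rho(types[name], c)
            redundancy = (np.mean([rho(types[name], types[g]) for g in chosen])
                          if chosen else 0.0)
            return relevance - redundancy
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Here `types` maps a type name to an array with one row per sample, and `c` is the binary one-versus-all label vector. Because whole types are scored, the features of a selected type stay together in the ranking, as in Fig. 1d.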

4 SIP-FS algorithm

SIP-FS aims to efficiently select a reasonable and compact feature subset for data representation; the selected result should therefore be meaningful, insensitive to data fluctuation, and accurate in prediction.

SIP-FS iterates until the selection becomes stable and takes the feature subset obtained at the last iteration as the final result. For the i-th iteration, k=λ1i and the stability \(S_{i}\) is computed as the mean of all stabilities between \(F_{i}\) and \(F_{j}\) (j=1,2,…,i−1), where \(F_{i}\) and \(F_{j}\) represent the i-th and the j-th selected feature subsets, respectively.

$$ S_{i} = \frac{1}{i-1}\sum_{j=1}^{i-1}S\left(F_{i},F_{j}\right) $$

where \(S\left (F_{i},F_{j}\right)=\frac {|F_{i}\cap F_j|}{|F_{i}\cup F_j|}\). The iteration stops when the following condition is satisfied:

$$ |S_{i}-S_{i-1}| \to 0 $$
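The stopping rule of Eqs. (9) and (10) can be sketched as a loop over random subsamples. Several pieces are assumptions made for illustration: `select` is a hypothetical callback standing in for one pass of the selection criterion, `k_of` is a user-supplied schedule for k, `subsample_ratio` plays the role of λ2 (the proportion of samples drawn in each iteration), and the tolerance `tol` replaces the limit in Eq. (10).

```python
import random

def jaccard(a, b):
    """Jaccard index between two feature subsets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def iterate_until_stable(select, n_samples, subsample_ratio, k_of,
                         tol=1e-3, max_iter=50, seed=0):
    """Repeat selection on random subsamples until |S_i - S_{i-1}| < tol.

    select(indices, k, history) must return the selected feature subset
    for the given sample indices (hypothetical interface)."""
    rng = random.Random(seed)
    history, s_prev = [], None
    for i in range(1, max_iter + 1):
        size = max(1, int(round(subsample_ratio * n_samples)))
        indices = rng.sample(range(n_samples), size)
        subset = frozenset(select(indices, k_of(i), history))
        if history:
            # Eq. (9): mean stability between F_i and all earlier subsets F_j
            s_i = sum(jaccard(subset, f_j) for f_j in history) / len(history)
            if s_prev is not None and abs(s_i - s_prev) < tol:
                return subset, s_i, i          # Eq. (10) satisfied
            s_prev = s_i
        history.append(subset)
    return history[-1], s_prev, max_iter       # budget exhausted
```

A selector that always returns the same subset stops at the third iteration, the first point at which two consecutive stability values can be compared.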

Each iteration consists of two parts: (1) selecting feature types, corresponding to steps 3 to 6 in Algorithm 1, and (2) removing the redundancy from the selected feature types, corresponding to steps 7 to 12 in Algorithm 1. In the first part, feature types are selected based on Eq. (8) until other feature types cannot provide additional information, as in Eq. (11).

$$ {\begin{aligned} \left| \left(D\left(\bar{F}^{G}_{v+1},c\right) \,-\,R\left(\bar{F}^{G}_{v+1}\right)\right)\,-\,\left(D\left(\bar{F}^{G}_{v},c\right) -R\left(\bar{F}^{G}_{v}\right)\right)\right| \to 0 \end{aligned}} $$

The first part could obtain the ranking of feature type; however, in each selected feature type, there may exist redundancy. Therefore, in the second part, the redundancy of each feature type is further removed by selecting a subset. Given that m−1 features are selected from the v-th feature type, the selection of the m-th feature \(\bar {f}_{m}^{(v)}\) is described as follows.

$$ {\begin{aligned} \bar{f}_{m}^{(v)} = \arg\max\left[\rho\left(G^{(v)}_{m},c\right) -\frac{1}{v-1}\sum_{\bar{G}^{(i)}\in \bar{F}^{G}_{v-1}}\rho\left(G^{(v)}_{m}, \bar{G}^{(i)}\right)+k{\times}S\right] \end{aligned}} $$

where \(G_{m}^{(v)} = \bar {G}_{m-1}^{(v)}\cup f_{m}^{(v)} = \bar {f}_{1}^{(v)} \cup \bar {f}_{2}^{(v)} \cup \ldots \cup \bar {f}_{m-1}^{(v)}\cup {f}_{m}^{(v)}\), and \({f_{m}^{(v)}}\) denotes a certain feature in the candidate feature set. For the v-th feature type \(G^{(v)}\), features are added to the subset until other features cannot provide additional information, as in the following equation.

$$ {\begin{aligned} &\left| \left(D\left(\left(\hat{F}^{G}_{sel} \cup \hat{G}^{(v)}_{m+1}\right),c\right) -R\left(\left(\hat{F}^{G}_{sel} \cup \hat{G}^{(v)}_{m+1}\right)\right)\right)\right.\\& \quad-\left.\left(D\left(\left(\hat{F}^{G}_{sel} \cup \hat{G}^{(v)}_{m}\right),c\right) -R\left(\left(\hat{F}^{G}_{sel} \cup \hat{G}^{(v)}_{m}\right)\right)\right)\right| \to 0 \end{aligned}} $$

where \(\hat {F}^{G}_{sel}=\hat {G}^{(1)} \cup \hat {G}^{(2)} \cup \ldots \cup \hat {G}^{(v-1)}\) and \(\hat {G}^{(v)}_{m+1}=\hat {G}^{(v)}_{m}\cup \bar {f}_{m+1}^{(v)}\).

5 Results and discussions

In this section, extensive experiments are conducted to illustrate the effectiveness of SIP-FS in terms of predictability, stability, and interpretability. Four feature selection methods, mRMR, ReliefF, En-mRMR, and En-ReliefF, are used for performance comparisons on three publicly available datasets (two web image datasets named MIML [36] and NUS-WIDE-LITE [37], and a remote sensing image dataset named USGS21 [38]). mRMR and ReliefF are commonly used filter methods, while En-mRMR and En-ReliefF are their ensemble counterparts. The one-versus-all strategy is adopted to select class-specific features for SIP-FS as well as for the comparison methods.

For the three datasets, different types of feature are used, each normalized individually. Libsvm [39] is used for training and classification. The images in each dataset are divided into two equal parts, one for training and the other for testing. Experiments are repeated 10 times with random splits, and the average results are reported.
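The protocol above (one-versus-all labels, a random half/half split, and averaging over 10 repetitions) can be sketched as follows. The `evaluate` callback is hypothetical and stands in for feature selection followed by classifier training and testing; the function names are likewise illustrative.

```python
import random

def one_versus_all_labels(labels, target_class):
    """Relabel a multi-class problem as binary: 1 for the target class, 0 otherwise."""
    return [1 if y == target_class else 0 for y in labels]

def half_split(n_samples, rng):
    """Randomly split sample indices into two equal halves (train, test).
    With an odd number of samples, one sample is left out."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    half = n_samples // 2
    return indices[:half], indices[half:2 * half]

def repeated_average(evaluate, labels, target_class, n_repeats=10, seed=0):
    """Average a score over repeated random half/half splits.

    evaluate(train_idx, test_idx, binary_labels) must return a number,
    e.g. a prediction accuracy (hypothetical interface)."""
    rng = random.Random(seed)
    binary = one_versus_all_labels(labels, target_class)
    scores = []
    for _ in range(n_repeats):
        train_idx, test_idx = half_split(len(binary), rng)
        scores.append(evaluate(train_idx, test_idx, binary))
    return sum(scores) / len(scores)
```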

5.1 Datasets

MIML consists of five classes: desert, mountain, sea, sunset, and trees. The numbers of images in the five classes are 340, 268, 341, 261, and 378, respectively. Figure 2 shows sample images of this dataset. Eight types of feature (a total of 638 dimensions) are used in the experiments: color histogram, color moments, color coherence, textures, tamura-texture coarseness, tamura-texture directionality, edge orientation histogram, and SBN colors. The dimensions of these features are 256, 6, 128, 15, 10, 8, 80, and 135, respectively.
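As a check on the layout described above, the following sketch derives the column range of each MIML feature type in a concatenated feature vector and confirms the total of 638 dimensions. Concatenating the types in the listed order is an assumption made for illustration, and `type_slices` is a hypothetical helper.

```python
from itertools import accumulate

# Feature types and dimensions as listed for MIML
MIML_TYPES = [
    ("color histogram", 256),
    ("color moments", 6),
    ("color coherence", 128),
    ("textures", 15),
    ("tamura-texture coarseness", 10),
    ("tamura-texture directionality", 8),
    ("edge orientation histogram", 80),
    ("SBN colors", 135),
]

def type_slices(types):
    """Map each feature type to its (start, stop) column range, assuming
    the types are concatenated in the given order."""
    stops = list(accumulate(dim for _, dim in types))
    starts = [0] + stops[:-1]
    return {name: (a, b) for (name, _), a, b in zip(types, starts, stops)}

slices = type_slices(MIML_TYPES)
total_dimensions = sum(dim for _, dim in MIML_TYPES)   # 638
```

Under this assumed ordering, `slices["color moments"]` is `(256, 262)` and the last type ends at column 638.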

Fig. 2


NUS-WIDE-LITE contains images collected by the National University of Singapore. In the experiments, images with no label or with more than one label are removed, resulting in a single-label dataset that contains nine classes: birds, boats, flowers, rocks, sun, tower, toy, tree, and vehicle, as shown in Fig. 3. Five types of feature (a total of 634 dimensions) are used for experimental evaluation: color histogram, block-wise color moments, color correlogram, edge direction histogram, and wavelet texture. The dimensions of these features are 64, 225, 144, 73, and 128, respectively.

Fig. 3


USGS21 contains 21 classes: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts, as shown in Fig. 4. Each class consists of 100 images of 256×256 pixels with a spatial resolution of one foot. Five types of feature (a total of 809 dimensions), color moment, HOG, Gabor, LBP, and Gist, extracted by [40], are used for evaluation. The dimensions of these features are 81, 37, 120, 59, and 512, respectively.

Fig. 4


5.2 Effects of λ1 and λ2 on stability

In the proposed method, two parameters, λ1 and λ2, have influence on the performance of stability. λ1 determines the k value, which balances predictability and stability, while λ2 determines the proportion of subsample generation in iterative feature selection. Suitable combination of λ1 and λ2 is beneficial for obtaining consistent results.

Parameter tuning is conducted for each class individually. Figure 5 shows the influence of λ1 and λ2 on stability for three different classes, where λ1 takes the values 0.0001, 0.001, 0.01, 0.1, 1, and 10, and λ2 takes the values 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9. In general, high stability can be obtained using a moderate λ1 value (e.g., 0.001, 0.01, or 0.1) and a large λ2 (0.8 or 0.9), compared with other parameter combinations. A smaller λ1 value corresponds to better stability, yet the computational complexity increases significantly. A small λ2 may result in high fluctuation of the subsamples, leading to inconsistent selected results.

Fig. 5 Effects of λ1 and λ2 on stability for three specific classes. a “trees” in MIML. b “flowers” in NUS-WIDE-LITE. c “building” in USGS21

5.3 Stability analysis

Tables 1, 2, and 3 show the stability comparisons of five methods on the three datasets. The stabilities of each class and of the entire dataset (average) are given in these tables. The stability value ranges from 0 to 1, where 0 and 1 indicate that the rankings of the selected results are completely inconsistent and completely consistent, respectively, across randomly repeated feature selections.

Table 1 Stability comparisons on MIML
Table 2 Stability comparisons on NUS-WIDE-LITE
Table 3 Stability comparisons on USGS21

As shown in Tables 1, 2, and 3, SIP-FS significantly improves stability over the other methods for each class (except for “dense residential” and “medium residential” in Table 3) as well as for the entire dataset, indicating that SIP-FS selects considerably more stable features.

In general, mRMR combined with the ensemble strategy does not show a significant improvement in stability. Although the ensemble strategy gives ReliefF a slight stability advantage, En-ReliefF still performs worse than SIP-FS. Overall, SIP-FS performs best on the three datasets in terms of stability.

5.4 Interpretability analysis

Given a certain class, the prediction accuracy varies with feature types. How to select feature types and measure their effectiveness for a specific class is essential for interpretability analysis. In particular, the one-versus-all strategy is combined with SIP-FS to select feature types (each containing a certain number of features) for a specific class. The effectiveness of these feature types for each class is measured by the relative contribution ratio, which is normalized by the respective maximum contribution [14].

Figures 6, 7, and 8 show the selected feature types for each class with the respective relative contribution ratios. For example, the selected feature types for “mountain” in MIML are shape and color features, as shown in Fig. 6. In order of relative contribution ratio, the selected feature types are edge orientation histogram, color coherence, color histogram, and SBN colors. The most discriminative feature type is shape, and the other three are color features (color coherence, color histogram, and SBN colors). By contrast, the texture features (textures, tamura-texture coarseness, and tamura-texture directionality) and a redundant color feature (color moments) are removed. As shown in Fig. 7, color correlogram, edge direction histogram, and wavelet texture provide complementary information for describing each class in the NUS-WIDE-LITE dataset. In addition, block-wise color moments provide less information for most classes in this dataset, while color moments are useless because of information redundancy. In the USGS21 dataset, take the broad classes road (including freeway, overpass, and runway) and water (including beach and river) as two examples, as shown in Fig. 8. LBP is the most discriminative feature type for “road”, while color moment is the most discriminative feature type for “water”. Furthermore, as a subclass of water, river needs additional complementary information provided by the other four feature types (LBP, Gabor, HOG, and Gist) besides color moment. In general, SIP-FS provides a more interpretable data representation than the comparison methods.

Fig. 6 Relative contribution ratios of features for each class of MIML

Fig. 7 Relative contribution ratios of features for each class of NUS-WIDE-LITE

Fig. 8 Relative contribution ratios of features for each class of USGS21

In short, the proposed SIP-FS method provides a more interpretable means of data representation than existing feature selection methods. It makes more useful information available, deepening the understanding of the data.

5.5 Predictability analysis

Tables 4, 5, and 6 show the prediction accuracy of each class on the three datasets to evaluate predictability. The predictability value ranges from 0 to 1, where 0 and 1 represent complete misclassification and completely correct classification, respectively.

Table 4 Predictability comparisons on MIML
Table 5 Predictability comparisons on NUS-WIDE-LITE
Table 6 Predictability comparisons on USGS21

From Table 4, mRMR achieves better predictability than the other methods on four classes of MIML (mountain, sea, sunset, and trees). Although SIP-FS performs worse than mRMR in terms of average performance, it outperforms the other three methods. From Table 5, mRMR and SIP-FS perform best among all methods in terms of average performance. The comparison of mRMR and SIP-FS indicates that each method has accuracy advantages on some classes. For example, the prediction accuracies of SIP-FS on boats, rocks, sun, and vehicle are higher than those of mRMR. From Table 6, the average predictability of mRMR, En-mRMR, and SIP-FS is significantly higher than that of the others (ReliefF and En-ReliefF). It is worth noting that although En-ReliefF obtains the highest stability on “dense residential” and “medium residential” (as shown in Table 3), it has the lowest prediction accuracy (as shown in Table 6).

In general, SIP-FS and mRMR perform best among all comparison methods on the three datasets, demonstrating that SIP-FS can maintain good predictability.

To further investigate the effect of the number of selected features on predictability, Fig. 9 shows the prediction accuracy of the five feature selection methods on three different classes. In general, the prediction accuracy of the five methods tends to increase as the number of selected features increases. Desirable prediction results can be obtained by selecting the leading features, such as 20 (trees), 30 (flowers), and 20 (building) features, corresponding to Fig. 9a–c.

Fig. 9 Effect of the number of selected features on predictability performance. a “trees” in MIML. b “flowers” in NUS-WIDE-LITE. c “building” in USGS21

5.6 Trade-off between stability and predictability

In this section, the stability-predictability tradeoff (SPT) is used to provide a formal and automatic way of jointly evaluating the trade-off between stability and predictability, as in ref. [29]. SPT is defined as follows.

$$ SPT = \frac{2\times \text{stability}\times\text{predictability}}{\text{stability} + \text{predictability}} $$

where stability (Tables 1, 2, and 3) and predictability (Tables 4, 5, and 6) denote the average performance. SPT ranges from 0 to 1; the higher the SPT, the better the performance. The SPTs for the three datasets are shown in Fig. 10. Several conclusions can be drawn from Fig. 10: (1) Compared with the other methods, SIP-FS obtains a better tradeoff between stability and predictability. (2) mRMR and ReliefF combined with the ensemble strategy achieve higher SPT than their counterparts without it.
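The SPT defined above is the harmonic mean of stability and predictability, so a method must do well on both to score highly. A minimal Python sketch follows; the input values in the usage note are made up for illustration and are not taken from Tables 1 to 6, and returning 0 when both inputs are 0 is an assumption to avoid division by zero.

```python
def spt(stability, predictability):
    """Stability-predictability tradeoff: harmonic mean of the two scores."""
    total = stability + predictability
    if total == 0:
        return 0.0  # assumption: define SPT as 0 when both scores are 0
    return 2.0 * stability * predictability / total
```

For instance, a hypothetical method with stability 0.5 and predictability 1.0 has an SPT of about 0.667, lower than the 0.8 of a method scoring 0.8 on both, even though the arithmetic mean of the first pair is only slightly smaller (0.75 versus 0.8).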

Fig. 10 SPT comparisons for three datasets

6 Conclusions

In this study, a novel feature selection method called SIP-FS is proposed to address stability and interpretability simultaneously while preserving predictability. Given a set of distinct feature types, the relation between different feature types is measured by minimal redundancy maximal relevance based on generalized correlation. Several feature types can then be selected and evaluated quantitatively to determine which types contribute to a specific class. Furthermore, consistent ranking results can be achieved by incorporating stability into the criterion of SIP-FS. The experiments on three datasets, MIML, NUS-WIDE-LITE, and USGS21, demonstrate that stability and interpretability are significantly improved without sacrificing predictability, compared with other filter methods and their respective ensemble-based variants. In future work, we intend to further investigate the selection of multi-modal information using SIP-FS.


  1. T Ojala, M Pietikainen, D Harwood, in Proceedings of the 12th International Conference on Pattern Recognition. Performance evaluation of texture measures with classification based on kullback discrimination of distributions (IEEENew York, 2002), pp. 582–5851.

    Google Scholar 

  2. TS Lee, Image representation using 2d gabor wavelets. IEEE Trans. Pattern Anal. Mach. Intell.18(10), 959–71 (1996).

    Article  Google Scholar 

  3. X Jiang, J Lai, Sparse and dense hybrid representation via dictionary decomposition for face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37(5), 1067–79 (2015).

  4. H Liu, L Yu, Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. Knowl. Data Eng. 17(4), 491–502 (2005).

  5. Z Li, J Liu, Y Yang, X Zhou, H Lu, Clustering-guided sparse structural learning for unsupervised feature selection. IEEE Trans. Knowl. Data Eng. 26(9), 2138–2150 (2014).

  6. PN Belhumeur, JP Hespanha, DJ Kriegman, Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell. 19(7), 711–720 (2002).

  7. F Zamani, M Jamzad, A feature fusion based localized multiple kernel learning system for real world image classification. EURASIP J. Image Video Process. 2017(1), 78 (2017).

  8. F Poorahangaryan, H Ghassemian, A multiscale modified minimum spanning forest method for spatial-spectral hyperspectral images classification. EURASIP J. Image Video Process. 2017(1), 71 (2017).

  9. X He, P Niyogi, in 17th Annual Conference on Neural Information Processing Systems (NIPS). Locality preserving projections (MIT Press, Cambridge, 2003), pp. 186–197.

  10. Y Wang, C Han, C Hsieh, K Fan, Vehicle color classification using manifold learning methods from urban surveillance videos. EURASIP J. Image Video Process. 2014, 48 (2014).

  11. H Peng, F Long, CHQ Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27(8), 1226–38 (2005).

  12. Y Li, J Si, G Zhou, S Huang, S Chen, FREL: A Stable Feature Selection Algorithm. IEEE Trans. Neural Netw. Learn. Syst. 26(7), 1388–1402 (2017).

  13. T Le, S Kim, On measuring confidence levels using multiple views of feature set for useful unlabeled data selection. Neurocomputing. 173, 1589–601 (2016).

  14. X Chen, T Fang, H Huo, D Li, Measuring the effectiveness of various features for thematic information extraction from very high resolution remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 53(9), 4837–51 (2015).

  15. Y Sun, Iterative RELIEF for feature weighting: algorithms, theories, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 1035–51 (2007).

  16. CM Bishop, Pattern recognition and machine learning, 5th Edition. Information science and statistics (Springer, New Haven, 2007).

  17. H Wei, SA Billings, Feature subset selection and ranking for data dimensionality reduction. IEEE Trans. Pattern Anal. Mach. Intell. 29(1), 162–66 (2007).

  18. M Dash, H Liu, Consistency-based search in feature selection. Artif. Intell. 151(1-2), 155–176 (2003).

  19. I Guyon, A Elisseeff, An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–82 (2003).

  20. A-C Haury, P Gestraud, J-P Vert, The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS ONE. 6(12), e28210 (2011).

  21. J Li, H Liu, S-K Ng, L Wong, Discovery of significant rules for classifying cancer diagnosis data. Bioinformatics (Oxford, England). 19, 93–102 (2003).

  22. W Hu, W Li, X Zhang, SJ Maybank, Single and multiple object tracking using a multi-feature joint sparse representation. IEEE Trans. Pattern Anal. Mach. Intell. 37(4), 816–33 (2015).

  23. H Wang, F Nie, H Huang, in Proceedings of the 30th International Conference on Machine Learning, 28. Multi-view clustering and feature learning via structured sparsity (JMLR.org, Atlanta, 2013), pp. 1389–1397.

  24. P Somol, J Novovicová, Evaluating stability and comparing output of feature selectors that optimize feature subset cardinality. IEEE Trans. Pattern Anal. Mach. Intell. 32(11), 1921–39 (2010).

  25. PC K. Dunne, F Azuaje, Solutions to instability problems with sequential wrapper-based approaches to feature selection. TCD-CS-2002-28. J. Mach. Learn. Res, 1–22 (2002).

  26. A Kalousis, J Prados, M Hilario, Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl. Inf. Syst. 12(1), 95–116 (2007).

  27. S Loscalzo, L Yu, CHQ Ding, in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Consensus group stable feature selection (ACM, New York, 2009), pp. 567–576.

  28. M Zucknick, S Richardson, EA Stronach, Comparing the characteristics of gene expression profiles derived by univariate and multivariate classification methods. Stat. Appl. Genet. Mol. Biol. 7(1), 95–116 (2008).

  29. Y Saeys, T Abeel, YV de Peer, in Proceedings of European Conference on Machine Learning and Knowledge Discovery in Databases. Robust feature selection using ensemble feature selection techniques (Springer, Berlin, 2008), pp. 313–25.

  30. Y Li, S Gao, S Chen, in Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence. Ensemble feature weighting based on local learning and diversity (AAAI, Menlo Park, 2012).

  31. A Woznica, P Nguyen, A Kalousis, in The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Model mining for robust feature selection (ACM, New York, 2012), pp. 913–921.

  32. Y Han, L Yu, in ICDM 2010, The 10th IEEE International Conference on Data Mining. A variance reduction framework for stable feature selection (IEEE, Washington, 2010), pp. 206–215.

  33. L Yu, Y Han, ME Berens, Stable gene selection from microarray data via sample weighting. IEEE/ACM Trans. Comput. Biology Bioinform. 9(1), 262–72 (2012).

  34. L Yu, CHQ Ding, S Loscalzo, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Stable feature selection via dense feature groups (ACM, New York, 2008), pp. 803–811.

  35. X Chen, G Zhou, Y Chen, G Shao, Y Gu, Supervised multiview feature selection exploring homogeneity and heterogeneity with l12-norm and automatic view generation. IEEE Trans. Geosci. Remote Sens. 55(4), 2074–88 (2017).

  36. Z Zhou, M Zhang, in The Twentieth Annual Conference on Neural Information Processing Systems. Multi-instance multi-label learning with application to scene classification (MIT Press, Cambridge, 2006), pp. 1609–1616.

  37. T Chua, J Tang, R Hong, H Li, Z Luo, Y Zheng, in Proceedings of the 8th ACM International Conference on Image and Video Retrieval. NUS-WIDE: a real-world web image database from National University of Singapore (ACM, New York, 2009).

  38. Y Yang, SD Newsam, in Proceedings of the 18th ACM SIGSPATIAL International Symposium on Advances in Geographic Information Systems. Bag-of-visual-words and spatial extensions for land-use classification (ACM, New York, 2010), pp. 270–279.

  39. C Chang, C Lin, LIBSVM: A library for support vector machines. ACM TIST. 2(3), 27:1–27:27 (2011).

  40. The Feature Extraction. Accessed 2014.


Acknowledgements

Not applicable.


Funding

This work was supported by the National Key Basic Research and Development Program of China under Grant 2012CB719903, the Science Fund for Creative Research Groups of the National Natural Science Foundation of China under Grant 61221003, the National Natural Science Foundation of China under Grants 41071256 and 41571402, and the National Science Foundation of China Youth Program under Grant 41101386.

Availability of data and materials

Not applicable.

Author information



Contributions

YG and JJ implemented the algorithms and performed most of the experiments. HH, TF, and DL revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Tao Fang.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Guo, Y., Ji, J., Huo, H. et al. SIP-FS: a novel feature selection for data representation. J Image Video Proc. 2018, 14 (2018).
