Data feature selection based on Artificial Bee Colony algorithm

Classification of data in large repositories requires efficient techniques for analysis since a large number of features is created for better representation of such data. Optimization methods can be used in the process of feature selection to determine the most relevant subset of features from the data set while maintaining an accuracy rate comparable to that obtained with the original set of features. Several bioinspired algorithms, that is, algorithms based on the behavior of living beings in nature, have been proposed in the literature with the objective of solving optimization problems. This paper aims at investigating, implementing, and analyzing a feature selection method using the Artificial Bee Colony approach for the classification of different data sets. Various UCI data sets have been used to demonstrate the effectiveness of the proposed method against other relevant approaches available in the literature.


Introduction
Data analysis aims at extracting and modeling information content to identify patterns within the data. To reduce the amount of information needed to describe a large set of data, features are extracted from the data, serving as representative characteristics of its contents. In image analysis, for instance, examples of features include color, texture, edges, object shape, and interest points, among others. These features are usually organized into an n-dimensional feature vector.
Feature selection is an important step used in several tasks, such as image classification, cluster analysis, data mining, pattern recognition, and image retrieval, among others. It is a crucial preprocessing technique for effective data analysis, where only a subset of the original data features is chosen to eliminate noisy, irrelevant, or redundant features. This task reduces the computational cost and improves the accuracy of the data analysis process. This paper proposes a feature selection method for data analysis based on the Artificial Bee Colony (ABC) approach that can be used in several knowledge domains through wrapper and forward strategies. The ABC method has been widely used to solve optimization problems; however, there have been few works on feature selection. Our work proposes a binary version of the ABC algorithm, where the number of new features to be analyzed in a neighborhood of a food source is determined through a perturbation parameter proposed by Karaboga and Akay [1]. The method is analyzed and compared to other relevant approaches available in the literature. Experimental results showed that a reduced number of features can achieve classification accuracy superior to that obtained using the full set of features. The accuracy has significantly increased even though the number of selected features has been drastically reduced. Furthermore, the proposed method presented better results for the majority of the tested data sets compared to other algorithms.
The paper is organized as follows: Initially, some relevant concepts and work related to feature selection are described. The proposed methodology for feature selection is then presented in detail. Experimental results obtained through the application of the proposed method to several data sets are described and discussed. Finally, the remaining section concludes the paper with final remarks and directions for future work.

Related concepts and work
The process of feature selection is responsible for electing a subset of features, which can be described as a search into a state space. One can perform a full search in which the entire space is traversed; however, this approach is impractical for a large number of features. A heuristic search considers, at each iteration, the features not yet selected for evaluation. A random search generates random subsets within the search space; several bioinspired and genetic algorithms use this approach [2].
http://jivp.eurasipjournals.com/content/2013/1/47
Feature selection can be described as a search into a space of states, and according to the initialization and behavior during the search steps, we can divide the search into three different approaches [3]:
• Forward: the feature subset is initialized empty and features are included in the subset during the feature selection process.
• Backward: the feature subset is initialized with the full set of features and features are excluded from the subset during the feature selection process.
• Bidirectional: features can be inserted or excluded during the feature selection process.
Feature selection methods can be classified into two main categories: filter approaches [4][5][6][7][8][9] and wrapper approaches [10][11][12][13][14]. In filter approaches, a filtering process is performed before the classification process; therefore, they are independent of the classification algorithm used [15]. A weight value is computed for each feature, such that those features with better weight values are selected to represent the original data set. On the other hand, wrapper approaches generate a set of candidate features by adding and removing features to compose a subset of features. Then, they employ the classification accuracy to evaluate the resulting feature set. Wrapper methods usually achieve better results than filter methods.
The use of Swarm Intelligence for feature selection has increased in recent years. Suguna and Thanushkodi [23] proposed a rough set approach with the ABC algorithm for dimensionality reduction, tested on different medical data sets in the area of dermatology, whereas Shokouhifar and Sabet [24] employed the same algorithm (ABC) for feature selection using neural networks. Particle Swarm Optimization (PSO) has been proposed for feature selection either as a filter method [15] or as a wrapper method [25][26][27]. Nakamura et al. [2] proposed a wrapper method using a BAT algorithm with the OPF classifier. Among feature selection approaches based on Ant Colony Optimization (ACO), we can highlight the ACO for image feature selection proposed by Chen et al. [28].
The Artificial Bee Colony is a Swarm Intelligence algorithm used to solve optimization problems in several research areas [29][30][31][32][33]. It was proposed by Karaboga [20] in 2005, based on the foraging behavior of honeybees. Frisch [34], Frisch and Lindauer [35], and Seeley [36] have investigated the foraging behavior of bees, including external information (odor, location information in the waggle dance, presence of other bees in the food source or between the hive and the source) and internal information (source location and source odor). The process starts when bees leave the hive to forage for a food source (nectar). After finding nectar, the bees store it in their stomach. After coming back to the hive, the bees unload the nectar and perform a waggle dance to share their information about the food source (nectar quantity, distance, and direction from the hive) and recruit new bees for exploring the richest food sources [37].
The minimal model of ABC that allows the collective intelligence of a bee swarm to emerge consists of three components: food sources, employed bees, and unemployed bees [38], which are described as follows:
• Food sources: each food source represents a probable solution to the problem.
• Employed bees: employed bees find a food source, store information about its quality, and share this information with other bees in the hive. The number of food sources and the number of employed bees are the same.
• Unemployed bees: unemployed bees can be of two types: onlooker bees or scout bees.
  - Onlooker bees: onlooker bees receive information from employed bees about the quality of food sources and choose food sources with better quality to explore their neighborhood. At the moment that onlooker bees choose a food source to explore, they become employed bees.
  - Scout bees: employed bees become scout bees when a food source is exhausted. In other words, the employed bees explored a food source neighborhood MAX LIMIT times; however, they did not find any food source with better quality. Scout bees try to find new food sources.
A general pseudocode for the ABC optimization approach [22] is shown in Algorithm 1.

Initialization phase
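The original algorithm [1] proposes a random creation of food sources, such that each one of them corresponds to a possible solution to the problem:
$$x_{ij} = x_j^{\min} + \mathrm{rand}(0,1)\,\left(x_j^{\max} - x_j^{\min}\right) \qquad (1)$$
where i = 1, . . . , N, j = 1, . . . , D, such that N is the number of food sources, D is the number of optimization parameters, and $x_j^{\min}$ and $x_j^{\max}$ are the lower and upper bounds of parameter j.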
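The random creation of food sources can be illustrated with a minimal sketch, assuming that each optimization parameter is drawn uniformly between a lower and an upper bound. The function name `init_food_sources` and the arguments `lower` and `upper` are hypothetical names chosen for this example.

```python
import random

def init_food_sources(n_sources, lower, upper, rng=None):
    """Create n_sources random candidate solutions.

    lower/upper are per-parameter bounds (hypothetical argument names);
    each parameter is drawn uniformly inside its own interval.
    """
    rng = rng or random.Random()
    dim = len(lower)
    return [
        [lower[j] + rng.random() * (upper[j] - lower[j]) for j in range(dim)]
        for _ in range(n_sources)
    ]

# Five food sources with D = 3 optimization parameters each.
sources = init_food_sources(
    5, lower=[-1.0, 0.0, 10.0], upper=[1.0, 5.0, 20.0], rng=random.Random(0)
)
```

Passing a seeded `random.Random` instance keeps the example reproducible; the seed value itself has no special meaning.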

Employed bee phase
Each employed bee explores the neighborhood of the food source associated with it. The neighborhood exploration is defined as
$$v_{ij} = x_{ij} + \phi_{ij}\,(x_{ij} - x_{kj}) \qquad (2)$$
For each food source $x_i$, a food source $v_i$ is determined through the modification of an optimization parameter j, that is, $x_{ij}$ is modified. Indices j and k are random variables. The value of k is in the range 1, 2, . . . , N and must be different from i. $\phi_{ij}$ is a real number between −1 and 1.
Once $v_i$ is produced, the fitness value of the food source is obtained by
$$fit_i = \begin{cases} \dfrac{1}{1 + f_i}, & f_i \geq 0 \\ 1 + |f_i|, & f_i < 0 \end{cases} \qquad (3)$$
where $f_i$ is a cost function. For maximization problems, the cost function can be directly used as a fitness value. After all employed bees have conducted their search, they share the information about the quality of the food source with the onlooker bees. The probability of an onlooker bee choosing a food source to be explored is associated with its fitness, that is,
$$p_i = \frac{fit_i}{\sum_{n=1}^{N} fit_n} \qquad (4)$$
Through the values of exploration probabilities, the food sources are selected by the onlooker bees.
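The employed bee phase can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the cost-to-fitness mapping follows the form commonly used in the ABC literature for minimization problems.

```python
import random

def neighbor(sources, i, rng):
    """Perturb one randomly chosen parameter j of source i using another source k."""
    x_i = sources[i]
    j = rng.randrange(len(x_i))                                  # random parameter index
    k = rng.choice([s for s in range(len(sources)) if s != i])   # random source, k != i
    phi = rng.uniform(-1.0, 1.0)                                 # real number in [-1, 1]
    v_i = list(x_i)                                              # copy, original is untouched
    v_i[j] = x_i[j] + phi * (x_i[j] - sources[k][j])
    return v_i

def fitness(cost):
    """Map a cost value to a fitness value (assumed mapping; higher is better)."""
    return 1.0 / (1.0 + cost) if cost >= 0 else 1.0 + abs(cost)

def selection_probabilities(fitnesses):
    """Probability of each food source being chosen, proportional to its fitness."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]
```

For example, fitness values of 1, 1, and 2 yield probabilities of 0.25, 0.25, and 0.5, so the third source is twice as likely to attract an onlooker bee.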

Onlooker bee phase
The food sources with a higher probability of being explored are selected by the onlooker bees, which then become employed bees. The neighborhood of each selected food source is explored as explained in the 'Employed bee phase' subsection.
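One common way to let onlooker bees pick food sources in proportion to their exploration probabilities is roulette-wheel selection. The sketch below assumes this scheme; the text does not specify the exact selection mechanism, and the function name `choose_source` is hypothetical.

```python
import random

def choose_source(probabilities, rng):
    """Roulette-wheel selection: return an index drawn in proportion to its probability."""
    r = rng.random()            # uniform draw in [0, 1)
    cumulative = 0.0
    for index, p in enumerate(probabilities):
        cumulative += p
        if r < cumulative:
            return index
    return len(probabilities) - 1   # guard against floating-point round-off

# Each onlooker bee draws one food source to explore.
rng = random.Random(42)
chosen = [choose_source([0.1, 0.6, 0.3], rng) for _ in range(10)]
```

Sources with higher probability are chosen more often, but sources with low probability can still be selected, which preserves some diversity in the search.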

Scout bee phase
The algorithm checks to see if there is any exhausted source to be abandoned. In order to decide if a source is to be abandoned, the LIMIT variable which has been updated during search is used. If the value of the LIMIT is greater than that of the MAX LIMIT, then the food source is assumed to be exhausted and is abandoned. The food source abandoned by its bee is replaced with a new food source discovered by the scout. The new food source associated with the scout bee is created randomly.
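The abandonment rule can be sketched as below, assuming that each food source keeps its own trial counter (the LIMIT variable) and that the counter is reset when the source is replaced. The names `scout_phase`, `trials`, and `new_source` are hypothetical.

```python
def scout_phase(sources, trials, max_limit, new_source):
    """Replace every exhausted food source and return the replaced indices.

    trials[i] plays the role of the LIMIT counter of source i;
    new_source is a callable that creates a random food source.
    """
    replaced = []
    for i, count in enumerate(trials):
        if count > max_limit:          # source is considered exhausted
            sources[i] = new_source()  # scout bee discovers a new random source
            trials[i] = 0              # assumption: the counter restarts from zero
            replaced.append(i)
    return replaced

sources = [[0.0], [1.0], [2.0]]
trials = [0, 6, 3]
replaced = scout_phase(sources, trials, max_limit=5, new_source=lambda: [9.0])
```

In this toy run only the second source exceeds the limit, so it is the only one replaced.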

Artificial Bee Colony algorithm for feature selection
Unlike continuous optimization problems, where the possible solutions can be represented by vectors with real values, the candidate solutions to the feature selection problem are represented by bit vectors.
Each food source is associated with a bit vector of size N, where N is the total number of features. Each position in the vector corresponds to one feature to be evaluated. If the value at the corresponding position is 1, the feature is part of the subset to be evaluated. On the other hand, if the value is 0, the feature is not part of the subset to be evaluated. Additionally, each food source stores its quality (fitness), which is given by the accuracy of the classifier using the feature subset indicated by the bit vector.
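The bit-vector representation can be illustrated with a small sketch that applies a mask to the rows of a data set. The helper name `select_features` and the sample values are made up for this example.

```python
def select_features(row, mask):
    """Keep only the values of `row` whose mask bit is 1."""
    return [value for value, bit in zip(row, mask) if bit == 1]

# A food source that selects the first and the fourth of four features.
mask = [1, 0, 0, 1]
rows = [[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3]]
reduced = [select_features(r, mask) for r in rows]
# reduced == [[5.1, 0.2], [6.2, 1.3]]
```

The reduced rows are what would be submitted to the classifier in order to obtain the fitness of the food source.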
The main steps of the proposed feature selection method are illustrated in Figure 1. Each step is described as follows:
1. Create initial food sources: for feature selection, it is desirable to search for the best accuracy using the lowest possible number of features. For this reason, the proposed method follows the forward search strategy. The algorithm is initialized with N food sources, where N is the total number of features. Each food source is initialized with a bit vector of size N, where only one feature is present in the feature subset, that is, only one position of the vector is filled with 1.
2. Submit the feature subset of each food source to the classifier and use accuracy as fitness: the feature subset of each food source is submitted to the classifier, and the accuracy is stored as the fitness of the food source.
3. Determine neighbors of chosen food sources by employed bees using the modification rate (MR) parameter: each employed bee visits a food source and explores its neighborhood. For feature selection, a neighbor is created from the bit vector of the original food source. In the basic version of the ABC algorithm, the neighborhood is defined by performing a small perturbation in only one optimization parameter through Equation 2, which makes convergence slower. In feature selection, the optimization parameters are represented by the bit vectors and their perturbation is controlled by a perturbation frequency or MR [1]. For each position of the bit vector, that is, for each feature, a uniform random number $R_i$ is generated in the range between 0 and 1. If this value is lower than the perturbation parameter MR, the feature is inserted into the subset, that is, the vector value at that position is filled with 1. Otherwise, the value of the bit vector is not modified. This is expressed in Equation 5:
$$x_i = \begin{cases} 1, & R_i < MR \\ x_i, & \text{otherwise} \end{cases} \qquad (5)$$
where $x_i$ is the position i in the bit vector.
4. Submit the feature subset of each neighbor to the classifier and use accuracy as fitness: the feature subset created for each neighbor is submitted to the classifier, and the accuracy is stored as the neighbor's fitness.
5. Fitness of neighbor is better?: if the quality of the newly created neighbor is better than that of the food source under exploration, then the neighbor is considered as a new food source and information about its quality will be shared with other bees. Otherwise, the variable LIMIT of the food source whose neighborhood is being explored is incremented. If the value of LIMIT is greater than that of MAX LIMIT, then the food source is abandoned, that is, the food source is exhausted. In other words, the employed bees explored a food source neighborhood MAX LIMIT times; however, they did not find any food source with better quality, so it is not worthwhile to keep exploring a region where all surrounding food sources have worse quality than the current source. For each abandoned source, the method creates a scout bee to randomly search for a new food source. The mechanism of search is illustrated in Figure 2.
6. All onlookers are distributed?: onlooker bees collect information about the fitness of food sources visited by employed bees and choose food sources with either a higher probability of exploration or better fitness. At the moment that onlooker bees choose the food source to be explored, they become employed bees and execute step 3.
7. Memorize the best food source: after all onlookers have been distributed, the food source with the best fitness is stored.
8. Find abandoned food sources and produce new scout bees: for each abandoned food source, a scout bee is created and a new food source is generated, where a bit vector of size N, with N being the number of features, is randomly created and submitted to the classifier, and the accuracy is stored. The new food source is assigned to the scout bees, which then become employed bees and execute step 3.
Figure 1 Steps of ABC feature selection. Diagram with the main steps of the proposed ABC feature selection method.
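Steps 1, 3, and 5 can be sketched as below. This is a minimal illustration under stated assumptions rather than the authors' Java implementation: all function names are hypothetical, `evaluate` stands in for the classifier accuracy, and resetting the LIMIT counter after an improvement is an assumption borrowed from the usual ABC formulation.

```python
import random

def forward_initial_sources(n_features):
    """Step 1: one food source per feature, each with exactly one bit set."""
    return [[1 if j == i else 0 for j in range(n_features)] for i in range(n_features)]

def mr_neighbor(bits, mr, rng):
    """Step 3: set a bit to 1 when its random draw is below MR; keep other bits as they are."""
    return [1 if rng.random() < mr else b for b in bits]

def explore(source, limit, evaluate, mr, rng):
    """Step 5: greedy comparison between a food source and one neighbor.

    source is a (bits, fitness) pair; returns the retained pair and the
    updated LIMIT counter.
    """
    bits, fit = source
    candidate = mr_neighbor(bits, mr, rng)
    cand_fit = evaluate(candidate)       # hypothetical stand-in for classifier accuracy
    if cand_fit > fit:
        return (candidate, cand_fit), 0  # assumption: counter resets after an improvement
    return source, limit + 1             # no improvement: one more failed trial

initial = forward_initial_sources(3)
# initial == [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Because a neighbor can only gain features, the search moves in the forward direction; features can only be discarded when a scout bee creates a new random food source.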

Experimental results
This section describes the data sets tested in our experiments, the computational resources used to implement and evaluate the proposed feature selection method, the strategies adopted in the data classification, the ABC parameters, as well as a discussion of the experimental results.

Data sets
The proposed method has been evaluated on ten data sets from different knowledge fields. The data sets are available from the UCI Machine Learning Repository [39]. Table 1 presents a description of the tested data sets, including the number of instances, number of features, and number of classes for each data set. UCI data sets have been widely used in the evaluation of data classification since they contain a varied number of features and classes, allowing the analysis of the influence on accuracy and performance when features are selected (Table 2).

Comparison against other methods
The proposed method was compared to some relevant swarm and evolutionary approaches: ACO, PSO, and genetic algorithms (GAs) (Table 3).

Computational environment
All the experiments have been conducted on a computer with an Intel Core i7-2600 3.4 GHz processor and 4 GB of RAM. The Artificial Bee Colony feature selection algorithm was implemented in the Java programming language, using the Weka [40] and LibSVM [41] libraries to execute the data classification.

Classification setup
To evaluate the accuracy and performance of the classification process with the original and selected feature sets, a tenfold cross-validation is used. In k-fold cross-validation, the data set is randomly partitioned into k equally sized folds (samples). One fold is retained as the test set, whereas the remaining k − 1 folds are used as the training set. This process is repeated k times, such that each fold is used as test data exactly once. The average of the k results produces an estimate of the accuracy. The accuracy measure employed for evaluating the results is the percentage of instances correctly classified, that is, for which a correct prediction was made.
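The k-fold procedure can be sketched as follows. The experiments themselves relied on Weka, so this is only an illustration of the idea; `train_and_test` is a hypothetical callback that trains a classifier on the training indices and returns its accuracy on the test indices.

```python
import random

def k_fold_indices(n_instances, k, rng):
    """Shuffle instance indices and split them into k folds of near-equal size."""
    indices = list(range(n_instances))
    rng.shuffle(indices)
    return [indices[f::k] for f in range(k)]

def cross_validated_accuracy(n_instances, k, train_and_test, rng):
    """Average the accuracy obtained when each fold is used once as the test set."""
    folds = k_fold_indices(n_instances, k, rng)
    accuracies = []
    for f, test_idx in enumerate(folds):
        # All folds except the current one form the training set.
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        accuracies.append(train_and_test(train_idx, test_idx))
    return sum(accuracies) / k
```

When the number of instances is not a multiple of k, the fold sizes in this sketch differ by at most one instance.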
In some tests, the feature vector has been normalized using z-score [42], that is, the features are normalized by subtracting their mean value and dividing them by their standard deviation.
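A z-score normalization of the feature columns can be sketched as below. The use of the population standard deviation and the handling of constant features (mapped to 0.0) are assumptions of this example, since the text does not specify them.

```python
import statistics

def z_score_columns(rows):
    """Normalize each feature (column) to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.pstdev(c) for c in columns]   # population standard deviation
    return [
        # A constant feature has zero deviation; map it to 0.0 to avoid division by zero.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stdevs)]
        for row in rows
    ]

normalized = z_score_columns([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
```

In this toy data set the first feature is rescaled around zero, whereas the second feature is constant and therefore becomes 0.0 in every row.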

ABC parameters
The following parameters are used in the ABC algorithm:

ACO parameters
The following parameters are used in the ACO algorithm:
- Population size = 10
- Number of generations = 10
- Alpha = 1
- Beta = 2
- Report frequency = 10

GA parameters
The following parameters are used in the GA:
- Population size = 200
- Number of generations = 20
- Probability of crossover = 0.6
- Probability of mutation = 0.033
- Report frequency = 20
Table 4 shows the results obtained by applying the proposed feature selection method to each data set. It is possible to observe that the selected feature set provides higher accuracy than the original feature set for all data sets, even though the number of selected features is much smaller than the original one for some data sets, such as Auto, Heart-Statlog, and Hepatic. It can be observed that, in terms of accuracy, the ABC algorithm obtained superior results (eight out of ten tested data sets) when compared to other methods. The accuracy of the proposed method was worse only for the Image Segmentation and Diabetes data sets. For the Diabetes data set, although the other algorithms obtained a better accuracy, they did not reduce the set of features, that is, they used all the features. The proposed algorithm used only one feature; however, despite this fact, its accuracy was comparable to that of the other algorithms (75.65% against 71.48%). For the Image Segmentation data set, the proposed algorithm used 12 features against 16 and 17 of the

Conclusions
This work presents a feature selection method based on the ABC algorithm. The results show that a reduced number of features can achieve classification accuracy superior to that obtained using the full set of features. For some data sets, the accuracy has significantly increased even though the number of selected features has been drastically reduced. The proposed method presented better results for the majority of the tested data sets compared to other algorithms. For future work, we plan to investigate alternative mechanisms to explore the neighborhood of food sources, parallelize the exploration performed by employed bees over the food sources, and create a filter approach combining the ABC algorithm, entropy, and mutual information.