A probabilistic segmentation and entropy-rank correlation-based feature selection approach for the recognition of fruit diseases

Agriculture plays a critical role in the economy of several countries by providing the main sources of income, employment, and food to their rural populations. In recent years, however, plants and fruits have been widely damaged by different diseases, which cause huge losses to farmers. This loss can be minimized by detecting plant diseases at an early stage using pattern recognition (PR) and machine learning (ML) techniques. In this article, an automated system is proposed for the identification and recognition of fruit diseases. Our approach is distinctive in that it overcomes challenges such as convex edges, inconsistency between colors, irregularity, visibility, scale, and origin. The proposed approach incorporates five primary steps: preprocessing, disease identification through segmentation, feature extraction and fusion, feature selection, and classification. The infection regions are extracted using the proposed adaptive and quartile deviation-based segmentation approaches, and the resultant binary images are fused by employing the weighted coefficient of correlation (CoC). Then the most appropriate features are selected using a novel framework of entropy and rank-based correlation (EaRbC). Finally, the selected features are classified using a multi-class support vector machine (MC-SVM). The PlantVillage dataset is utilized for the evaluation of the proposed system, which achieves an average segmentation and classification accuracy of 93.74% and 97.7%, respectively. Based on the reported statistical measures, we believe that the proposed method outperforms existing methods in terms of accuracy.


Introduction
Plant diseases affect both the quality and quantity of agricultural products by interfering with a set of processes including plant growth, flower and fruit development, and absorbent capacity, to name but a few [1]. Therefore, early detection and classification of plant diseases play a vital role in agricultural farming. Two options are available: manual inspection and computer vision techniques. The former is quite difficult and requires a lot of effort and time [2], while the latter is mostly followed because of its improved performance [3]. Plants show a range of symptoms from their early to final stages, which can be easily observed on fruits, leaves, and stems with the naked eye. Therefore, these symptoms can be categorized using computer vision (CV) and other machine learning (ML) methods [4].
A great effort has been made in the field of CV to process visual features extracted from fruit images for the recognition of multiple diseases [5]. Several existing methods work well but do not consider a set of constraints, specifically those related to image quality [6][7][8][9][10][11], training/testing samples, number of labels, and disease complexity, to name but a few [12]. In this article, two fruits are selected and four different types of fruit diseases are initially focused on: apple scab, apple rust, grapes rot leaves, and grapes leaf blight. Most existing methods follow a typical architecture, which includes (a) a preprocessing block, (b) a segmentation block, (c) a feature extraction block, and (d) a classification block. Several detection methods are employed by scholars working in this domain, including clustering, thresholding, color, shape, and texture-based methods, adaptive approaches, etc. All these methods are somewhat problem dependent and follow the same trend: addressing one sort of problem while keeping the parameters of other problems fixed. Therefore, no universal mechanism exists which efficiently deals with all kinds of problems.
In this article, we are primarily focusing on the classification of aforementioned diseases by following fundamental steps. Our primary contributions are enumerated below:

Major contributions
In this article, we introduced a new automated method for the identification and recognition of apple and grape diseases. The proposed method consists of five major steps: (a) contrast stretching; (b) identification of disease part by a fusion of novel adaptive and quartile deviation (QD)-based segmentation, which efficiently performs at the change in scale, origin, and irregularity of infection regions; (c) feature extraction and fusion; (d) an integrated framework of entropy and rank correlation is implemented for feature selection; and (e) classification. Our major contributions are listed below.
1. A contrast stretching technique based on global min and max values is proposed, which defines a contrast range to determine lower and upper threshold values.
2. An adaptive thresholding method following the trapezoidal rule is proposed, which works in two steps: (1) location of infected regions and (2) computing a threshold based on maxima and minima, calculated after taking the second derivative.
3. A parallel feature fusion methodology is adopted, which jointly takes advantage of three sets of features (color, texture, and shape) to select the most discriminant value.
4. To overcome the curse of dimensionality, a feature selection methodology is proposed, which efficiently assigns ranks to the set of features based on entropy.
Bhivini et al. [2] introduced a framework to classify infected regions in apples. In the first stage of segmentation, they utilized K-means clustering to excerpt the infected region and then extract color and texture features from the segmented part. Subsequently, feature fusion is performed using simple concatenation prior to classification using random forest method. Similarly, Shiv et al. [5] introduced a novel method to classify apple diseases based on color, texture, and shape features. The introduced method is comprised of three fundamental steps of segmentation using K-means; extraction of color, texture, and shape features; and classification using multi-class SVM. Following the same trend, Shiv et al. [28] introduced an adaptive approach to detect infectious regions including apple scab, rot, and blotch by achieving a classification accuracy of 93%. The proposed method incorporates three primary steps of segmentation using K-means, feature extraction, and classification using multi-class SVM.
Zhang et al. [29] followed a novel machine learning method for detecting apple diseases. They made use of HSI, YUV, and gray color spaces for the removal of background via thresholding. The infectious regions are extricated by a region growing method to calculate shape, color, and texture features for each region. Finally, the most prominent features are classified using SVM, which are selected using genetic algorithm (GA) and correlation-based feature selection (CFS) method. Similarly, Soni et al. [30] identified plant diseases by following two fundamental steps of segmentation and classification. In the first step, ring-based segmentation is performed to identify infectious regions, followed by the feature extraction step. A probabilistic neural network is used for the final classification of diseases from randomly selected images acquired from the web. Lee et al. [31] implemented a swarm optimization-based method for the identification of apple diseases. Stochastic PSO algorithm finds out 10 spectral features based on pair of bands to return distinctiveness between each pair of classes. The selected features are later utilized by SVM to achieve improved performance. Harshal et al. [32] introduced a framework for the identification and classification of grape diseases. They implemented a background subtraction method for segmentation and later analyze the regions after passing through a high-pass filter. Thereafter, unique fractal-based texture features are extracted and finally classified through a multi-class SVM. They selected downy mildew and black rot diseases for evaluation and achieved classification accuracy of 96.6%.
Pranjali et al. [33] introduced a novel approach of fused classifiers for efficient classification of grape diseases. Initially, both SVM and ANN are utilized independently, and then a new ensemble classifier is constructed for final classification. Similarly, Awate et al. [34] introduced a novel idea in which they utilized K-means for segmentation. Later, texture, color, morphological, and structural features are calculated, which are then subjected to an ANN classifier for final classification. A general comparison with recent methods is also provided in Table 1 in terms of segmentation technique, type of features, feature selection, classification method, disease type, and classification accuracy. From the recent studies, it is quite clear that methods including fuzzy, thresholding, and K-means are mostly utilized for the identification of infectious regions. Recently, the inclusion of saliency and CNN-based techniques has shown improved performance in this domain of agricultural farming [38]. Moreover, color and texture features are mostly utilized for final classification, but the "curse of dimensionality" is largely ignored. In this article, we primarily focus on contrast stretching, infectious region segmentation, and ultimately feature selection to avoid the aforementioned problem. The contrast stretching technique improves the visual characteristics of an input image, which can help in the segmentation phase. The proposed feature selection algorithm aids in improving the overall classification accuracy.

Proposed method
In this section, the proposed method is explained, which incorporates a series of steps including preprocessing, image segmentation and fusion, feature extraction, fusion and selection, and a final step of classification. Figure 1 demonstrates the working framework of the proposed method, showing the series of aforementioned steps.

Contrast stretching
Contrast stretching is mostly applied to images whose visual contents need to be enhanced. In this article, a global contrast stretching technique is proposed, which directly affects the infectious regions by making them maximally distinguishable from the background. This method initially finds the global maxima and minima of the red, green, and blue channels to generate new global minimum and maximum values. These calculated values are later utilized to find a new range of intensity values for each channel, which in turn locates new low and high threshold values. Let $\sum_{k=1}^{3} b_k$ represent the modified red, green, and blue channels. Here, the red channel is the fraction of red, $\text{red} = \frac{\text{red}}{\text{red}+\text{green}+\text{blue}}$; therefore, we used $\sum$ for the addition of all pixel values of the three channels, and their histograms are shown in Fig. 2.
Suppose $T_L$ and $T_H$ are low and high threshold values, which are initialized as 0.01 and 1, respectively. Then the global maxima and minima are calculated using the initial $T_L$ and $T_H$ values as follows: where $\varphi_{max}$ and $\varphi_{min}$ are the global maximum and minimum values, and Max and Min represent the max and min functions, which select the maximum and minimum values from each channel $k$, where $k \in \{1 : 3\}$ indexes the three respective channels red, green, and blue, denoted by $\psi_1$, $\psi_2$, and $\psi_3$.
The initial values of the global maximum and minimum are 1 and 0. A new global-minimum pixel image is then calculated by subtracting $\varphi_{min}$ from the original image $\psi(i, j, k)$; the effects are shown in Fig. 3b. The information of the subtracted image is stored in a temporary array ($T_{ar}$) of size 256 × 256, and the maximum and minimum pixel values of the entire processed image are found by Eqs. 2 and 3. These values are utilized to calculate the range of contrast by Eq. 4.
where $R_{ctr}$ denotes the contrast range image of dimension 256 × 256, as shown in Fig. 3c.
To control the variation of contrast stretching, the low threshold ($T_L$) and high threshold ($T_H$) values are updated by Eqs. 5 and 6.
The values of the low and high thresholds are utilized in the contrast stretching cost function to concatenate the results of each channel. The cost function produces a new image, which is more enhanced than the original image. The cost function is defined by Eq. 7: where $F_{cost}(i, j, k)$ is the resultant contrast-stretched image and $R_{ctr}$ is the contrast range value, which lies between 0 and 1. Equation 7 shows that if $\frac{T_{ar}}{T_H - T_L} \ge R_{ctr}$, then the diseased region in the image is enhanced; otherwise, the background is improved. The final contrast stretching results are shown in Figs. 3 and 4, which are later processed in the segmentation phase.
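As a rough illustration of a global min-max stretch, the following Python sketch rescales all three channels with a single global minimum and maximum. It is a generic stretch under stated assumptions and does not reproduce Eqs. 1-7: the function name `global_contrast_stretch`, the use of NumPy, and the mapping onto the interval between the initial threshold values 0.01 and 1 are illustrative choices.

```python
import numpy as np


def global_contrast_stretch(img, t_low=0.01, t_high=1.0):
    """Stretch an RGB image with values in [0, 1] using one global min/max.

    Hypothetical helper: a generic min-max stretch, not the paper's cost function.
    """
    img = np.asarray(img, dtype=np.float64)
    phi_min = img.min()  # global minimum over the three channels
    phi_max = img.max()  # global maximum over the three channels
    contrast_range = phi_max - phi_min
    if contrast_range == 0:  # flat image: nothing to stretch
        return np.full_like(img, t_low)
    # Shift by the global minimum, scale by the contrast range,
    # then map the result onto the [t_low, t_high] output interval.
    stretched = (img - phi_min) / contrast_range
    return t_low + stretched * (t_high - t_low)
```

Because one minimum and one maximum are shared by the three channels, the relative balance between red, green, and blue is preserved, which a per-channel stretch would not guarantee.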

Disease identification
In this section, the proposed segmentation method is elucidated, comprising the proposed segmentation and fusion methods. In the former, a trapezoidal-based adaptive thresholding method and a quartile deviation (Q.D)-based segmentation method are employed independently, while, in the latter, binary images are fused using the proposed method of weighted coefficient of correlation. Figure 1 demonstrates the set of steps for image segmentation and fusion.

Trapezoidal based adaptive thresholding
To identify the infectious regions, a trapezoidal rule is employed [39], which calculates the area of infection by utilizing max and min pixel values.
where $Total_n$ denotes the total number of pixels in $F_{cost}(i, j, k)$. A second derivative of the image is later computed, and Eq. 8 is updated to find the max and min pixel values. The obtained pixel values are finally embedded into a cost function to extract the infectious regions.
where $D(i, j)$ and $D^2(i, j)$ represent the first and second derivatives of the input image, and $Max_{up}$ and $Min_{up}$ are the updated max and min pixel values. These updated values are initially compared with the old max and min values, defined in Eq. 8, and later updated to calculate the area of infection.
Here, $\int_{\alpha}^{\beta} f(i)\,di$ represents the area of the infected region, which is further utilized in the threshold function.
where $\xi$ denotes pixels which are directly linked to $\int_{\alpha}^{\beta} f(i)\,di$, and $T(i, j)$ represents the optimized adaptive segmented image; sample results are shown in Fig. 5.
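The integral that gives the area of the infected region can be approximated numerically with the trapezoidal rule. The Python sketch below shows only that numerical step. Treating $f(i)$ as a sequence sampled at unit spacing (for example, an intensity histogram) and $\alpha$, $\beta$ as integer min and max pixel values is an assumption, and the name `trapezoid_area` is hypothetical.

```python
def trapezoid_area(values, alpha, beta):
    """Approximate the integral of f over [alpha, beta] with unit spacing.

    values[i] holds f(i); alpha and beta are inclusive integer bounds.
    """
    if not 0 <= alpha < beta < len(values):
        raise ValueError("need 0 <= alpha < beta < len(values)")
    segment = values[alpha:beta + 1]
    # Trapezoidal rule: the two end points count half, interior points count fully.
    return 0.5 * (segment[0] + segment[-1]) + sum(segment[1:-1])
```

For a linear function the rule is exact: with `values = [0, 2, 4, 6, 8]`, that is $f(i) = 2i$, the area over $[0, 4]$ is 16.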

Quartile deviation-based segmentation
Quartile deviation-based segmentation is a new segmentation method, which can be directly mapped onto the input image prior to the thresholding step to generate a binary image. This method works on the basis of coupling, depending on the curve changes. The coupling points are utilized with the normalization function, because Q.D is a property of a normal distribution. Let $f(t) \in F_{cost}(i, j, k)$ have dimension (256 × 256 × 3); then the initial function is defined as: where $(\mu - r)$ and $(\mu + r)$ represent the points of inflection. Taking the L.H.S and putting the normalization function in Eq. 15: Equating $\frac{t-\mu}{\sigma} = X$ and simplifying $dt = \sigma\,dX$ to obtain a new equation: According to the even property of the normal distribution, it will become: where $r$ denotes the final Q.D value, which is finally utilized in the desired cost function for the extraction of infectious regions in fruits and plants. The output of the cost function is in the form of infectious and normal pixels.
where $t \in F_{cost}(i, j, k)$ and $F_{out}(t)$ represents the pixels showing infection, which are set in the threshold function to obtain a binary segmented image.
where $F_{QD}(i, j)$ represents the final Q.D-based segmented image and $t_i$ denotes the current enhanced image pixel. The Q.D segmentation results, including their contour, mesh graph, and 3-D contour images, are shown in Fig. 6.
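For a normal distribution, the quartile deviation is the value $r$ for which half of the probability mass lies within $\mu \pm r$, which gives $r \approx 0.6745\,\sigma$. The Python sketch below computes this value for one image channel and builds a binary mask from it. It is a minimal sketch under stated assumptions: the name `quartile_deviation_mask` is hypothetical, and marking pixels outside $[\mu - r, \mu + r]$ as infected is an illustrative choice, since the cost and threshold functions of the method are not reproduced here.

```python
from statistics import NormalDist

import numpy as np


def quartile_deviation_mask(channel):
    """Return (r, mask) for a 2-D intensity array.

    Assumes roughly normal intensities; mask is True for pixels outside
    [mu - r, mu + r]. Treating those pixels as infected is an assumption.
    """
    channel = np.asarray(channel, dtype=np.float64)
    mu = channel.mean()
    sigma = channel.std()
    # Half of a normal distribution's mass lies within mu +/- r,
    # so r = z_0.75 * sigma, where z_0.75 is about 0.6745.
    r = NormalDist().inv_cdf(0.75) * sigma
    mask = np.abs(channel - mu) > r
    return r, mask
```

The constant comes from the standard library's inverse normal CDF rather than a hard-coded 0.6745, which keeps the link to the 75th percentile explicit.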

Image fusion
Image fusion is mostly employed where information from multiple sources (images) is consolidated into fewer images, usually a single one. In this article, a weighted coefficient of correlation (WCoC)-based technique is implemented for pixel-based fusion of two segmented images. The actual range of CoC lies in (−1 : 1), but in this work, we are working on binary images; therefore, the resultant image is binary. This method finds a strong correlation between pixels of both images. The most highly correlated pixels are assigned higher weights, while less correlated pixels are considered to be background and are eliminated. Suppose $\{p_1, p_2, \ldots, p_n\}$ are uncorrelated pixels from both segmented images $T(i, j)$ and $F_{QD}(i, j)$ having the same standard deviation; the correlation coefficient is defined as: where $\gamma_{12}$ denotes a correlation between pixels which is initialized as y), then the mathematical formulation is done as: Then the weight and bias values are assigned, which are selected to be 0.8 and 2.5.
The above equation is simplified as:

Analysis of segmentation results
For the analysis of the proposed segmentation technique against each disease, we selected 400 image samples (100 for each disease: apple scab, apple rust, grapes rot leaves, and grape leaf blight); a few can be seen in Fig. 8. Three measures are implemented to show the performance of the proposed method, namely accuracy, Jaccard Index, and false negative rate, calculated as follows: where $R_{i,j}$ is the proposed segmented image, $S(i, j)$ is the ground truth, and $TP_l$ represents correlated pixels. Tabular results are provided in Table 2, and graphical results along with their ground truths are shown in Figs. 9 and 10. Additionally, a few other sample segmentation results are provided in Fig. 11. The maximum accuracy of 95.63% is achieved from the tested images; moreover, the minimum reported negative rate is 4.37%, the maximum Jaccard Index is 99.26%, the overall average accuracy is 93.74%, the average Jaccard Index is 94.17%, and the average negative rate is 6.26%. Average results are also plotted in Fig. 12, which describes the range of segmentation accuracy on all selected images.
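The three measures can be computed from a predicted binary mask and its ground truth as sketched below in Python. The sketch uses the common textbook definitions (pixel accuracy, intersection over union, and FN / (FN + TP)), which is an assumption: the exact formulas of this section are not reproduced here, and the function name `segmentation_scores` is hypothetical.

```python
import numpy as np


def segmentation_scores(pred, truth):
    """Return accuracy, Jaccard index, and false negative rate for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)    # infected pixels found
    tn = np.sum(~pred & ~truth)  # background pixels kept as background
    fp = np.sum(pred & ~truth)   # background marked as infected
    fn = np.sum(~pred & truth)   # infected pixels missed
    accuracy = (tp + tn) / pred.size
    union = tp + fp + fn
    jaccard = tp / union if union else 1.0  # two empty masks agree fully
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return {"accuracy": float(accuracy),
            "jaccard": float(jaccard),
            "fnr": float(fnr)}
```

The rates reported in this section are complementary to the accuracies (for example, 95.63% with 4.37%), which suggests that the negative rate used there may be defined as 100% minus accuracy; the standard definition in this sketch would generally give a different number.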

Feature extraction
Features play a vital role in recognizing the primary contents of images or signals. Therefore, in the field of pattern recognition and machine learning, a set of techniques has been proposed [40][41][42][43][44][45]. On the one hand, an optimal set of features leads to accurate classification, while, on the other hand, irrelevant and redundant features are one of the factors behind high misclassification. In this article, we not only focus on the utilization of multiple sets of features but also avoid feature redundancy by implementing a suitable feature selection method. We utilize three different types of features, namely statistical, color [46], and texture (segmented local binary patterns (SLBP)), extracted from the segmented images. For color features, the RGB, HSV, LAB, and YCbCr color spaces are used, and four measures (mean, standard deviation, entropy, and skewness) are calculated for each channel. From each color space, we obtain a feature vector of size 1 × 12, which increases up to 1 × 48 for all selected color spaces, and N × 48 for N images.
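The 1 × 12 color vector per color space can be sketched in Python as follows: four statistics for each of three channels. This is a minimal sketch assuming NumPy; the function name `channel_statistics`, the 256-bin histogram used for the entropy, and the base-2 logarithm are illustrative assumptions, and the color space conversions themselves are left out.

```python
import numpy as np


def channel_statistics(image, bins=256):
    """Return mean, std, entropy, and skewness for each channel of an H x W x C image."""
    image = np.asarray(image, dtype=np.float64)
    features = []
    for k in range(image.shape[2]):
        values = image[:, :, k].ravel()
        mean = values.mean()
        std = values.std()
        # Shannon entropy of the intensity histogram (bin count is an assumption).
        counts, _ = np.histogram(values, bins=bins)
        probs = counts[counts > 0] / values.size
        entropy = -np.sum(probs * np.log2(probs))
        # Skewness as the third standardized moment; set to 0 for a flat channel.
        skew = np.mean(((values - mean) / std) ** 3) if std > 0 else 0.0
        features.extend([mean, std, entropy, skew])
    return np.array(features)
```

Calling the function on the same image expressed in four color spaces and concatenating the results would give the 1 × 48 vector described above.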
For statistical features, Haralick [47] is implemented, which originally used 14 features, but we added 8 new features: correlation 2, cluster prominence, cluster shade, dissimilarity, energy, homogeneity 1, homogeneity 2, and max probability. The addition of these features improves the overall classification accuracy but also increases the computational time. The features are listed in Table 3, and the final vector size is 1 × 88. LBP [48] belongs to the category of texture features, which capture information related to the neighboring pixels. In this work, the 'A' channel from the LAB color space is utilized as the input for feature extraction, because it provides more information compared to other channels. The proposed segmented local binary pattern features (SLBPF) are based on three steps: (a) calculate the distance between the extracted set of LBP features, (b) calculate the statistical features of LBP, and (c) calculate the entropy features of their 8 neighborhood features. The extracted features are simply concatenated with each other to make a new feature vector of size 1 × 72.
where LBP is a feature vector and $S(u) = \begin{cases} 1 & \text{if } u \ge 0 \\ 0 & \text{if } u < 0 \end{cases}$ is a threshold function, $n = 8$ is the total number of neighbors, $g_p$ denotes the neighboring pixels, and $g_c$ is the pivot location [49]. The distance between features is calculated using the relation in Eq. 33, for $n \in n^{th}$ features, where $D_{ij}$ denotes the distance matrix, which is later utilized to compute the mean, variance, skewness, and kurtosis. Later, these metrics are concatenated to generate a new vector having dimension 1 × 64. The entropy features of each 8 neighboring features are computed as: where $a_x$ and $a_y$ denote the neighboring $i$th and $j$th features; 8 entropy features are extracted and concatenated with the previous vector to obtain a new feature vector having size 1 × 72. Finally, all features are fused [50] to generate a resultant vector of size 1 × 208. The core architecture of feature extraction and selection is shown in Fig. 13.

Feature name        Equation
Auto correlation    $\varphi_R = \sum_k \sum_l (k \times l) P(k, l)$
Entropy             $\varphi_H = \sum_k \sum_l P(k, l) \log P(k, l)$
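The basic 8-neighbor LBP operator on which the SLBP features build can be sketched in Python as below, using the standard definition $LBP = \sum_{p=0}^{n-1} S(g_p - g_c)\,2^p$ with $n = 8$. This is a minimal sketch assuming NumPy: the function name `lbp_8_neighbors` and the clockwise neighbor ordering are illustrative assumptions, and the distance, statistical, and entropy steps of SLBPF are not included.

```python
import numpy as np


def lbp_8_neighbors(gray):
    """Return basic 3x3 LBP codes for the interior pixels of a 2-D array."""
    gray = np.asarray(gray, dtype=np.float64)
    rows, cols = gray.shape
    center = gray[1:-1, 1:-1]
    # Neighbor offsets, clockwise from the top-left corner (ordering is assumed).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for p, (dy, dx) in enumerate(offsets):
        # Shifted view holding neighbor p of every interior pixel.
        neighbor = gray[1 + dy:rows - 1 + dy, 1 + dx:cols - 1 + dx]
        # S(u) = 1 when u >= 0, else 0; bit p carries the weight 2**p.
        codes += (neighbor - center >= 0).astype(np.int64) << p
    return codes
```

In the setting described above, the input would be the 'A' channel of the LAB image; border pixels are skipped because they do not have a full 3 × 3 neighborhood.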

Feature selection
To avoid redundancy, the feature selection step plays a primary role by eliminating irrelevant and repeated information, hence selecting the most discriminant information. In this article, we implemented a new method based on rank correlation and entropy, which works in three steps: (a) compute the rank correlation between the fused features; (b) find the entropy value of the fused features and multiply it by the rank correlation; (c) set a threshold function to select those features which are minimum to the entropy-correlation value. It is given that the extracted fused features $f_1, f_2, \ldots, f_n$ are ranked from 1 to $n$. We need to find the correlation between the ranks of the given features. The rank correlation is defined as: where $f_1$ and $f_2$ represent the fused feature vectors. The above equation simplifies using $\sum f_1 = \sum f_2 = \frac{n(n+1)}{2}$ and $\sum (f_1)^2 = \sum (f_2)^2 = \frac{n(n+1)(2n+1)}{6}$. Then the difference between the fused features is calculated as $\phi = f_1 - f_2$, where $\phi$ denotes the difference between features. Squaring both sides, applying the summation, and dividing both sides by 2 gives $\sum f_1 f_2 = \frac{n(n+1)(2n+1)}{6} - \frac{1}{2}\sum \phi^2$. Similarly, $n \sum f_1^2 - (\sum f_1)^2$ and $n \sum f_2^2 - (\sum f_2)^2$ both equal $\frac{n^2(n^2-1)}{12}$. Putting these simplifications in Eq. 36 gives the final rank correlation. Then the entropy value of the fused feature vector is calculated and multiplied by the correlation. The obtained value is compared with each feature of the fused vector, and the features are selected based on the final threshold function as follows: The resultant vector $\overrightarrow{F(Vec)}$ is utilized for final classification. We performed simulations several times and found that the selected vector size mostly lies in the range of 180-195.
Finally, the multi-class SVM [51] is used as the base classifier for the classification of apple and grape diseases, and its classification results are compared with other well-known classification methods such as ensemble, decision trees, etc. Two SVM kernel functions are utilized in this work: linear and radial basis function (RBF). The linear kernel is used for the binary class problem, with the other parameters set as follows: kernel scale automatic, classification method one vs one, and data standardization enabled. Similarly, for the RBF kernel, the other parameters are: kernel scale manual, box constraint level 4, multi-class method one vs all, and gamma initialized as 0.3.
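The rank correlation used in the feature selection step can be checked numerically. For two untied rankings of $1, \ldots, n$, the sums of ranks and of squared ranks have the closed forms used in the derivation, and the correlation of the ranks reduces to the well-known Spearman form $1 - \frac{6 \sum \phi^2}{n(n^2-1)}$. The Python sketch below compares that closed form with a direct correlation of the ranks. The function names are hypothetical, and the entropy multiplication and threshold function of the method are not reproduced, since their exact form is not shown here.

```python
def rank_correlation(ranks_1, ranks_2):
    """Spearman rank correlation for two untied rankings of 1..n."""
    n = len(ranks_1)
    if n != len(ranks_2) or n < 2:
        raise ValueError("need two rankings of equal length >= 2")
    # phi is the per-feature rank difference f1 - f2.
    sum_phi_sq = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_phi_sq / (n * (n * n - 1))


def pearson(x, y):
    """Plain correlation coefficient, used here only as a cross-check."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    numerator = n * sxy - sx * sy
    denominator = ((n * sxx - sx * sx) * (n * syy - sy * sy)) ** 0.5
    return numerator / denominator
```

For the rankings `[1, 2, 3, 4, 5]` and `[2, 1, 4, 3, 5]`, the squared differences sum to 4, so both functions give 0.8.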

Experimental results and discussion
In this section, the proposed method is validated on a publicly available dataset, PlantVillage [52], which contains a set of diseased and healthy images (Fig. 14). To prove the authenticity of the proposed algorithm, individual features are classified first, and fusion and selection are applied later.

Apple scab disease
In this section, the classification results on apple scab disease are presented. A total of 2275 images of apple scab (630) and healthy apple (1645) are collected from the PlantVillage dataset. The results are obtained in two phases. In the first phase, the results are obtained from each extracted set of features, as depicted in Table 4, with maximum accuracies on multi-class SVM of 94.1%, 86.3%, and 72.0% for SLBP, statistical, and color features, respectively. Then these results are compared with the proposed entropy-rank correlation-based selection method. Table 5 shows a maximum accuracy of 97.1%, FNR 2.9%, sensitivity 96.15%, specificity 96.2%, FPR 0.039, and precision 96.10%. The proposed results are confirmed by the confusion matrix of apple scab given in Table 6. Tables 4 and 5 clearly show that the proposed feature selection method produces better results than the individual sets of features. Moreover, the proposed method is also compared with previous state-of-the-art methods as presented in Table 7, which supports the authenticity of the proposed entropy-rank correlation method.

Apple rust disease
A total of 1920 images are collected from the PlantVillage dataset, containing apple rust (275) and healthy apple (1645) images. The experiments are performed in two steps: in the first step, classification results are obtained on each extracted set of features (Table 8), and in the second step, on the features selected by the proposed method (Table 9). Classification results are also confirmed using the confusion matrix given in Table 6. From Tables 8 and 9, it is quite clear that performance improves significantly with the proposed feature selection method. Additionally, the proposed classification results are also compared with the existing methods given in Table 7.

Grape diseases
Two types of grape diseases, grapes rot leaves and grapes leaf blight, are selected in this section for classification. A total of 2679 images are collected from the PlantVillage dataset, which include grapes black rot (1180), grapes leaf blight (1076), and healthy (423). The same trend is followed; in the first step, classification results are obtained on each extracted set of features (Table 10). In Table 10, the classification results are obtained on grapes rot leaves, with accuracies of 93.2%, 90.9%, and 95.8% for SLBP, Haralick, and color features, respectively. Also, the proposed classification results of grapes leaf blight are presented in Table 11 with a maximum accuracy of 96.30%, which is also confirmed by the confusion matrix (Table 6). Finally, the proposed results are compared with the existing methods described in Table 7, which shows that the proposed method performs significantly well compared to existing methods.

Final classification
In this section, all selected diseases are utilized for classification, and the proposed method is directly applied to them. The testing results are given in Table 12, with a maximum accuracy of 97.1% on multi-class SVM. The proposed testing results are confirmed by the confusion matrix given in Table 13, which supports the authenticity of the proposed method.

Discussion
From a broader perspective, two primary domains are covered: (1) infected region segmentation and (2) classification. The classification results of the individual feature sets are presented in Tables 4, 8, and 10. The proposed entropy-rank correlation results are presented in Tables 5, 9, 14, 11, and 12 and are confirmed by the confusion matrices given in Tables 6 and 13, which supports the authenticity of the proposed method. Additionally, the 8 new statistical features improve the overall accuracy by embedding a set of unique features (Fig. 15). Figure 15 shows that when the 14 texture features are computed, the achieved accuracies are 81.9%, 82.7%, 81.8%, and 84.5% for apple scab, rust, grapes rot, and grape blight, respectively, whereas the addition of 8 features increases the accuracies to 86.3%, 87.2%, 90.9%, and 91.7%, respectively. In Fig. 16, the F1 score is calculated for the proposed feature selection approach on all selected diseases: apple scab, apple rust, grapes rot, and grapes leaf blight. The results in terms of sensitivity, precision, F1 score, and accuracy show that the proposed feature selection method performs better than the individual feature sets. Finally, a comparison with the latest techniques is given in Table 7, which shows that the proposed method performs significantly well compared to existing methods.

Conclusion
Detection and classification of fruit diseases is an important research area in the field of computer vision and pattern recognition. Due to the complexity and irregularity of diseases in apple and grape leaves and fruits, several existing methods are unable to achieve the required classification accuracy. Therefore, in this article, a new technique is implemented for apple and grape disease detection and classification, which is based on the fusion of a novel adaptive thresholding and Q.D-based segmentation. Later on, a set of different features is extracted to perform a serial-based fusion. A novel entropy-rank correlation technique is implemented for robust feature selection, which works efficiently compared to individual features and existing related methods in terms of accuracy, sensitivity, precision, and FPR. The proposed method works efficiently not only on web images but also on publicly available datasets, which contain many challenges such as noise and background complexity, to name but a few. From this research, we conclude that a combination of different feature sets increases the overall accuracy but also increases the computational time and complexity. Therefore, it is somewhat mandatory to involve a feature selection method. The segmentation step plays its role in the extraction of better features, leading to better classification. As future work, deep features will be utilized instead of conventional ones, and the number of diseases will be increased, but the selection step remains somewhat obligatory even with deep features.