Multi-objective Evolutionary Approach for the Performance Improvement of Learners using Ensembling Feature Selection and Discretization Technique on Medical Data

Page: [355 - 370] Pages: 16

Abstract

Background: Biomedical data are dominated by continuous real-valued features, which tend to cause problems such as underfitting, the curse of dimensionality, and a higher misclassification rate because of increased variance. Pre-processing the dataset mitigates these side effects and has been shown to help maintain adequate accuracy.

Aims: Feature selection and discretization are two essential preprocessing steps that have been used effectively to handle redundancy in biomedical data. However, previous work has not integrated the two into a unified approach to the data redundancy problem, which has left the field disjointed and fragmented. This paper proposes a novel multi-objective dimensionality reduction framework that combines discretization and feature reduction in a single ensemble model. A multi-objective genetic algorithm (NSGA-II) governs both the selection of optimal features and the assignment of each selected feature to a discretized or non-discretized group. Two objectives serve as the fitness criteria: minimizing the error rate during feature selection and maximizing the information gain during discretization.
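
As a rough illustration of the two fitness criteria, the sketch below computes the information gain of a single binary cut point and checks Pareto dominance between two candidate solutions scored as (error rate, information gain). The function names and the `<= cut` splitting convention are assumptions made for this example; the abstract does not specify how the paper implements either objective.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, cut):
    """Information gain from splitting a continuous feature at `cut`.

    Assumed convention: values <= cut go to the left bin, the rest to the right.
    """
    left = [y for x, y in zip(values, labels) if x <= cut]
    right = [y for x, y in zip(values, labels) if x > cut]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


def dominates(a, b):
    """True if solution `a` Pareto-dominates `b`.

    Each solution is an (error_rate, information_gain) pair:
    error rate is minimized, information gain is maximized.
    """
    err_a, gain_a = a
    err_b, gain_b = b
    no_worse = err_a <= err_b and gain_a >= gain_b
    strictly_better = err_a < err_b or gain_a > gain_b
    return no_worse and strictly_better
```

A cut that separates the classes perfectly yields the maximum gain for a balanced two-class feature, and a solution dominates another only when it is at least as good on both objectives and strictly better on one. NSGA-II builds its non-dominated fronts from comparisons of this kind.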

Methods: The proposed model uses a wrapper-based feature selection algorithm to select the optimal features and then sorts the selected features into two blocks, discretized and non-discretized. Features in the discretized block undergo binary discretization, while features in the second block are not discretized and are used in their original form.
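
The two-block split can be pictured with a small sketch. The three-valued gene encoding (0 = dropped, 1 = kept in original form, 2 = binary-discretized) and the externally supplied cut points are hypothetical choices made for this illustration; the abstract does not describe the actual chromosome layout or how cut points are chosen.

```python
def apply_chromosome(rows, genes, cuts):
    """Transform a dataset according to one candidate solution.

    Hypothetical encoding, one gene per feature:
      0 -> feature is not selected (dropped)
      1 -> feature is selected and kept in its original continuous form
      2 -> feature is selected and binary-discretized at cuts[j]
    """
    if len(genes) != len(cuts):
        raise ValueError("genes and cuts must have one entry per feature")
    transformed = []
    for row in rows:
        new_row = []
        for j, (gene, value) in enumerate(zip(genes, row)):
            if gene == 1:
                new_row.append(value)
            elif gene == 2:
                # Binary discretization: 1 if above the cut point, else 0.
                new_row.append(1 if value > cuts[j] else 0)
        transformed.append(new_row)
    return transformed


# Usage: discretize feature 0 at 6.0, drop feature 1, keep feature 2 as-is.
rows = [[5.1, 120.0, 0.3], [6.4, 95.0, 0.9]]
result = apply_chromosome(rows, genes=[2, 0, 1], cuts=[6.0, None, None])
```

In a wrapper setting, the transformed rows would then be passed to the base learner, and its error rate would feed back into the fitness of that candidate.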

Results: To validate the proposed ensemble model, experiments are conducted on fifteen medical datasets, and metrics such as accuracy, mean, and standard deviation are computed to evaluate the performance of the classifiers.
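
For reference, the reported metrics can be computed along the following lines. This is a minimal sketch that assumes accuracy is measured over repeated runs or folds and summarized with the sample standard deviation; the abstract does not state the evaluation protocol, and the numbers below are made up purely to show the arithmetic.

```python
import statistics


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def summarize(accuracies):
    """Mean and sample standard deviation of per-run accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Illustrative values only, not results from the paper.
run_accuracies = [0.8, 0.9, 1.0]
mean_acc, std_acc = summarize(run_accuracies)
```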

Conclusion: The extensive experiments conducted on these datasets indicate that the proposed model improves the classification rate and outperforms the base learner.

Keywords: Dimensionality reduction, discretization, evolutionary algorithm, feature selection, non-dominated sorting genetic algorithm, binary discretization.
