A Novel Feature Selection Method Based on MRMR and Enhanced Flower Pollination Algorithm for High Dimensional Biomedical Data

Chaokun Yan; Mengyuan Li; Jingjing Ma; Yi Liao; Huimin Luo; Jianlin Wang; Junwei Luo

Abstract

Background: The massive amount of biomedical data accumulated in the past decades can be utilized for diagnosing disease.

Objective: The high dimensionality, small sample sizes, and irrelevant features of these data often reduce the accuracy and speed of disease prediction. Some existing machine learning models cannot capture the patterns in such datasets accurately without feature selection.

Methods: Filter and wrapper are the two prevailing families of feature selection methods. Filter methods are fast but yield lower prediction accuracy, while wrapper methods can reach high accuracy at a high computational cost. Given the drawbacks of using either one alone, a novel feature selection method, called MRMR-EFPATS, is proposed. It hybridizes the filter method Minimum Redundancy Maximum Relevance (MRMR) with a wrapper method based on an improved Flower Pollination Algorithm (FPA). First, MRMR is employed to rank the features and quickly screen out a set of important ones. The retained features are then used to build the individuals of the population for the wrapper stage, which speeds up convergence and reduces computation time. Then, owing to its efficiency and flexibility, FPA is adopted to search for an optimal feature subset.
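The MRMR ranking step can be illustrated with a minimal sketch. It assumes features that are already discretized and uses the difference form of the criterion (relevance minus mean redundancy); the function names `mutual_information` and `mrmr_rank` and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_rank(features, labels, k):
    """Greedily select k feature indices.

    Each step picks the feature with the largest
    (relevance to labels) - (mean redundancy with already selected features).
    `features` is a list of columns, one list of values per feature.
    """
    relevance = [mutual_information(f, labels) for f in features]
    selected = []
    remaining = list(range(len(features)))
    while remaining and len(selected) < k:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                mutual_information(features[j], features[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (assumed): four classes encoded by a high bit and a low bit.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
features = [
    [0, 0, 0, 0, 1, 1, 1, 1],  # high bit of the class
    [0, 0, 0, 0, 1, 1, 1, 1],  # exact duplicate of feature 0 (redundant)
    [0, 0, 1, 1, 0, 0, 1, 1],  # low bit, complementary to feature 0
]
top_two = mrmr_rank(features, labels, 2)
```

In this toy case all three features are equally relevant, so the redundancy term is what makes the second pick skip the duplicate and prefer the complementary feature.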

Results: FPA still has some drawbacks, such as a slow convergence rate, a limited ability to explore new solutions, and a tendency to become trapped in local optima. In our work, an elite strategy is adopted to improve the convergence speed of FPA. Tabu search and adaptive Gaussian mutation are employed to improve the search capability of FPA and to help it escape from local optima. A KNN classifier with 5-fold cross-validation is used to evaluate classification accuracy.
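How a wrapper scores one candidate feature subset with KNN and 5-fold cross-validation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the binary-mask encoding, the interleaved fold assignment, the choice of k = 1, and the names `knn_predict` and `cv_accuracy` are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Predict the label of x by majority vote among its k nearest neighbours."""
    nearest = sorted(
        range(len(train_X)), key=lambda i: math.dist(train_X[i], x)
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cv_accuracy(X, y, mask, k=1, folds=5):
    """Pooled cross-validated accuracy of kNN on the features where mask is 1.

    Sample i is assigned to fold i % folds (an assumed, deterministic split).
    An empty feature subset scores 0.0.
    """
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0
    Xs = [[row[j] for j in cols] for row in X]
    correct = 0
    for f in range(folds):
        train_idx = [i for i in range(len(Xs)) if i % folds != f]
        test_idx = [i for i in range(len(Xs)) if i % folds == f]
        train_X = [Xs[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in test_idx:
            correct += knn_predict(train_X, train_y, Xs[i], k) == y[i]
    return correct / len(Xs)


# Toy data (assumed): feature 0 separates the classes, feature 1 does not.
X = [[0.0, 5], [0.1, 1], [0.2, 9], [0.3, 3], [0.4, 7],
     [1.0, 2], [1.1, 8], [1.2, 4], [1.3, 6], [1.4, 0]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
informative = cv_accuracy(X, y, [1, 0])
uninformative = cv_accuracy(X, y, [0, 1])
```

A search procedure such as FPA would call a function like `cv_accuracy` on every candidate mask, which is why reducing the number of candidate features beforehand lowers the overall cost.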

Conclusion: Extensive experimental results on six public high-dimensional biomedical datasets show that the proposed MRMR-EFPATS achieves superior performance compared to other state-of-the-art methods.

Keywords: Feature selection, flower pollination algorithm, MRMR, elite strategy, adaptive Gaussian mutation, tabu search.


Current Bioinformatics