Integrating LASSO Feature Selection and Soft Voting Classifier to Identify
Origins of Replication Sites

Yingying Yao; Shengli Zhang; Tian Xue

Abstract

Background: DNA replication plays an indispensable role in the transmission of genetic information. It is considered the basis of biological inheritance and the most fundamental process in all living organisms. Because DNA replication initiates at a specific location, namely the origin of replication, accurate prediction of origins of replication sites (ORIs) is essential for gaining insight into their relationship with gene expression.

Objective: In this study, we developed an efficient predictor, called iORI-LAVT, for identifying ORIs.

Methods: Features are extracted from three sources: mononucleotide encoding, k-mer, and ring-function-hydrogen chemical properties. The least absolute shrinkage and selection operator (LASSO) is then applied as a feature selection method to select the optimal features. After comparing the results of soft voting classifiers built from different classifier combinations, the soft voting classifier based on GaussianNB and Logistic Regression is adopted as the final classifier.
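
The encoding and voting steps named in the Methods can be illustrated with a minimal sketch. It assumes DNA sequences over the alphabet A, C, G, T. The function names are hypothetical, and the ring/functional-group/hydrogen-bond code in `NCP` is a mapping commonly used in the literature, not one stated in this abstract. LASSO selection is not shown, and the probabilities passed to `soft_vote` are made-up stand-ins for the outputs of GaussianNB and Logistic Regression.

```python
from itertools import product

# Assumed encodings (not given in the abstract).
# ONE_HOT: 4-bit mononucleotide code; NCP: (ring structure, functional group, hydrogen bond).
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (0, 0, 1)}


def mononucleotide_encoding(seq):
    """Concatenate the 4-bit one-hot code of every position."""
    return [bit for base in seq for bit in ONE_HOT[base]]


def kmer_frequencies(seq, k=2):
    """Relative frequency of each of the 4**k possible k-mers, in lexicographic order."""
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[m] / total for m in kmers]


def chemical_properties(seq):
    """Concatenate the 3-bit chemical-property code of every position."""
    return [v for base in seq for v in NCP[base]]


def soft_vote(prob_a, prob_b):
    """Average two class-probability vectors; return (predicted label, averaged vector)."""
    avg = [(a + b) / 2 for a, b in zip(prob_a, prob_b)]
    return avg.index(max(avg)), avg


# Toy usage: build one combined feature vector, then vote on made-up probabilities.
seq = "ACGTAC"
features = mononucleotide_encoding(seq) + kmer_frequencies(seq, 2) + chemical_properties(seq)
label, averaged = soft_vote([0.4, 0.6], [0.7, 0.3])
```

Soft voting averages predicted probabilities rather than counting hard labels, so a confident classifier can outweigh an uncertain one; in the toy call above the second classifier's stronger preference for class 0 decides the vote.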

Results: In 10-fold cross-validation, the prediction accuracies on the two benchmark datasets are 90.39% and 95.96%, respectively. On the independent dataset, our method achieves an accuracy of 91.3%.
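
For readers unfamiliar with the evaluation protocol, the following is a rough sketch of how a 10-fold cross-validation accuracy can be computed. It is not the authors' evaluation code: the helper names, the unstratified shuffling, the pooling of correct predictions across folds, and the toy threshold classifier are all assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(X, y, fit, predict, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and pool correct predictions."""
    correct = 0
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)
    return correct / len(X)


# Hypothetical stand-in classifier: threshold halfway between the two class means.
def fit_threshold(X, y):
    mean0 = sum(x[0] for x, t in zip(X, y) if t == 0) / y.count(0)
    mean1 = sum(x[0] for x, t in zip(X, y) if t == 1) / y.count(1)
    return (mean0 + mean1) / 2


def predict_threshold(threshold, x):
    return 1 if x[0] > threshold else 0


# Toy, well-separated data: class 0 near 0..9, class 1 near 100..109.
X = [[float(v)] for v in range(10)] + [[float(v)] for v in range(100, 110)]
y = [0] * 10 + [1] * 10
accuracy = cross_val_accuracy(X, y, fit_threshold, predict_threshold, k=10)
```

Every sample is used for testing exactly once, so the pooled accuracy is the fraction of all samples predicted correctly by a model that never saw them during training.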

Conclusion: iORI-LAVT outperforms previous predictors. We believe that iORI-LAVT is a promising alternative for further research on identifying ORIs.

Keywords: Origin of replication sites, multi-feature, LASSO, voting classifier, DNA replication, dimensional feature.


Cite As

Current Genomics
