Feature-scML: An Open-source Python Package for the Feature Importance
Visualization of Single-Cell Omics with Machine Learning

Pengfei Liang; Hao Wang; Yuchao Liang; Jian Zhou; Haicheng Li; Yongchun Zuo

Abstract

Background: Inferring feature importance is both a promise and a challenge in bioinformatics and computational biology. While multiple computational methods exist to identify the decisive factors of single-cell subpopulations, there is a need for a comprehensive toolkit that presents an intuitive and customizable view of feature importance.

Objective: We developed Feature-scML, a scalable and user-friendly toolkit that allows users to visualize and reveal the decisive factors in single-cell omics analysis.

Methods: Feature-scML incorporates three main functions: (i) seven feature selection algorithms comprehensively score and rank every feature; (ii) four machine learning approaches and an incremental feature selection (IFS) strategy jointly determine the number of selected features; (iii) the toolkit supports visualization of feature importance, model performance evaluation, and model interpretation. The source code is available at https://github.com/liameihao/Feature-scML.
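
The rank-then-grow idea behind IFS can be illustrated with a minimal sketch. It is not the Feature-scML API: all function names below are hypothetical, the ranking uses a simple between-class/within-class variance ratio as a stand-in for the package's seven scoring algorithms, and a nearest-centroid classifier with leave-one-out validation stands in for its four machine learning approaches. Only the Python standard library is assumed.

```python
"""Sketch of incremental feature selection (IFS): rank features, then grow the set."""
from statistics import mean


def f_score(column, labels):
    """Between-class over within-class variation for one feature (higher is better)."""
    overall = mean(column)
    between, within = 0.0, 0.0
    for c in sorted(set(labels)):
        values = [v for v, lab in zip(column, labels) if lab == c]
        mu = mean(values)
        between += len(values) * (mu - overall) ** 2
        within += sum((v - mu) ** 2 for v in values)
    return between / (within + 1e-12)  # small constant avoids division by zero


def rank_features(X, y):
    """Return feature indices sorted from highest to lowest score."""
    n_features = len(X[0])
    scores = [f_score([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def nearest_centroid_predict(X_train, y_train, x):
    """Assign x to the class whose mean vector is closest (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for c in sorted(set(y_train)):
        rows = [r for r, lab in zip(X_train, y_train) if lab == c]
        centroid = [mean(col) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = c, dist
    return best_label


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy using only the given feature indices."""
    hits = 0
    for i in range(len(X)):
        X_train = [[row[j] for j in features] for k, row in enumerate(X) if k != i]
        y_train = [lab for k, lab in enumerate(y) if k != i]
        x_test = [X[i][j] for j in features]
        hits += nearest_centroid_predict(X_train, y_train, x_test) == y[i]
    return hits / len(X)


def incremental_feature_selection(X, y):
    """Score the top-1, top-2, ... ranked features; return the best subset and the curve."""
    ranking = rank_features(X, y)
    curve = [loo_accuracy(X, y, ranking[:k]) for k in range(1, len(ranking) + 1)]
    best_k = curve.index(max(curve)) + 1  # smallest k that reaches the top accuracy
    return ranking[:best_k], curve


if __name__ == "__main__":
    # Toy matrix: 6 cells x 3 genes; only gene 0 separates the two groups.
    X = [[1.0, 5.0, 0.3], [1.2, 3.0, 0.9], [0.9, 4.0, 0.1],
         [3.0, 4.5, 0.5], [3.2, 3.5, 0.7], [2.9, 5.5, 0.2]]
    y = ["A", "A", "A", "B", "B", "B"]
    selected, curve = incremental_feature_selection(X, y)
    print("selected features:", selected)
    print("accuracy per subset size:", curve)
```

For brevity, the sketch ranks features on the full dataset before validation. A stricter evaluation would repeat the ranking inside each validation fold so that held-out cells do not influence the feature order.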

Results: We systematically compared the performance of the seven feature selection algorithms in Feature-scML on two single-cell transcriptome datasets. The comparison demonstrates the effectiveness and power of Feature-scML.

Conclusion: Feature-scML is effective for analyzing single-cell RNA omics datasets: it automates the machine learning process and supports customized visual analysis of the results.

Keywords: Feature ranking, bioinformatics, machine learning, Python, feature selection, visualization.

Current Bioinformatics
