Abstract
Background: Inferring feature importance is both a promise and a challenge in bioinformatics
and computational biology. Although several computational methods exist for identifying the decisive
factors of single-cell subpopulations, a comprehensive toolkit that presents an intuitive
and customizable view of feature importance is still needed.
Objective: We developed Feature-scML, a scalable and user-friendly toolkit that allows users to visualize
and reveal decisive factors in single-cell omics analysis.
Methods: Feature-scML provides three main functions: (i) seven feature selection
algorithms comprehensively score and rank every feature; (ii) four machine learning approaches,
combined with an incremental feature selection (IFS) strategy, jointly determine the number of selected features;
(iii) visualization of feature importance, model performance evaluation,
and model interpretation. The source code is available at https://github.com/liameihao/Feature-scML.
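As a rough illustration of how a ranking step and IFS fit together, the sketch below scores every feature, sorts the features by score, and then evaluates a classifier on the top 1, 2, ..., n features to pick the subset size with the best accuracy. It is not Feature-scML's own code: the variance-ratio score, the nearest-centroid classifier, the leave-one-out evaluation, the toy data, and all function names are assumptions chosen to keep the example dependency-free.

```python
import random
from statistics import mean


def variance_ratio_scores(X, y):
    """Score each feature by between-class over within-class variation.

    Assumption: a simple F-like score standing in for any ranking algorithm.
    """
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        overall = mean(row[j] for row in X)
        between, within = 0.0, 0.0
        for c in classes:
            vals = [row[j] for row, label in zip(X, y) if label == c]
            mu = mean(vals)
            between += len(vals) * (mu - overall) ** 2
            within += sum((v - mu) ** 2 for v in vals)
        scores.append(between / (within + 1e-12))
    return scores


def nearest_centroid_predict(X_train, y_train, x, features):
    """Predict the class whose centroid (on the chosen features) is closest to x."""
    centroids = {}
    for c in set(y_train):
        rows = [r for r, label in zip(X_train, y_train) if label == c]
        centroids[c] = [mean(r[j] for r in rows) for j in features]

    def squared_distance(c):
        return sum((x[j] - m) ** 2 for j, m in zip(features, centroids[c]))

    return min(sorted(centroids), key=squared_distance)


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of the nearest-centroid classifier."""
    hits = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        if nearest_centroid_predict(X_train, y_train, X[i], features) == y[i]:
            hits += 1
    return hits / len(X)


def incremental_feature_selection(X, y):
    """Rank features, then grow the subset one feature at a time.

    Simplification: the ranking uses all samples, so the accuracies are
    optimistic; a stricter version would rank inside each training fold.
    """
    scores = variance_ratio_scores(X, y)
    ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    curve = [loo_accuracy(X, y, ranking[:k]) for k in range(1, len(ranking) + 1)]
    best_k = curve.index(max(curve)) + 1  # smallest subset reaching the best accuracy
    return ranking, curve, best_k


# Toy data (hypothetical): features 0 and 1 separate the two classes, 2-4 are noise.
rng = random.Random(0)
X, y = [], []
for label in (0, 1):
    for _ in range(20):
        informative = [rng.gauss(3.0 * label, 1.0), rng.gauss(-3.0 * label, 1.0)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(3)]
        X.append(informative + noise)
        y.append(label)

ranking, curve, best_k = incremental_feature_selection(X, y)
print("ranking:", ranking)
print("accuracy per subset size:", curve)
print("selected number of features:", best_k)
```

In this sketch the IFS curve plays the role of the "number of selected features" decision: any of the ranking algorithms or classifiers could be swapped in without changing the outer loop.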
Results: We systematically compared the performance of the seven feature selection algorithms in
Feature-scML on two single-cell transcriptome datasets. The comparison demonstrates the effectiveness
and power of Feature-scML.
Conclusion: Feature-scML is effective for analyzing single-cell RNA omics datasets: it automates the
machine learning process and supports customized visual analysis of the results.
Keywords:
Feature ranking, bioinformatics, machine learning, Python, feature selection, visualization.