Comparison of Gene Selection Methods for Clustering Single-cell RNA-seq Data

Pages: 1-11

Abstract

Background: Clustering methods are applied to single-cell RNA-seq data to identify cell types and thereby understand cell differentiation and development. Because clustering methods are sensitive to the high dimensionality of single-cell RNA-seq data, one effective solution is to reduce the dimensionality by selecting a subset of genes. Numerous methods, with different underlying assumptions, have been proposed for choosing the subset of genes used for clustering.

Objective: To guide users in selecting suitable gene selection methods, we give an overview of different gene selection methods and compare their performance in terms of the differences between the selected gene sets, clustering performance, running time, and stability.
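The abstract does not name the score used to measure clustering performance. As a minimal sketch, assuming the adjusted Rand index (ARI) is used to compare predicted clusters with known cell-type labels, the calculation could look like the following. The function name and the toy labels are illustrative and not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells.

    Assumption: ARI is the clustering score; the abstract does not specify one.
    """
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: the labelings trivially agree

    # Contingency counts: cells sharing both a true label and a predicted label.
    joint = Counter(zip(labels_true, labels_pred))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every cell in a single cluster
    return (sum_joint - expected) / (max_index - expected)


# Hypothetical labels for four cells.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because ARI depends only on which cells are grouped together, relabeling the clusters does not change the score, which makes it convenient for comparing clusterings obtained from different gene sets.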

Results: We first review the data preprocessing strategies and gene selection methods used in analyzing single-cell RNA-seq data. We then analyze the overlaps among the gene sets selected by different methods and compare the clustering performance obtained with each feature gene set. The analysis reveals that the gene sets selected by the methods based on highly variable genes (HVG) and high mean genes (HMG) are the most similar, and that highly variable genes play an important role in clustering. Additionally, selecting too few genes compromises clustering performance: for example, SCMarker selected fewer genes than the other methods, which led to poorer clustering performance than M3Drop.
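To illustrate the kind of overlap analysis described here, the sketch below ranks genes by variance (a simplified stand-in for highly variable gene selection) and by mean expression (high mean genes), then measures how similar the two selected sets are. The Jaccard index, the helper names, and the toy values are assumptions for illustration. The abstract does not state which overlap measure was used, and practical HVG methods typically model variance relative to mean expression rather than using raw variance.

```python
from statistics import mean, pvariance


def top_genes(expr, score, n_top):
    """Return the set of n_top gene names with the highest score.

    expr maps gene name -> list of expression values across cells.
    """
    ranked = sorted(expr, key=lambda gene: score(expr[gene]), reverse=True)
    return set(ranked[:n_top])


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy expression matrix (hypothetical values): 4 genes x 4 cells.
toy_expr = {
    "geneA": [0, 10, 0, 10],  # variance 25, mean 5
    "geneB": [6, 6, 6, 6],    # variance 0,  mean 6
    "geneC": [1, 1, 1, 1],    # variance 0,  mean 1
    "geneD": [0, 8, 0, 8],    # variance 16, mean 4
}

hvg = top_genes(toy_expr, pvariance, n_top=2)  # simplified highly variable genes
hmg = top_genes(toy_expr, mean, n_top=2)       # high mean genes
overlap = jaccard(hvg, hmg)
```

In this toy case the two sets share one gene out of three distinct genes selected, so the overlap is one third. The same helper could be applied pairwise to the gene sets returned by any of the methods being compared.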

Conclusion: Gene selection methods perform differently depending on the scenario. HVG works well on full-transcript sequencing datasets, NBDrop and HMG perform better on 3’ end sequencing datasets, M3Drop and HMG are more suitable for large datasets, and SCMarker is the most consistent across different preprocessing methods.
