Mining Gene Expression Profile with Missing Values: An Integration of Kernel PCA and Robust Singular Values Decomposition

Page: [78 - 89] Pages: 12

Abstract

Background: Gene expression profiling and transcriptomics provide valuable information about the role of genes that are differentially expressed between two or more samples. Analysing high-throughput DNA microarray data that contain missing values under various experimental conditions is both important and challenging.

Objectives: Graphical data visualizations of the expression of all genes in a particular cell provide holistic views of gene expression patterns, which improve our understanding of cellular systems under normal and pathological conditions. However, current visualization methods are sensitive to missing values, which are frequently observed in microarray-based gene expression profiling, potentially affecting the subsequent statistical analyses.

Methods: This study addressed the problem of missing values by comparing imputation methods in the context of the gene expression biplot (GE biplot), one of the most popular gene visualization techniques. The effects of missing values on mining differentially expressed genes in gene expression data were evaluated using four well-known imputation methods: Robust Singular Value Decomposition (Robust SVD), Column Average (CA), Column Median (CM), and K-nearest Neighbors (KNN). The Frobenius norm and absolute distances were used to measure the accuracy of the methods.
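As a rough illustration of this kind of comparison, the sketch below imputes a simulated matrix by column average, column median, and a simple K-nearest-neighbors rule, then scores each result with the Frobenius norm and the mean absolute distance. It is a minimal sketch under stated assumptions, not the study's implementation: the function names, the low-rank simulation, the 10% missing rate, k = 5, and the particular KNN distance are all illustrative choices, and Robust SVD is omitted.

```python
import numpy as np

def impute_column(X, stat=np.nanmean):
    """Fill each missing entry with a summary (mean or median) of its column."""
    X = X.copy()
    fill = stat(X, axis=0)                 # one value per column, NaNs ignored
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = fill[cols]
    return X

def impute_knn(X, k=5):
    """Fill missing entries of a row (gene) from its k nearest rows.

    Distance is the root mean squared difference over the columns that both
    rows have observed (an assumption; other KNN variants exist).
    """
    out = X.copy()
    missing = np.isnan(X)
    for i in np.where(missing.any(axis=1))[0]:
        obs = ~missing[i]
        diff = X[:, obs] - X[i, obs]       # NaN where the other row is missing
        n_shared = (~np.isnan(diff)).sum(axis=1)
        with np.errstate(invalid="ignore", divide="ignore"):
            dist = np.sqrt(np.nansum(diff ** 2, axis=1) / n_shared)
        dist[n_shared == 0] = np.inf       # no shared columns: unusable row
        dist[i] = np.inf                   # a row is not its own neighbour
        for j in np.where(missing[i])[0]:
            cand = np.where(~missing[:, j] & np.isfinite(dist))[0]
            nearest = cand[np.argsort(dist[cand])[:k]]
            # Fall back to the column mean if no usable neighbour exists.
            out[i, j] = X[nearest, j].mean() if nearest.size else np.nanmean(X[:, j])
    return out

def frobenius_error(X_true, X_imp):
    """Frobenius norm of the difference between complete and imputed data."""
    return np.linalg.norm(X_true - X_imp, ord="fro")

def mean_abs_error(X_true, X_imp, mask):
    """Mean absolute distance over the entries that were imputed."""
    return np.abs(X_true - X_imp)[mask].mean()

# Hypothetical simulation: 200 genes (rows) x 12 samples (columns).
rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 12))
X_true += 0.1 * rng.normal(size=X_true.shape)
mask = rng.random(X_true.shape) < 0.10     # about 10% missing at random
X_miss = X_true.copy()
X_miss[mask] = np.nan

imputed = {
    "CA": impute_column(X_miss, np.nanmean),
    "CM": impute_column(X_miss, np.nanmedian),
    "KNN": impute_knn(X_miss, k=5),
}
for name, X_imp in imputed.items():
    print(name, frobenius_error(X_true, X_imp), mean_abs_error(X_true, X_imp, mask))
```

Because observed entries are left untouched, the Frobenius norm here reflects only the imputed cells; the ranking of methods on this toy matrix says nothing about their ranking on real microarray data.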

Results: Three numerical experiments were performed to analyze the performance of each method, using (i) simulated data and publicly available (ii) colon cancer and (iii) leukemia data. CM and KNN performed better than Robust SVD and CA at identifying the index gene profile in the biplot visualization, both in the simulation study and in the colon cancer and leukemia microarray datasets.

Conclusion: The impact of missing values on the GE biplot was smaller when the data matrix was imputed by KNN than by CM. This study concluded that KNN performed satisfactorily in generating a GE biplot in the presence of missing values in microarray data.
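To make concrete why a biplot needs a complete matrix, the sketch below computes generic biplot coordinates from a singular value decomposition, which cannot be run on data containing missing entries. It is an illustration only: the function name `biplot_coordinates`, the column centring, and the scaling of the markers are assumptions, and the GE biplot used in the study may be constructed differently.

```python
import numpy as np

def biplot_coordinates(X, n_components=2):
    """Return row (gene) and column (sample) markers for a generic biplot.

    X must be complete (no NaNs), with genes in rows and samples in columns.
    Centring by column is an assumption made for this sketch.
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    gene_markers = U[:, :n_components] * s[:n_components]  # scaled row scores
    sample_markers = Vt[:n_components].T                    # column directions
    return gene_markers, sample_markers

# Hypothetical example: a matrix of exact rank 2, so two components suffice.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 8))
G, S = biplot_coordinates(X)

# The product of the two marker sets is the rank-2 approximation of the
# centred matrix, which is the quantity a two-dimensional biplot displays.
approx = G @ S.T
print(np.allclose(approx, X - X.mean(axis=0)))
```

One way to gauge the effect of imputation on such a plot is to compare the rank-2 approximations obtained from the complete and the imputed matrix, since the markers themselves are defined only up to sign.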

Keywords: Gene expression profile, simulation, GE biplot, Kernel principal component analysis, singular value decomposition.
