Mining Gene Expression Profile with Missing Values: An Integration of Kernel PCA and Robust Singular Values Decomposition

Md. Saimul Islam; Md. Aminul Hoque; Md. Sahidul Islam; Mohammad Ali; Md. Bipul Hossen; Md. Binyamin; Amir Feisal Merican; Kohei Akazawa; Nishith Kumar; Masahiro Sugimoto

Abstract

Background: Gene expression profiling and transcriptomics provide valuable information about the role of genes that are differentially expressed between two or more samples. Analysing high-throughput DNA microarray data that contain missing values under various experimental conditions is both important and challenging.

Objectives: Graphical data visualizations of the expression of all genes in a particular cell provide holistic views of gene expression patterns, which improve our understanding of cellular systems under normal and pathological conditions. However, current visualization methods are sensitive to missing values, which are frequently observed in microarray-based gene expression profiling, potentially affecting the subsequent statistical analyses.

Methods: In this study, we examined how different imputation methods handle missing values when the data are visualized with the gene expression biplot (GE biplot), one of the most popular gene visualization techniques. The effects of missing values on mining differentially expressed genes in gene expression data were evaluated using four well-known imputation methods: Robust Singular Value Decomposition (Robust SVD), Column Average (CA), Column Median (CM), and K-nearest Neighbors (KNN). The Frobenius norm and absolute distances were used to measure the accuracy of the methods.
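As an illustration of the evaluation described above, the following is a minimal sketch of three of the imputation methods (CA, CM, and a simplified KNN) together with the two accuracy measures. It is not the authors' implementation. The function names, the orientation of the matrix (rows as genes, columns as samples), the choice of root-mean-square distance over shared observed columns, and the default `k` are all assumptions made for this example, and Robust SVD is omitted.

```python
import numpy as np

def impute_column_stat(X, stat="mean"):
    """Column Average (stat="mean") or Column Median (stat="median") imputation.

    Assumes every column has at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    fill = np.nanmean(X, axis=0) if stat == "mean" else np.nanmedian(X, axis=0)
    out = X.copy()
    rows, cols = np.where(np.isnan(out))
    out[rows, cols] = fill[cols]  # each gap gets its own column's statistic
    return out

def impute_knn(X, k=3):
    """Simplified KNN imputation over rows (assumed here to be genes).

    Simplification: the k nearest rows are chosen first, and neighbours that
    are themselves missing the target column are then skipped.
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    col_mean = np.nanmean(X, axis=0)  # fallback when no neighbour has a value
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        miss = np.isnan(X[i])
        dists = np.full(X.shape[0], np.inf)
        for j in range(X.shape[0]):
            if j == i:
                continue
            # compare only columns observed in both rows
            shared = ~miss & ~np.isnan(X[j])
            if shared.any():
                dists[j] = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
        neighbours = np.argsort(dists, kind="stable")[:k]
        neighbours = neighbours[np.isfinite(dists[neighbours])]
        for c in np.where(miss)[0]:
            vals = X[neighbours, c]
            vals = vals[~np.isnan(vals)]
            out[i, c] = vals.mean() if vals.size else col_mean[c]
    return out

def frobenius_error(X_true, X_imputed):
    """Frobenius norm of the difference between complete and imputed matrices."""
    return np.linalg.norm(np.asarray(X_true) - np.asarray(X_imputed), ord="fro")

def absolute_error(X_true, X_imputed):
    """Sum of absolute element-wise differences."""
    return np.abs(np.asarray(X_true) - np.asarray(X_imputed)).sum()
```

In a comparison of this kind, one would typically start from a complete matrix, delete a known fraction of entries, impute them with each method, and then compare the imputed matrix with the original using both measures; smaller values indicate a more accurate reconstruction.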

Results: Three numerical experiments were performed to analyze the performance of each method, using (i) simulated data, (ii) publicly available colon cancer data, and (iii) publicly available leukemia data. The results showed that CM and KNN performed better than Robust SVD and CA at identifying the index gene profile in the biplot visualization, both in the simulation study and in the colon cancer and leukemia microarray datasets.

Conclusion: The impact of missing values on the GE biplot was smaller when the data matrix was imputed by KNN than when it was imputed by CM. We conclude that KNN performs satisfactorily in generating a GE biplot in the presence of missing values in microarray data.

Keywords: Gene expression profile, simulation, GE biplot, Kernel principal component analysis, singular value decomposition.


Cite As

Current Bioinformatics
