Predicting LncRNA Subcellular Localization Using Unbalanced Pseudo-k Nucleotide Compositions

Xiao-Fei Yang; Yuan-Ke Zhou; Lin Zhang; Yang Gao; Pu-Feng Du

Abstract

Background: Long non-coding RNAs (lncRNAs) are transcripts longer than 200 nucleotides that function in the regulation of gene expression. Growing evidence shows that the biological functions of lncRNAs are intimately related to their subcellular localizations. It is therefore important to determine the subcellular localization of lncRNAs.

Methods: In this paper, we propose a novel method to predict the subcellular localization of lncRNAs. To use lncRNA sequence information more comprehensively, we represent each lncRNA sequence by both its k-mer nucleotide composition and its sequence-order correlated factors. A feature selection technique based on Analysis of Variance (ANOVA) is then applied to obtain the optimal feature subset. Finally, a support vector machine (SVM) performs the prediction.
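As a minimal sketch of two of the steps named above, the following Python code computes a normalized k-mer composition and ranks features by a one-way ANOVA F-statistic. It is an illustration under stated assumptions, not the authors' implementation: the function names (`kmer_composition`, `anova_f`, `rank_features`), the RNA alphabet `ACGU`, the conversion of T to U, and the skipping of ambiguous bases are all hypothetical choices made here. The sequence-order correlated factors and the SVM are not shown.

```python
from itertools import product


def kmer_composition(seq, k=2, alphabet="ACGU"):
    """Return normalized k-mer frequencies of seq, in lexicographic k-mer order."""
    seq = seq.upper().replace("T", "U")  # assumption: treat DNA-style input as RNA
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing ambiguous bases such as N
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def anova_f(values, labels):
    """One-way ANOVA F-statistic of a single feature across the classes in labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, c = len(values), len(groups)
    if c < 2 or n <= c:
        raise ValueError("need at least two classes and more samples than classes")
    grand_mean = sum(values) / n
    # Variation of the class means around the grand mean.
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of the samples around their own class mean.
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (c - 1)) / (within / (n - c))


def rank_features(X, y):
    """Return feature indices sorted from highest to lowest F-statistic."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)
```

A feature with a large F-statistic varies more between localization classes than within them, so keeping the top-ranked features is one simple way to arrive at a reduced feature subset.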

Results: The proposed method reached an AUC of 0.9695, indicating that the predictor is an efficient and reliable tool for determining lncRNA subcellular localization. Furthermore, the predictor reached a maximum overall accuracy of 90.37% in leave-one-out cross-validation, clearly outperforming the existing state-of-the-art method.
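To make the evaluation protocol concrete, the sketch below shows how an overall accuracy can be computed under leave-one-out cross-validation: each sample is held out once, the classifier is fitted on the remaining samples, and the fraction of correct held-out predictions is reported. The 1-nearest-neighbor rule used here is only a dependency-free stand-in for the SVM of the paper, and the names `nearest_neighbor_predict` and `leave_one_out_accuracy` are hypothetical.

```python
def nearest_neighbor_predict(train_X, train_y, x):
    """Stand-in classifier: return the label of the closest training sample."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]


def leave_one_out_accuracy(X, y, predict=nearest_neighbor_predict):
    """Hold out each sample once and return the fraction predicted correctly."""
    correct = 0
    for i in range(len(X)):
        # Train on every sample except the i-th one.
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)
```

Any classifier with the same `predict(train_X, train_y, x)` shape can be passed in place of the stand-in, which keeps the evaluation loop independent of the model being assessed.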

Conclusion: The results demonstrate that the proposed predictor is feasible and powerful for predicting lncRNA subcellular localization. To facilitate subsequent genetic sequence research, we have shared the source code at https://github.com/NicoleYXF/lncRNA.

Keywords: Long non-coding RNA, subcellular localization, sequence order correlated factors, feature selection, analysis of variance, support vector machine.


Cite As

Current Bioinformatics
