Formalization and Semantic Integration of Heterogeneous Omics Annotations for Exploratory Searches

Page: [162 - 178] Pages: 17

Abstract

Aim: To help researchers and practitioners uncover the functional aspects of the human cellular system through exploratory searches over semantically integrated, heterogeneous, and geographically dispersed omics annotations.

Background: Improving health and quality of life is one of the motives that continually drives researchers and practitioners to uncover poorly understood aspects of the human cellular system. Inferring new knowledge from known facts requires a reasonably large amount of data in a well-structured, integrated, and unified form. With the advent of high-throughput and sensor technologies in particular, biological data is growing at an astronomical rate and is becoming increasingly heterogeneous and geographically dispersed. Several data integration systems have been deployed to cope with data heterogeneity and global dispersion. Systems based on semantic data integration models are more flexible and extensible than syntax-based ones, but they still lack aspect-based data integration, persistence, and querying. Furthermore, these systems do not fully support warehousing biological entities as semantic associations of the kind that occur naturally in the human cell.

Objective: To develop an aspect-oriented formal data integration model that semantically integrates heterogeneous and geographically dispersed omics annotations and supports exploratory querying of the integrated data.

Methods: We propose an aspect-oriented formal data integration model that uses web semantics standards to formally specify each of its constructs. The proposed model supports the aspect-oriented representation of biological entities while addressing data heterogeneity and global dispersion. It associates and warehouses biological entities in the way they relate to each other in a physical cell system.
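
As a rough illustration only, the idea of storing each entity's associations grouped by aspect and then walking them across omics layers might be sketched as follows. This is not the authors' model or implementation: the class `AspectStore`, its methods, the aspect names, the predicates, and all identifiers such as `gene:G1` are hypothetical placeholders, and a plain in-memory dictionary stands in for a store built on web semantics standards.

```python
from __future__ import annotations

from collections import defaultdict


class AspectStore:
    """Toy in-memory store (hypothetical): entity -> aspect -> [(predicate, object)]."""

    def __init__(self) -> None:
        self._data = defaultdict(lambda: defaultdict(list))

    def add(self, subject: str, aspect: str, predicate: str, obj: str) -> None:
        # Record one association of `subject` under the given aspect.
        self._data[subject][aspect].append((predicate, obj))

    def aspects(self, entity: str) -> list[str]:
        # Aspects under which the entity has at least one association.
        return sorted(self._data.get(entity, {}))

    def related(self, entity: str, aspect: str) -> list[str]:
        # Entities linked to `entity` under one aspect.
        return [obj for _, obj in self._data.get(entity, {}).get(aspect, [])]

    def explore(self, start: str, max_hops: int = 2) -> list[str]:
        # Breadth-first walk over all aspects; returns entities reached
        # within `max_hops` associations, in order of discovery.
        seen = {start}
        frontier = [start]
        reached: list[str] = []
        for _ in range(max_hops):
            next_frontier: list[str] = []
            for entity in frontier:
                for aspect in self.aspects(entity):
                    for obj in self.related(entity, aspect):
                        if obj not in seen:
                            seen.add(obj)
                            reached.append(obj)
                            next_frontier.append(obj)
            frontier = next_frontier
        return reached


# Placeholder data: identifiers, aspects, and predicates are illustrative only.
store = AspectStore()
store.add("gene:G1", "expression", "encodes", "protein:P1")
store.add("gene:G1", "location", "locatedOn", "chromosome:C1")
store.add("protein:P1", "function", "participatesIn", "pathway:PW1")

print(store.aspects("gene:G1"))
print(store.explore("gene:G1", max_hops=2))
```

In this sketch, an exploratory search starting from a gene reaches a pathway through the protein it encodes, which mirrors, in a very reduced form, querying an entity across associated omics layers.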

Result: To show the significance of the proposed model, we developed a data warehouse and information retrieval system based on a multi-layered, multi-modular software architecture that complies with the model. Results show that our model supports gathering, associating, integrating, persisting, and querying each entity with respect to all of its possible aspects, within or across the associated omics layers.

Conclusion: Formal specifications help address data integration issues by providing a formal means of understanding omics data based on meaning rather than syntax.

Keywords: Formal specifications, semantic data schema, omics integration model, web data semantics, data heterogeneity, data warehouse, omics annotations, multi-layered architecture.
