A major goal of proteomics is the complete description of all the proteins present in cells, tissues, and biological fluids. The method of choice for identifying and characterizing proteins for such purposes is protease digestion coupled with mass spectrometry (MS) and subsequent protein sequence database searching. New software tools that increase the sensitivity and specificity of MS-based protein identification, along with methods for evaluating the validity of peptide-mass spectrum matches, have been developed, and existing software has generally been improved. However, with the ongoing rapid increase in both the volume and the fragmentation of publicly available MS protein data, the development and adoption of data standards have become pivotal to the realization of integrated systems biology investigations. Unfortunately, the native data formats used by each type of mass spectrometer, each database search engine, and each public database currently differ. The diverse, nontransparent nature of these proprietary data structures complicates the necessary data integration and data comparison across experiments. To overcome this problem, data standards have been developed using the Extensible Markup Language (XML). To date, the most comprehensive standardization efforts have been conducted concurrently by the Institute for Systems Biology (mzXML, pepXML, protXML) and the Proteomics Standards Initiative (mzData, PSI-MI). Their standards eliminate the need to support multiple input formats and significantly facilitate the exchange and publication of MS-based proteomic data. In this article, we also discuss the standards used for biological proteomic data representation in order to facilitate interpretation and dissemination of research results.
Keywords: Proteomics, bioinformatics, open XML