The Importance of Bioinformatics

The ultimate goal of proteomics is to obtain a picture of the entire complement of proteins without gaps. Genomics has already achieved this goal at the level of DNA and RNA by mapping complete genotypes. Proteomics, however, aims to describe phenotypes that display significantly more complex functional diversity in a dynamic environment. Historically, proteomics approached this challenge with comparatively primitive methods such as two-dimensional gels, which gave the genomics field a competitive edge. In the last decade, however, mass spectrometry has become the method of choice, and recent advances allow the expression and modification states of thousands of proteins to be measured in a single experiment. In the last few years, the number of identified PTM sites, in particular phosphorylation sites, has increased up to 100-fold [47]. Furthermore, mass spectrometry enables the reconstruction of protein interactions in networks and complexes. Shotgun proteomics, the most widely used approach, generates thousands of spectra per hour. Computational methods therefore have to cope with a huge amount of data and a combinatorial explosion in the number of potential molecular states of proteins. In the early era of mass spectrometry as a high-throughput technology, computational analysis was commonly considered the “Achilles heel of proteomics” [48] because of alarmingly high false discovery rates and the absence of adequate statistical methods. Fortunately, the establishment of stringent standards by the community [49] and the development of robust computational methods brought false discovery rates down to one percent and reduced the fraction of unassigned spectra to 10% [50].

The primary problem that all computational approaches try to solve is to assign a given MS/MS spectrum to a peptide sequence in the shortest possible time. The most common approach is to generate theoretical fragment masses for candidate peptides from a specified protein sequence database and to match these against experimental spectra. The pool of possible peptides is mainly defined by the proteolytic enzyme, the mass tolerance, and the specified PTMs. Numerous software tools have been developed to this end [51]; they differ mainly in how they score the similarity between calculated and experimental spectra and in how they statistically validate the results. SEQUEST [30] is one of the first and most commonly used tools for MS/MS-based proteomics. Its scoring scheme is based on spectral correlation functions that basically count “matched peaks,” defined as the fragment ions common to the computed and experimental spectra. Mascot [31] extends this approach by estimating the probability of observing the shared peak count by chance. Because Mascot is commercial software, the underlying algorithms are not publicly available. The search engine Andromeda [32], which is integrated into the freely available MaxQuant platform [52], also employs probabilistic scores. Notably, because precursor ions are selected for fragmentation at low resolution to ensure high sensitivity, coeluting peptides with similar masses are frequently cofragmented. While the resulting “chimerical” MS/MS spectra [53] usually distort the detection and quantitation of peptides, Andromeda includes an algorithm that detects the “second” peptide and uses this information to increase the identification rate.
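
The matching step described above can be illustrated with a minimal Python sketch. It computes singly charged b- and y-ion m/z values for a candidate peptide and counts how many of them fall within a tolerance of an observed peak. The function names, the 0.02 tolerance, and the restriction to unmodified, singly charged b/y ions are assumptions made for illustration; this is a plain shared peak count, not the scoring function of SEQUEST, Mascot, or Andromeda.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O, added to C-terminal (y) fragments
PROTON = 1.007276   # charge carrier for singly charged ions


def fragment_mz(peptide):
    """Return (b_ions, y_ions) as m/z lists for singly charged fragments."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal prefix
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal suffix
    return b_ions, y_ions


def shared_peak_count(theoretical, observed, tol=0.02):
    """Count theoretical fragments with an observed peak within +/- tol."""
    return sum(any(abs(t - o) <= tol for o in observed) for t in theoretical)


b_ions, y_ions = fragment_mz("PEPTIDE")
observed_peaks = [b_ions[0], y_ions[2], 500.0]  # two true fragments, one noise peak
print(shared_peak_count(b_ions + y_ions, observed_peaks))
```

A probabilistic engine would go one step further and ask how likely this count is to arise by chance, which is the idea the text attributes to Mascot.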

Other computational tools such as Protein Prospector [54] employ empirical scoring schemes that incorporate the number of matched peaks as well as the fraction of the total peak intensity that they explain. When it comes to the identification of PTM sites, however, all methods face the same combinatorial explosion of theoretical peptides when too many variable modification types are allowed. Consequently, spectra-to-peptide searches are usually restricted to up to three modifications. Byonic [55], which is also based on the principle of matching experimental to theoretical spectra, allows a larger number of modification types by setting an upper limit on the total occurrence of each modification. Furthermore, Byonic provides “wildcard” searches that allow the detection of unanticipated modifications by searching within specified mass delta windows.
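
To make the combinatorial explosion concrete, the following Python sketch counts how many modified forms of a single peptide a search would have to consider. The function name `count_modified_forms`, the modification names, and the cap on the total number of modifications per peptide are hypothetical choices for illustration; the sketch does not reproduce how Byonic or any other tool limits its search space.

```python
def count_modified_forms(peptide, variable_mods, max_mods=None):
    """Count distinct modification states of a peptide.

    variable_mods maps a modification name to the residues it may occupy.
    Each residue carries at most one modification. If max_mods is given,
    only forms with at most that many modified residues are counted.
    """
    # ways[k] = number of forms carrying exactly k modifications so far
    ways = [1]
    for aa in peptide:
        options = sum(1 for residues in variable_mods.values() if aa in residues)
        new = ways + [0]
        for k in range(1, len(new)):
            new[k] = (ways[k] if k < len(ways) else 0) + options * ways[k - 1]
        ways = new
    limit = len(ways) if max_mods is None else max_mods + 1
    return sum(ways[:limit])


phospho_only = {"phospho": "STY"}
two_types = {"phospho": "STY", "hypothetical_mod": "S"}

print(count_modified_forms("STYSTY", phospho_only))              # unrestricted
print(count_modified_forms("STYSTY", phospho_only, max_mods=3))  # capped
print(count_modified_forms("STYSTY", two_types))                 # extra mod type
```

Every additional modification type multiplies the number of states at each residue it can occupy, which is why capping the number of modifications per peptide keeps the candidate pool manageable.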

In addition to the combinatorial explosion of theoretical peptides, another challenge in the analysis of PTMs is their precise localization within peptides. Since PTM sites of the same protein commonly display distinct behaviors [56], it is imperative to determine their exact positions. To this end, Ascore [57] assesses the probability of correct site localization based on the presence and intensity of site-determining ions. The corresponding algorithm essentially reflects the cumulative binomial probability of identifying site-determining ions. The same concept is used by the “localization probability score” [56], which is integrated into MaxQuant.
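
The cumulative binomial idea can be sketched in a few lines of Python. Given the number of site-determining ions that could be observed, the number actually matched, and an assumed per-ion probability `p` of a random match, the sketch computes the probability of seeing at least that many matches by chance and converts it to a score. The function names, the example value of `p`, and the example ion counts are assumptions for illustration; this is not the published Ascore implementation or MaxQuant's localization probability.

```python
from math import comb, log10


def cumulative_binomial(n_trials, n_matched, p):
    """P(X >= n_matched) for X ~ Binomial(n_trials, p)."""
    return sum(
        comb(n_trials, k) * p**k * (1 - p) ** (n_trials - k)
        for k in range(n_matched, n_trials + 1)
    )


def localization_score(n_trials, n_matched, p):
    """Score a candidate site as -10 * log10 of the chance probability.

    Assumes 0 < p < 1 so that the probability is strictly positive.
    """
    return -10 * log10(cumulative_binomial(n_trials, n_matched, p))


# Two candidate sites, each with 8 possible site-determining ions
# (hypothetical numbers); site A matches 6 of them, site B only 2.
score_a = localization_score(8, 6, p=0.05)
score_b = localization_score(8, 2, p=0.05)
print(score_a > score_b)
```

The larger the gap between the scores of the best and second-best candidate sites, the more confidently the modification can be assigned to one position.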

After the identification of peptides and associated PTMs, the output scores of database search tools are translated into estimated false discovery rates (FDRs). To this end, “target-decoy searching” [58] is commonly applied. The main idea of this approach is to search MS/MS spectra against a combined database that contains the target protein sequences together with their reversed (decoy) counterparts. Under the assumption that false matches to sequences from the original database and matches to decoy peptide sequences follow the same distribution, peptide identifications are filtered using score cutoffs that correspond to chosen FDRs.
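
The target-decoy procedure can be sketched as follows. The Python code builds reversed decoy sequences, estimates the FDR at a score cutoff as the ratio of decoy to target matches above that cutoff, and picks the most permissive cutoff that still meets a requested FDR. The function names, the `REV_` prefix, the decoys-over-targets estimator, and the toy scores are assumptions for illustration; real pipelines differ in the estimator and in how ties and protein-level FDRs are handled.

```python
def reverse_decoys(proteins):
    """Return decoy entries made by reversing each target sequence."""
    return {"REV_" + name: seq[::-1] for name, seq in proteins.items()}


def fdr_at_cutoff(psms, cutoff):
    """Estimate FDR among matches scoring >= cutoff.

    psms is a list of (score, is_decoy) pairs from a search against the
    combined target + decoy database.
    """
    targets = sum(1 for score, is_decoy in psms if score >= cutoff and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= cutoff and is_decoy)
    if targets == 0:
        return 0.0 if decoys == 0 else float("inf")
    return decoys / targets


def score_cutoff(psms, max_fdr=0.01):
    """Lowest score cutoff whose estimated FDR is <= max_fdr, or None."""
    best = None
    for cutoff in sorted({score for score, _ in psms}, reverse=True):
        if fdr_at_cutoff(psms, cutoff) <= max_fdr:
            best = cutoff
    return best


# Toy search result: (score, matched a decoy sequence?)
psms = [(90, False), (80, False), (70, False), (60, True), (50, False), (40, True)]
print(reverse_decoys({"P1": "MKT"}))
print(score_cutoff(psms, max_fdr=0.01))
```

Identifications scoring at or above the returned cutoff would be kept, and the decoy hits among them would be discarded before reporting.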

Taken together, technological advances and the accompanying development of computational methods now allow the routine identification of thousands of proteins, including PTM sites, giving a global and, hopefully soon, a complete picture of the proteome. Bioinformatics approaches have mastered many problems in the analysis of proteomics data but still face several challenges, including the decryption of unmatched spectra. The accumulation of detected PTM sites across studies has been managed by various databases, including UniProt [2], PhosphoSite [59], and PHOSIDA [60].
