Computational and Bioinformatics Tools for Phosphoproteomics

The maturation of proteomics technologies has produced an avalanche of data that contains potential insights into the workings of living organisms. In common with genomics and the emerging field of metabolomics, proteomics permits researchers to study how the components of a cell work in concert with one another. When proteomics data are merged with data from the other omics technologies, an integrated picture of how biology is regulated will develop. Processing gigabytes of raw data, from protein identification through quantitation to mapping the results onto known biological processes, is now impossible to do manually. Sophisticated algorithms and software tools are needed to process the raw data and to derive biological insight from it. Computational proteomics is an emerging field. While proteomics has much in common with genomics, particularly in the bioinformatic arenas of functional analysis and data mining, it is sufficiently different that dedicated computational tools are necessary to produce reliable results for downstream bioinformatics analysis. Phosphoproteomics data in particular are distinctive in that individual sites may be regulated quite independently of any change in protein expression. Therefore, the site, not just the protein, becomes the focus of the biology. As described in the introduction, many proteins carry multiple phosphorylation sites that are regulated independently of one another. Unfortunately, most bioinformatics tools have been developed from a genomic perspective and fail to capture the intricacies of phosphorylation-dependent (or other PTM-dependent) regulation. Computational proteomics and bioinformatics tools to interrogate and integrate the phosphoproteome are still maturing but are attracting increasing interest in the computational biology community.

The identification of phosphopeptides in a phosphoproteomics experiment begins like any other proteomics experiment, with peptide sequencing via data-dependent MS/MS. After all the data have been collected, database searching tools are applied to the MS/MS spectra. Search engine algorithms, such as Mascot [245] and Sequest [246], calculate a theoretical spectrum for each candidate peptide sequence in the database and report the sequence whose theoretical spectrum most closely matches the measured precursor mass and tandem mass spectrum. Most search engines report two measures of statistical significance for peptide identification. The first focuses on individual peptides and represents the likelihood that a match between theoretical and experimental spectra occurred by chance. The second is the false discovery rate (FDR) for the entire data set, which estimates the proportion of incorrect matches expected among all identified peptides [247].
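The idea of a theoretical spectrum can be made concrete with a minimal Python sketch. It computes singly charged b- and y-ion m/z values from standard monoisotopic residue masses and counts how many fall within a tolerance of observed peaks. The function names, the 0.02 m/z tolerance, and the simple peak-counting comparison are assumptions made for illustration; Mascot and Sequest use their own, more elaborate scoring models.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PHOSPHO = 79.96633  # HPO3 added to a phosphorylated S, T, or Y
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide, phospho_sites=()):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    phospho_sites holds 0-based indices of residues carrying a phosphate.
    """
    masses = [
        RESIDUE_MASS[aa] + (PHOSPHO if i in phospho_sites else 0.0)
        for i, aa in enumerate(peptide)
    ]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # b1 .. b(n-1), from the N-terminus
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1), from the C-terminus
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def count_matched(theoretical, observed, tol=0.02):
    """Count theoretical ions with at least one observed peak within tol."""
    return sum(any(abs(t - o) <= tol for o in observed) for t in theoretical)


if __name__ == "__main__":
    b, y = fragment_ions("PEPTIDEK", phospho_sites={3})
    print([round(x, 4) for x in b])
    print([round(x, 4) for x in y])
```

Only fragments that contain the modified residue shift by the mass of the phosphate, which is what later makes some ions informative about the site.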

Analysis of phosphopeptides adds several layers of complexity to the identification of unmodified peptides. First, a search of phosphopeptide data requires an additional variable modification in the search parameters, which greatly enlarges the search space (each additional modifiable residue doubles the number of candidate forms of a peptide) and decreases the likelihood of matching a spectrum to a sequence at a given FDR. Second, as described previously, localizing the site of phosphorylation can be very difficult [248]. Localization requires identification of a fragment ion that contains the phosphate group and is unique to the phosphorylated amino acid. In practice, however, such fragment ions may be absent or indistinguishable from noise. In general, search engines have not been optimized to localize modification sites, and only recently have a few integrated scoring measures of localization confidence. Multiply phosphorylated peptides exacerbate the problem, because the number of possible site combinations within a given sequence grows combinatorially, which in turn makes confident localization more difficult.
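The growth in candidates can be illustrated with a short Python sketch. It assumes phosphorylation is allowed only on S, T, and Y, and the helper names are hypothetical. With n modifiable residues, a variable modification yields 2^n forms of a peptide in total, and a peptide known to carry k phosphates has C(n, k) possible site combinations.

```python
from itertools import combinations


def phospho_sites(peptide, residues="STY"):
    """Return 0-based indices of residues that could carry a phosphate."""
    return [i for i, aa in enumerate(peptide) if aa in residues]


def candidate_isoforms(peptide, n_phospho, residues="STY"):
    """Enumerate every placement of n_phospho phosphates on the peptide."""
    return list(combinations(phospho_sites(peptide, residues), n_phospho))


def variable_mod_forms(peptide, residues="STY"):
    """Total forms (0..n phosphates) a search engine must consider."""
    return 2 ** len(phospho_sites(peptide, residues))


if __name__ == "__main__":
    seq = "ASTPYSK"  # toy sequence with four S/T/Y residues
    print(phospho_sites(seq))
    print(candidate_isoforms(seq, 2))
    print(variable_mod_forms(seq))
```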

Several software packages are now available (Table 2.1) that calculate localization probabilities, making it possible to derive a measure of confidence in phosphosite localization for a given spectrum [248]. Such programs must consider numerous variables that can affect phosphosite identification. First, they must distinguish true peaks from chemical noise. Next, they must decide which peaks are suggestive of a modification site and how much weight each peak should have in determining the site localization. For example, a given peak may be the result of two or more possible fragments, and the software must take this into account. Finally, phosphosite identification tools must calculate a confidence metric for the particular site that is called by the analysis.

Table 2.1 Site localization tools.

| Tool | Type | Access | | Ref. |
| | Probability based | | | |
| | Score difference | No tool available | Yes (All) | |
| | Probability based | | Yes (Top 2) | [251, 252] |
| Mascot Delta | Score difference | | Yes (All) | |
| MaxQuant PTM Score | Probability based | | Yes (All) | |
| | Probability based | Available as part of Proteome Discoverer | Yes (All) | |
| | Probability based | Request from developers | Yes (All) | |
| | Score difference | | | |
| | Score difference | | Yes (All) | |
| | Score difference | | Yes (All) | |
| SLoMo/Turbo SLoMo | Probability based | | Yes (Top 2) | |

Adapted from “Perkins, D.N., Pappin D.J., Creasy D.M., Cottrell J.S. (1999), NCBI.”

Two main strategies are built into phosphosite localization scoring mechanisms [248, 249]. The first measures the probability of each possible phosphorylation site using ions from the fragmentation spectra. First, “site-determining” ions are identified from the theoretical fragmentation spectrum of the identified peptide sequence. These are fragment ions that can be used to uniquely assign a particular phosphorylation site. The probability of a phosphosite match for each potential modification site is then determined by comparing the number of theoretical site-determining ions with the number of those ions actually matched in the experimental spectrum. This probability is calculated for each possible site, and the resulting probabilities are compared with one another to call the most likely phosphosite. Several popular localization algorithms use this strategy, including A-Score [254], PTM score [175], PhosphoRS [251], and LuciPhor [253]. The second strategy for measuring confidence in site localization uses results from the search engine report. The search engine provides an identification score for each possible phosphopeptide isoform of a given sequence. The difference between the scores can be used to estimate the reliability of a given localization. Tools that rely on this strategy include Mascot Delta Score [258], SLIP [250], and D-Score [260].

The output of these data analysis tools still varies considerably. In one case, the Association of Biomolecular Resource Facilities (ABRF) provided 35 participants with the same phosphoproteomics data set and asked them to identify the phosphosites present in the sample at a 1% FDR. Phosphosites were unanimously agreed upon for 79% of the peptide spectral matches; however, participants disagreed on site localization for the remaining 21% of the data set (reviewed in [248]). In a more recent study, Kuster and colleagues created a library of >57,000 synthetic phosphopeptides with known sequences and phosphosites and used it to compare the accuracy and sensitivity of three site localization tools: PhosphoRS, PTM Score, and Mascot Delta Score [32]. Because the sequences and sites were known, the authors were able to calculate a true false localization rate (FLR) for the data. PhosphoRS and PTM Score provided a larger number of correctly localized phosphorylated peptides than Mascot Delta Score at a true FDR of 1% and FLR of 1%. However, each of the three tools underestimated the FLR at a given probability. Importantly, the results of these algorithms were highly complementary, in that each provided a different subset of correctly localized peptides. This observation suggests that using several localization tools on the same data set may increase the number of correctly identified phosphosites and that further optimization of these algorithms is needed.

A further complication in assigning phosphosites arises when phosphopeptides are quantified across biological replicates. A phosphorylated peptide sequence is often confidently identified in more than one replicate, yet the site is assigned with high confidence in one replicate and cannot be assigned in the others. Even within the same replicate, there may be several peptide spectral matches for the same phosphopeptide sequence, some with a confident localization and others without. Whether the analyst can assume these peptides are indeed the same, and thus combine them for quantification across and within the replicates, is a subject of debate. Other pieces of evidence, such as retention time, can sometimes be of use, but positional isomers of the same peptide often have very similar retention times and are thus indistinguishable. Some tools, such as MaxQuant [261], assume that if the confidence in localization is at least 75% in one replicate and at least 50% in the other replicates, then the peptides are likely the same isomer and should be treated as such in downstream analysis [230]. However, the phosphoproteomics community has yet to reach a consensus on how to address this problem. The reliability of site localizations is a significant issue when one considers the proliferation of proteomics databases.
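The replicate rule described above can be written as a small Python predicate. This is a sketch of the rule as stated in the text, not MaxQuant's actual implementation. The function name and the input format (one best localization probability per replicate for the same candidate site) are assumptions.

```python
def same_isoform_across_replicates(probabilities, anchor=0.75, support=0.50):
    """Decide whether replicate observations can be pooled as one isomer.

    probabilities: localization probability of the same candidate site in
    each replicate where the phosphopeptide sequence was identified.
    The rule requires at least `anchor` in one replicate and at least
    `support` in every other replicate.
    """
    if not probabilities:
        return False
    ranked = sorted(probabilities, reverse=True)
    return ranked[0] >= anchor and all(p >= support for p in ranked[1:])


if __name__ == "__main__":
    print(same_isoform_across_replicates([0.92, 0.61, 0.55]))
    print(same_isoform_across_replicates([0.92, 0.40]))
```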

The target-decoy database strategy for estimating global peptide identification FDR has proven useful for comparing different identification tools in an unbiased manner. With this approach, mass spectra are searched against both a true (target) database and a decoy database in which the same protein sequences have been either randomized or reversed. All matches to the decoy database are counted to estimate the number of false identifications within the data set at a given score cutoff [262]. However, this approach cannot be directly extended to calculate an equivalent FLR for modification site localization, because an incorrect site localization is not a random match. Rather, the incorrect and correct localizations are extremely similar. Therefore, matches to decoy sequences do not give an accurate estimate of the true localization error in a data set [248]. Tools have been developed that modify the decoy strategy to measure a global FLR more accurately. Most current tools for measuring global FLR in phosphoproteomics computationally allow the phosphate group to modify amino acids that cannot bear the modification biologically [250, 253]. A match to one of these residues is analogous to a match to a decoy database sequence, and the number of decoy matches in a global data set allows an estimation of FLR. However, choosing which residues to use as decoys can be tricky. For an accurate estimate, the decoy residues should occur with a frequency similar to that of the true sites throughout the tryptic peptide proteome. In addition, the decoy residues should be about as close to the true modified site as other potential true sites are, because a mislocalization is more likely to occur on residues that are closer to the actual modified site [248]. Incorporating these parameters is a difficult computational task, and optimizing tools for FLR estimation is an active area of investigation.
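The two estimates can be sketched as follows. The first function applies the target-decoy idea to peptide identification, and the second applies the decoy-residue idea to site localization. The decoy residue set ("P" and "E"), the `frequency_ratio` correction, and the function names are hypothetical choices made for illustration. Published tools select and weight decoy residues in their own ways.

```python
def target_decoy_fdr(psms, cutoff):
    """Estimate identification FDR at a score cutoff.

    psms: list of (score, is_decoy) pairs from a target plus decoy search.
    Decoy hits above the cutoff estimate the number of false target hits.
    """
    targets = sum(1 for score, is_decoy in psms if score >= cutoff and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= cutoff and is_decoy)
    return decoys / targets if targets else 0.0


def decoy_residue_flr(localized_residues, decoy_residues="PE", frequency_ratio=1.0):
    """Estimate the false localization rate from decoy-residue hits.

    localized_residues: the residue letter each phosphate was localized to,
    from a search that also allowed phosphate on the decoy residues.
    frequency_ratio: assumed ratio of true candidate sites to decoy
    candidate sites, used to rescale the decoy count when the two are
    not equally frequent.
    """
    decoy_hits = sum(1 for r in localized_residues if r in decoy_residues)
    target_hits = len(localized_residues) - decoy_hits
    if target_hits == 0:
        return 0.0
    return min(1.0, decoy_hits * frequency_ratio / target_hits)


if __name__ == "__main__":
    psms = [(50, False), (45, False), (40, True), (30, False), (20, True)]
    print(target_decoy_fdr(psms, cutoff=35))
    print(decoy_residue_flr(list("SSTSYSTSSP")))
```

The sketch does not model proximity to the true site, which the text identifies as a second requirement for a well-chosen decoy residue.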

Over the last decade, many public databases have been created to store phosphosite information [139]. Phospho.ELM [263] and PhosphoSitePlus [29] are two databases that originally focused solely on phosphopeptides but have expanded to include other PTMs. Phospho.ELM contains more than 42,000 phosphosites, 90% of which were identified from high-throughput proteomics experiments. The vast majority of these sites were obtained from human and mouse samples [263]. Each phosphosite is reported along with a conservation score, as well as an accessibility score based either on the crystal structure of the protein or as predicted by the RealSpine accessibility algorithm.

PhosphoSitePlus, a large database of PTMs initiated in 2003, has since grown to encompass data from thousands of data sets and includes unpublished results generated from in-house data. As of the end of 2014, the site includes over 330,000 nonredundant PTMs from a large cohort of model eukaryotic organisms [29], including over 240,000 nonredundant phosphosites.

Several other, more general databases are also useful resources for phosphosites. The Human Protein Reference Database (HPRD), a curated source of proteomics data from human samples, includes phosphosite information obtained from human phosphoproteomics experiments [264, 265]. Other organism-specific phosphosite databases include PhosPhAt for Arabidopsis thaliana [266] and PhosphoPep for Drosophila melanogaster, Caenorhabditis elegans, and S. cerevisiae [267]. The Universal Protein Resource (UniProt) is a more species-comprehensive source of protein sequence and annotation data [268]. UniProt compiles data from a large variety of sources and curates them into a single knowledgebase, which includes known PTMs such as phosphorylation. Other PTM databases often encompass phosphorylation as well. These include PHOSIDA, which links extensive peptide information to the modification sites [269]; SysPTM, which compiles data on over 50 different PTMs [270]; and dbPTM, which integrates large public databases with modification site data mined from the literature [271].

Although these databases can be rich resources for hypothesis generation, the quality of the data being analyzed must be carefully considered. The vast majority of the phosphosites in these databases come from published reports of large-scale phosphoproteomics experiments, with the assumption that authors have correctly identified the sites of modification. However, much of the data were generated prior to any guidelines from journals to assess site localization, and many of these studies did not address localization at all [248]. More recently collected data can also vary in quality depending on the tool used to measure localization reliability. Researchers who use the information in phosphopeptide databases should remain aware of the degree of uncertainty inherent to these resources. It has been suggested that researchers who produce and analyze the raw data should lean toward a conservative estimate of identification/localization reliability [248].

Cataloging phosphosites may provide some insight into how a protein or set of proteins might be regulated by PTM, but quantitative phosphoproteomics experiments are often used to gain a deeper mechanistic understanding of the biological pathways that differentiate two or more cell states. However, analysis of the information produced from these experiments without incorporation of other tools is rarely enlightening. Because a single phosphoproteomics experiment can quantify only a small subset of all phosphosites, a full understanding of the biological pathways involved requires the integration of several types of data to build phosphorylation networks. There are two general, non-mutually exclusive approaches to developing phosphorylation-mediated signaling networks [272]. The first focuses solely on the protein level (Figure 2.9a). All phosphoproteins that are found to contain a regulated phosphosite are mapped to known pathways or protein-protein interaction networks to identify modules that are involved in a biological process of interest. Resources such as Ingenuity Pathway Analysis (www.ingenuity.com), PANTHER [273], and KEGG [274] provide databases and algorithms to measure enrichment of given pathways in data sets. STRING can be used to identify both known and predicted interaction partners of the phosphoproteins of interest [275]. Integration of these tools can help provide a more complete understanding of the signaling pathways implicated by quantitative phosphoproteomics experiments.

Figure 2.9 Constructing networks from global phosphoproteomics data. Significantly modulated phosphopeptides as measured from a phosphoproteomics data set are first mapped to protein sequences. (a) These protein sequences can then be directly mapped onto known signaling pathways or protein-protein interaction maps. Protein complexes or pathways that are enriched in these maps potentially participate in the response to the stimulus under study. (b) The amino acid sequences around the phosphorylation sites are used to predict active kinases that preferentially phosphorylate these sequence motifs. This allows for the construction of kinase-substrate phosphorylation networks. These maps enable more connections to be made between regulated phosphosites by linking them with predicted kinase activities that are modulated. Note that not all phosphorylation sites can be matched with predicted regulatory kinases, but these sites can still be mapped to proteins and included in the final network maps. These provide the basis for hypothesis generation and further biological study.
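Pathway enrichment at the protein level is commonly assessed with an over-representation test. The Python sketch below uses a one-sided hypergeometric test for that purpose. It is a generic illustration under stated assumptions, not the specific algorithm used by Ingenuity Pathway Analysis, PANTHER, or KEGG-based tools, and the pathway names and protein identifiers in the example are toy placeholders.

```python
from math import comb


def hypergeom_enrichment_p(population, pathway_size, sample_size, overlap):
    """P(X >= overlap) under the hypergeometric distribution.

    population: number of proteins in the background (e.g., all quantified)
    pathway_size: background proteins annotated to the pathway
    sample_size: proteins carrying a regulated phosphosite
    overlap: regulated proteins that are annotated to the pathway
    """
    upper = min(pathway_size, sample_size)
    total = comb(population, sample_size)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, sample_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def rank_pathways(regulated, pathways, population):
    """Return (pathway, overlap, p_value) tuples sorted by p-value."""
    results = []
    for name, members in pathways.items():
        overlap = len(regulated & members)
        p = hypergeom_enrichment_p(population, len(members), len(regulated), overlap)
        results.append((name, overlap, p))
    return sorted(results, key=lambda r: r[2])


if __name__ == "__main__":
    # Toy data: identifiers and pathway names are placeholders.
    regulated = {"P1", "P2", "P3"}
    pathways = {"pathway_A": {"P1", "P2", "P3", "P4"},
                "pathway_B": {"P7", "P8", "P9", "P10"}}
    for row in rank_pathways(regulated, pathways, population=100):
        print(row)
```

In practice the resulting p-values would also need correction for the number of pathways tested.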

These bioinformatics tools, although extremely useful, do not use key data obtained from phosphoproteomics experiments: the phosphorylation sites themselves (Figure 2.9b). A growing list of tools creates kinase-substrate networks by incorporating information on kinases and phosphatases that are known to act upon phosphosites identified in phosphoproteomics studies. The first step is to identify sequence motifs enriched in the phosphopeptide data that are known kinase substrates, which can provide information on modulation of kinase activity. Software programs like Scansite [276], GPS [277], and NetPhosK [278] are used to predict kinase-specific phosphorylation sites for hundreds of kinases and to identify sequence motifs that are preferred substrates. Using these tools, one can infer whether various kinases are active or repressed based on sequence motifs that are enriched in a data set of modulated phosphosites. Moreover, tools like NetworKIN [279], KinomeXplorer [280], and iGPS [281] take other contextual factors, such as colocalization data and direct or indirect protein-protein interaction data, into account to make connections between various active kinase modules and create large networks of potential signaling pathways, allowing the generation of testable hypotheses. In addition, phosphopeptide databases like PhosphoSitePlus, Phospho.ELM, and HPRD all incorporate tools that search for motifs and identify kinase-substrate relationships.
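The motif-matching step can be illustrated with a deliberately simplified Python sketch. It extracts a sequence window centered on each phosphosite and tests it against three coarse consensus classes. The regular expressions, the window size of ±7 residues, and the function names are assumptions made for illustration. Scansite, GPS, and NetPhosK use position-specific scoring models or trained predictors rather than fixed patterns like these.

```python
import re
from collections import Counter

FLANK = 7  # residues kept on each side of the phosphosite (assumed)

# Coarse motif classes; the phosphosite sits at index FLANK of the window.
MOTIFS = {
    "proline-directed": re.compile(r".{7}[ST]P"),       # S/T followed by P
    "basophilic": re.compile(r".{4}R.{2}[ST]"),         # R at position -3
    "acidophilic": re.compile(r".{7}[ST].{2}[DE]"),     # D/E at position +3
}


def site_window(protein_seq, pos, flank=FLANK):
    """Return the window around 0-based position pos, padded with '_'."""
    padded = "_" * flank + protein_seq + "_" * flank
    return padded[pos : pos + 2 * flank + 1]


def classify(window):
    """Return the names of all motif classes the window matches."""
    return [name for name, pattern in MOTIFS.items() if pattern.match(window)]


def motif_counts(windows):
    """Count motif classes across a set of regulated phosphosite windows."""
    counts = Counter()
    for window in windows:
        counts.update(classify(window))
    return counts


if __name__ == "__main__":
    windows = ["AAAARAASPAAAAAA", site_window("MSPK", 1), "AAAAAAATAAEAAAA"]
    print(motif_counts(windows))
```

Comparing such counts between regulated sites and a background set of sites, for example with an over-representation test, is one way to infer which kinase classes may be modulated.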

An important point to remember when considering these tools is that they rely heavily on prior knowledge, and so their ability to reveal less studied pathways or novel kinase-substrate interactions is limited. Because a majority of the literature focuses on a small subset of kinases and pathways, there may be overrepresentation of certain pathway components and masking of unexpected relationships that may be important for a given cell state. Tools such as SELPHI are attempting to ameliorate this problem by identifying relationships between kinases and substrates in unbiased ways [282]. Nevertheless, a researcher using these tools should still be aware of the biases that may result from using software that relies on previous knowledge to generate its reports.

There is also a growing interest in integrating phosphoproteomics data with genomic and transcriptomic data to determine how mutations in or aberrant activity of certain genes and proteins can lead to changes in phosphorylation cascades, which in turn affect gene expression. This integrative approach can sometimes be difficult because the technologies are so different, there is often inconsistency between experiments, and samples are rarely matched to the same individual [139]. Recent studies have been undertaken to determine how disease-causing genomic mutations lead to dysregulation of pathways and large-scale changes in gene expression. One recent study integrated phosphoproteomics data from various sources with cancer genomic data from The Cancer Genome Atlas (TCGA) to identify phosphosites that are frequently mutated in cancer. Of over 87,000 phosphosites, 150 were identified as frequently mutated in known or candidate cancer-driving genes [283]. Another study integrated transcriptomic, proteomics, and phosphoproteomics data from 13 non-small cell lung cancer cell lines to identify driver pathways specifically active in KRAS-dependent lines [284]. These examples illustrate how multiple methodologies and tools can be integrated to build comprehensive models of molecular signaling networks.

Despite many innovations in instrumentation and data analysis tools, phosphoproteomics experiments still suffer from incomplete coverage and a certain amount of irreproducibility across platforms [272]. In addition, single phosphopeptide measurements are insufficient for a full understanding of biological systems. Taken together, phosphorylation network signatures will serve as more informative and reproducible measures of biological system perturbations when integrated with other data types.
