Computational Prediction of Phosphorylation Sites

PhosphoBase (6) consists of 1883 experimentally verified phosphorylation sites within 597 protein entries; the numbers of serine, threonine and tyrosine sites are 984, 246 and 653, respectively. Swiss-Prot (7) (release 45 of October 2004) maintains 163 500 protein entries, of which 3614 have phosphorylation annotation; among these entries, the numbers of serine, threonine and tyrosine sites are 1005, 281 and 321, respectively. Generally, the serine, threonine and tyrosine residues that are not annotated as phosphorylated within experimentally validated phosphorylated proteins are selected as the negative set, i.e. the non-phosphorylated sites. Two negative (non-phosphorylated) datasets were therefore obtained from PhosphoBase and Swiss-Prot based on the phosphorylation annotation. Because no good negative dataset exists for non-phosphorylated sites, the residues that had not previously been annotated as phosphorylated in phosphorylation-annotated proteins were chosen to reflect more general non-phosphorylated sites. Supplementary Table S1 summarizes the statistics of the kinase-specific phosphorylated sites used for learning models in the proposed application. This work confirms the existence of two major groups of protein kinases, which phosphorylate either serine/threonine residues or tyrosine residues.
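
The construction of positive and negative sets described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_sites`, the 1-based position convention and the toy sequence are all assumptions made for the example.

```python
# Residues that can be phosphorylated in this setting.
PHOSPHO_RESIDUES = {"S", "T", "Y"}

def split_sites(sequence, annotated_positions):
    """Return (positives, negatives) as lists of (1-based position, residue).

    Positives are S/T/Y residues annotated as phosphorylated; negatives are
    the remaining S/T/Y residues of the same protein.
    """
    annotated = set(annotated_positions)
    positives, negatives = [], []
    for pos, residue in enumerate(sequence, start=1):
        if residue not in PHOSPHO_RESIDUES:
            continue
        if pos in annotated:
            positives.append((pos, residue))
        else:
            negatives.append((pos, residue))
    return positives, negatives

# Toy, made-up sequence and annotation (not taken from PhosphoBase or Swiss-Prot).
positives, negatives = split_sites("MKRSASLTYPE", [4, 9])
```

Under this scheme a negative site is only "not annotated as phosphorylated", so some negatives may be real but undiscovered phosphorylation sites.
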
Figure 1 depicts a flowchart of the proposed method. Phosphorylated sites were first extracted as positive sets, non-phosphorylated sites were extracted as negative sets, and the catalytic kinase annotations were obtained from PhosphoBase and Swiss-Prot. The positive sets were then categorized by catalytic kinase. Alternatively, in larger positive groups, the sequences of the phosphorylated sites can be clustered into subgroups by maximal dependence decomposition (MDD) (8). MDD was first applied to nucleotide sequences and is a recursive process that divides a sequence set into tree-like subgroups based on the positional dependency of the sequences. Here, we applied MDD to group protein phosphorylation substrates into subgroups. In the example given in Figure 1, 232 serine phosphorylation substrates are grouped into subgroups. When MDD is applied to cluster the sequences of a positive set, one parameter, the minimum-cluster-size, must be set. If the size of a subgroup is less than the minimum-cluster-size, the subgroup is not divided further. The MDD process terminates when all the subgroup sizes are less than the minimum-cluster-size.
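
The recursive splitting can be illustrated with a simplified sketch. It assumes aligned, equal-length site windows, scores each position by the summed chi-square statistic between "matches the consensus at position i" and the residue at every other position j, and splits on the highest-scoring position. The function names, the use of raw residues without amino-acid grouping, and the extra stop when no dependence is found are assumptions of this sketch, not details confirmed by the text.

```python
from collections import Counter

def consensus_residue(seqs, i):
    """Most frequent residue at position i (ties go to the first one seen)."""
    return Counter(s[i] for s in seqs).most_common(1)[0][0]

def chi_square(seqs, i, j):
    """Chi-square statistic between 'matches consensus at i' and residue at j."""
    consensus = consensus_residue(seqs, i)
    table = Counter((s[i] == consensus, s[j]) for s in seqs)
    row_totals, col_totals = Counter(), Counter()
    for (row, col), n in table.items():
        row_totals[row] += n
        col_totals[col] += n
    total = len(seqs)
    stat = 0.0
    for row in row_totals:
        for col in col_totals:
            expected = row_totals[row] * col_totals[col] / total
            stat += (table[(row, col)] - expected) ** 2 / expected
    return stat

def mdd(seqs, min_cluster_size):
    """Return the leaf subgroups produced by recursive splitting."""
    # A subgroup smaller than the minimum-cluster-size is not divided further.
    if len(seqs) < min_cluster_size:
        return [seqs]
    length = len(seqs[0])
    scores = [
        sum(chi_square(seqs, i, j) for j in range(length) if j != i)
        for i in range(length)
    ]
    best = max(range(length), key=scores.__getitem__)
    consensus = consensus_residue(seqs, best)
    with_c = [s for s in seqs if s[best] == consensus]
    without_c = [s for s in seqs if s[best] != consensus]
    # Assumed extra stop: no positional dependence, or nothing to split off.
    if scores[best] == 0 or not without_c:
        return [seqs]
    return mdd(with_c, min_cluster_size) + mdd(without_c, min_cluster_size)

# Toy, made-up three-residue windows.
groups = mdd(["RRS", "RRS", "KKS", "KKS"], min_cluster_size=3)
```

In this toy run the four windows split on the first position into two subgroups of two, each below the minimum-cluster-size and therefore left undivided.
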
Thereupon, the concept of the profile HMM was adopted to learn computational models from the positive sets of phosphorylation sites. To evaluate the learned models, k-fold cross-validation and leave-one-out cross-validation were performed. After evaluation, the model with the highest accuracy for each dataset was chosen.
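
As an illustration of the evaluation scheme, the sketch below generates k-fold splits, with leave-one-out as the special case where k equals the number of sites. The helper name `k_fold_splits` and the interleaved fold assignment are assumptions of this example; the text does not say how folds were drawn.

```python
def k_fold_splits(items, k):
    """Yield (training, held_out) pairs for k-fold cross-validation."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    # Interleaved assignment: item i goes to fold i mod k.
    folds = [items[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, held_out

# Toy site identifiers (placeholders, not real data).
sites = ["a", "b", "c", "d", "e", "f"]
three_fold = list(k_fold_splits(sites, 3))
leave_one_out = list(k_fold_splits(sites, len(sites)))
```

Each model would be trained on `training` and scored on `held_out`, and the per-fold results combined into one accuracy figure per model.
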
For each kinase-specific positive set of phosphorylated sites, the best-performing model is selected and used to identify the phosphorylation sites within the input protein sequences by HMMsearch (9). For each hit of a model, HMMER returns both a HMMER bit score and an expectation value (E-value). The HMMER bit score is used as the criterion to define a HMM match. A search of a model with a HMMER bit score greater than the threshold t is defined as a positive prediction, i.e. the HMM recognizes a phosphorylation site. The threshold t of each model is decided by maximizing the accuracy measure over a variety of cross-validations, with HMM bit score values ranging from 0 to −10. For example, Supplementary Figure S1 depicts the optimization of the threshold of the HMM bit scores in the S_PKA model. The threshold of the S_PKA model is set to −4.5 to maximize the accuracy measure of the model.
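
A minimal sketch of the threshold scan follows. It assumes that accuracy is the fraction of correctly classified sites, (TP + TN) / total, and that candidate thresholds are scanned from 0 to -10 in steps of 0.5; neither the exact accuracy formula nor the step size is given in the text. The bit scores are made-up values.

```python
def accuracy(pos_scores, neg_scores, t):
    """Fraction of sites classified correctly when 'score > t' means positive."""
    tp = sum(s > t for s in pos_scores)
    tn = sum(s <= t for s in neg_scores)
    return (tp + tn) / (len(pos_scores) + len(neg_scores))

def best_threshold(pos_scores, neg_scores, start=0.0, stop=-10.0, step=0.5):
    """Scan thresholds from start down to stop; return the most accurate one.

    On ties, the highest (first scanned) threshold is returned.
    """
    n_steps = int(round((start - stop) / step))
    candidates = [start - i * step for i in range(n_steps + 1)]
    return max(candidates, key=lambda t: accuracy(pos_scores, neg_scores, t))

# Toy bit scores for held-out positive and negative sites (made up).
pos_scores = [1.5, -1.0, -2.5]
neg_scores = [-3.2, -5.0, -8.0]
t = best_threshold(pos_scores, neg_scores)
```

With these toy scores, only the threshold -3.0 separates every positive from every negative, so it is the one selected.
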
When considering a MDD-clustered dataset, for example MDD-clustered PKA catalytic serine (S_PKA), the HMMs are trained separately on the subgroups of phosphorylated sites resulting from MDD. Each model is used to search the given protein sequences for phosphorylated sites. A model group makes a positive prediction when at least one of its models makes a positive prediction, and a negative prediction when all of its models make negative predictions.
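
The group decision rule amounts to a logical OR over the subgroup models, as in the sketch below. The model names, scores and thresholds are hypothetical placeholders chosen for illustration.

```python
def group_predicts_site(scores_by_model, thresholds):
    """True if at least one subgroup model scores above its own threshold."""
    return any(scores_by_model[name] > thresholds[name] for name in thresholds)

# Hypothetical subgroup models of one MDD-clustered group, with made-up values.
thresholds = {"sub1": -4.5, "sub2": -3.0, "sub3": -6.0}
hit = group_predicts_site({"sub1": -5.0, "sub2": -2.0, "sub3": -7.5}, thresholds)
miss = group_predicts_site({"sub1": -5.0, "sub2": -3.5, "sub3": -7.5}, thresholds)
```

Here `hit` is positive because the second model alone exceeds its threshold, while `miss` is negative because every model falls at or below its threshold.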


Huang H.D., Lee T.Y., Tzeng S.W, & Horng J.T. (2005). KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites. Nucleic Acids Research, 33(Web Server issue), W226-W229.

Publication 2005

Based sequences Catalytic Hmms Kinase Nucleotides Phosphorylation Protein Protein kinases Protein sequences Reflection Serine Terminates Threonine Tree Tyrosine

Corresponding Organization : National Central University

Other organizations : National Yang Ming Chiao Tung University

Protocol cited in 34 other protocols

Variable analysis

independent variables

Phosphorylation sites
Catalytic kinase annotations

dependent variables

Kinase-specific phosphorylated sites used for learning models
Accuracy of the learned models

control variables

Non-phosphorylated sites (serine, threonine and tyrosine) within the experimentally validated phosphorylated proteins
Minimum-cluster-size parameter for the Maximal Dependence Decomposition (MDD) process
Threshold 't' for defining a positive prediction based on the HMMER bit score
