PhosphoBase (6) consists of 1883 experimentally verified phosphorylation sites within 597 protein entries. The numbers of serine, threonine and tyrosine sites are 984, 246 and 653, respectively. Swiss-Prot (7) (release 45 of October 2004) maintains 163 500 protein entries, of which 3614 have phosphorylation annotation. Among these entries, the numbers of serine, threonine and tyrosine sites are 1005, 281 and 321, respectively. Because no good negative dataset exists for non-phosphorylated sites, the serine, threonine and tyrosine residues that are not annotated as phosphorylated within the experimentally validated phosphorylated proteins were selected as negative sets, i.e. the non-phosphorylated sites, as a reflection of more general non-phosphorylated sites. Two negative (non-phosphorylated) datasets were therefore obtained from PhosphoBase and Swiss-Prot based on the phosphorylation annotation. Supplementary Table S1 summarizes the statistics of the kinase-specific phosphorylated sites used to learn the models in the proposed application. This work confirms the existence of two major classes of protein kinases, which phosphorylate either serine/threonine residues or tyrosine residues.
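The selection rule for positive and negative sites can be illustrated with a minimal Python sketch. It assumes each protein is available as a sequence string together with a set of 0-based positions annotated as phosphorylated; the function name `split_sites` and the toy sequence are hypothetical and are not taken from PhosphoBase or Swiss-Prot.

```python
from typing import List, Set, Tuple


def split_sites(sequence: str, annotated: Set[int],
                residues: str = "STY") -> Tuple[List[int], List[int]]:
    """Split S/T/Y positions into annotated (positive) and
    non-annotated (negative) sites of one phosphorylated protein."""
    positive: List[int] = []
    negative: List[int] = []
    for pos, aa in enumerate(sequence):
        if aa not in residues:
            continue  # only serine, threonine and tyrosine are candidates
        if pos in annotated:
            positive.append(pos)
        else:
            negative.append(pos)
    return positive, negative


# Toy example (hypothetical sequence; position 3 is the annotated site).
toy_positive, toy_negative = split_sites("MKRSASTPYLSE", {3})
print(toy_positive, toy_negative)
```

Applying the function to every annotated protein in a database and pooling the results would yield one positive and one negative set per database.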
Figure 1 depicts a flowchart of the proposed method. Phosphorylated sites were first extracted as positive sets, non-phosphorylated sites were extracted as negative sets, and the catalytic kinase annotations were obtained from PhosphoBase and Swiss-Prot. The positive sets were then categorized by catalytic kinase. Alternatively, for larger positive groups, the sequences of the phosphorylated sites can be clustered into subgroups by maximal dependence decomposition (MDD) (8). MDD was first applied to nucleotide sequences and is a recursive process that divides a sequence set into tree-like subgroups based on the positional dependency of the sequences. Here, we applied MDD to group protein phosphorylation substrates into subgroups. In the example given in Figure 1, 232 serine phosphorylation substrates are grouped into subgroups. When MDD is applied to cluster the sequences of a positive set, one parameter, the minimum-cluster-size, must be set. If the size of a subgroup is less than the minimum-cluster-size, the subgroup is not divided further. The MDD process terminates when all subgroup sizes are less than the minimum-cluster-size.
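The recursive division and its stopping rule can be sketched as follows. This is a simplified, MDD-style illustration rather than the procedure used in this work: it assumes equal-length site windows, scores positional dependence with a plain chi-square statistic without a significance test or amino acid grouping, and splits on the most frequent residue at the most dependent position. The function names and the toy windows are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import List


def chi_square(seqs: List[str], i: int, j: int) -> float:
    """Chi-square statistic for the dependence between positions i and j."""
    n = len(seqs)
    count_i = Counter(s[i] for s in seqs)
    count_j = Counter(s[j] for s in seqs)
    count_ij = Counter((s[i], s[j]) for s in seqs)
    stat = 0.0
    for a, b in product(count_i, count_j):
        expected = count_i[a] * count_j[b] / n
        stat += (count_ij[(a, b)] - expected) ** 2 / expected
    return stat


def mdd(seqs: List[str], min_cluster_size: int) -> List[List[str]]:
    """Recursively divide seqs; return the leaf subgroups."""
    if len(seqs) < min_cluster_size:
        return [seqs]  # too small: do not divide further
    length = len(seqs[0])
    # Total dependence of each position on all other positions.
    scores = [sum(chi_square(seqs, i, j) for j in range(length) if j != i)
              for i in range(length)]
    best = max(range(length), key=scores.__getitem__)
    consensus = Counter(s[best] for s in seqs).most_common(1)[0][0]
    with_consensus = [s for s in seqs if s[best] == consensus]
    without_consensus = [s for s in seqs if s[best] != consensus]
    if scores[best] == 0.0 or not with_consensus or not without_consensus:
        return [seqs]  # no dependence left to exploit
    return (mdd(with_consensus, min_cluster_size)
            + mdd(without_consensus, min_cluster_size))


# Toy windows (hypothetical): positions 0 and 1 vary together.
toy_windows = ["RRS"] * 4 + ["KKS"] * 3
toy_clusters = mdd(toy_windows, min_cluster_size=5)
print([len(c) for c in toy_clusters])
```

Each recursive call receives a strictly smaller set, so the sketch always terminates; every returned leaf is either smaller than the minimum-cluster-size or cannot be split any further.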
The concept of the profile HMM was then adopted to learn computational models from the positive sets of phosphorylation sites. To evaluate the learned models, k-fold cross-validation and leave-one-out cross-validation were performed. After evaluation, the model with the highest accuracy for each dataset was chosen.
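The partitioning behind both evaluation schemes can be sketched in Python. The sketch covers only the index bookkeeping and assumes that model training and scoring happen elsewhere; the function name `k_fold_indices` and the round-robin fold assignment are illustrative choices. Leave-one-out cross-validation is the special case in which k equals the number of sites.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, held_out) index lists for k-fold cross-validation
    over n items; k == n gives leave-one-out cross-validation."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    # Round-robin assignment keeps fold sizes within one item of each other.
    folds = [list(range(start, n, k)) for start in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        yield train, held_out


# Usage: 5-fold split of 10 sites, then leave-one-out over 4 sites.
five_fold = list(k_fold_indices(10, 5))
leave_one_out = list(k_fold_indices(4, 4))
print(five_fold[0])
```

In practice the sites would usually be shuffled before the split so that each fold is a random sample.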
For each kinase-specific positive set of phosphorylated sites, the best-performing model is selected and used to identify phosphorylation sites within the input protein sequences by HMMsearch (9). For each hit of a model, HMMER returns both a bit score and an expectation value (E-value). We select the HMMER bit score as the criterion to define an HMM match. A hit of a model with a HMMER bit score greater than the threshold t is defined as a positive prediction, i.e. the HMM recognizes a phosphorylation site. The threshold t of each model is chosen by maximizing the accuracy measure over a variety of cross-validations, with candidate bit score thresholds ranging from 0 to −10. For example, Supplementary Figure S1 depicts the optimization of the bit score threshold for the S_PKA model. The threshold of the S_PKA model is set to −4.5 to maximize the accuracy measure of the model.
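The threshold selection can be sketched as a scan over candidate values. The sketch assumes that accuracy is the fraction of correctly classified sites, that candidates are spaced 0.5 apart between 0 and −10, and that bit scores for the held-out positive and negative sites have already been collected; the step size, the function names and the toy scores are assumptions made for illustration.

```python
from typing import List


def accuracy(pos_scores: List[float], neg_scores: List[float],
             t: float) -> float:
    """Fraction of sites classified correctly when score > t means positive."""
    true_pos = sum(s > t for s in pos_scores)
    true_neg = sum(s <= t for s in neg_scores)
    return (true_pos + true_neg) / (len(pos_scores) + len(neg_scores))


def best_threshold(pos_scores: List[float], neg_scores: List[float],
                   start: float = 0.0, stop: float = -10.0,
                   step: float = 0.5) -> float:
    """Scan thresholds from start down to stop; return the most accurate.
    Ties are resolved in favour of the threshold closest to start."""
    n_steps = int(round((start - stop) / step))
    candidates = [start - i * step for i in range(n_steps + 1)]
    return max(candidates, key=lambda t: accuracy(pos_scores, neg_scores, t))


# Toy bit scores (hypothetical, not taken from any model in this work).
toy_pos = [2.5, -1.0, -2.2, -2.9]
toy_neg = [-3.1, -5.5, -7.0, -9.1]
toy_t = best_threshold(toy_pos, toy_neg)
print(toy_t, accuracy(toy_pos, toy_neg, toy_t))
```

With cross-validation, the accuracy at each candidate threshold would be averaged over the folds before taking the maximum.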
For an MDD-clustered dataset, for example the MDD-clustered PKA catalytic serine set (S_PKA), the HMMs are trained separately on the subgroups of phosphorylated sites resulting from MDD. Each model is used to search the given protein sequences for phosphorylated sites. A model group makes a positive prediction if at least one of its models makes a positive prediction, and a negative prediction only if all of its models make negative predictions.
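The combination rule for a model group amounts to a logical OR over the subgroup models, as the following minimal Python sketch shows. It assumes that each subgroup model has its own bit score and threshold for the candidate site; the subgroup names and all numeric values are hypothetical.

```python
from typing import Dict


def group_predicts_site(scores: Dict[str, float],
                        thresholds: Dict[str, float]) -> bool:
    """Positive if at least one subgroup model scores above its threshold;
    negative only if every subgroup model is at or below its threshold."""
    return any(scores[name] > thresholds[name] for name in thresholds)


# Hypothetical thresholds for three MDD subgroup models of one kinase set.
toy_thresholds = {"sub1": -4.5, "sub2": -3.0, "sub3": -6.0}

hit = group_predicts_site(
    {"sub1": -7.2, "sub2": -1.4, "sub3": -8.0}, toy_thresholds)
miss = group_predicts_site(
    {"sub1": -7.2, "sub2": -3.5, "sub3": -8.0}, toy_thresholds)
print(hit, miss)
```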