To functionally annotate proteins regulated by ubiquitination, we downloaded a set of 5,884 verified ORFs (5,817 sequences of length ≥50) from the SGD website and applied UbPred. A major challenge in finding the proteins most likely to be ubiquitinated is the possibility that a direct application of UbPred to any proteome would favor longer proteins, as a consequence of prediction accuracy below 100%. Thus, to extract a set of proteins with the strongest predictions, we proceeded as follows.
First, a threshold t was determined such that only 100·p% of all prediction scores over all proteins were greater than t. For a sufficiently high t or, equivalently, a sufficiently low p, such scores can be considered strong predictions of ubiquitination, which is supported by the low false positive rate in the bottom left-hand corner of the estimated ROC curve. Then, as a reasonable simplifying assumption, we introduced a null model in which a randomly selected lysine from any protein had a 100·p% chance of receiving a strong prediction. Under this model, the expected number of strong predictions (scores above threshold t) in each protein is proportional to the number of lysines it contains. Therefore, under the null model, the probability P that a protein containing K lysines has k or more strong predictions by chance can be expressed as
$$P = \sum_{i=k}^{K} \binom{K}{i} \, p^{i} (1-p)^{K-i},$$
where p is the probability that a randomly selected lysine receives a strong prediction of ubiquitination. Proteins with the lowest P-values are thus the most likely to contain disproportionately more strong predictions than expected by chance. We considered these proteins to be the most strongly ubiquitinated (i.e., over-ubiquitinated). The potential length dependence was thereby eliminated, since the P-value accounts for the number of lysines in each protein. We selected p = 0.1 and extracted all proteins with P < 0.05 after Bonferroni correction. In addition, since consecutive lysines may not be mutually independent (possibly invalidating the null model assumptions), we note that selecting smaller samples of lysines from each protein did not significantly influence the results reported herein.
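The selection procedure described above can be illustrated with a minimal Python sketch. It is not the original implementation: the function names (`score_threshold`, `binomial_tail`, `over_ubiquitinated`) and the input layout (a dictionary mapping each protein to the list of prediction scores of its lysines) are hypothetical, and the Bonferroni correction is assumed here to use the number of proteins tested as the number of comparisons.

```python
from math import exp, lgamma, log


def score_threshold(all_scores, p):
    """Return t such that at most a fraction p of all scores exceed t."""
    ranked = sorted(all_scores, reverse=True)
    n_strong = int(p * len(ranked))  # number of scores allowed above t
    if n_strong >= len(ranked):
        return float("-inf")
    return ranked[n_strong]


def binomial_tail(k, K, p):
    """P(X >= k) for X ~ Binomial(K, p), with 0 < p < 1.

    Terms are computed in log space so that large K does not overflow.
    """
    if k <= 0:
        return 1.0
    log_p, log_q = log(p), log(1.0 - p)
    total = 0.0
    for i in range(k, K + 1):
        log_term = (lgamma(K + 1) - lgamma(i + 1) - lgamma(K - i + 1)
                    + i * log_p + (K - i) * log_q)
        total += exp(log_term)
    return min(total, 1.0)


def over_ubiquitinated(scores_by_protein, p=0.1, alpha=0.05):
    """Return {protein: P-value} for proteins passing the corrected cutoff.

    scores_by_protein (hypothetical layout): protein name -> list of
    per-lysine prediction scores.
    """
    all_scores = [s for scores in scores_by_protein.values() for s in scores]
    t = score_threshold(all_scores, p)
    n_tests = len(scores_by_protein)  # assumed Bonferroni factor
    selected = {}
    for name, scores in scores_by_protein.items():
        K = len(scores)                      # lysines in the protein
        k = sum(1 for s in scores if s > t)  # strong predictions
        p_value = binomial_tail(k, K, p)
        if p_value * n_tests < alpha:        # Bonferroni-corrected test
            selected[name] = p_value
    return selected
```

Because the P-value is conditioned on K, a long protein with many lysines is selected only if its count of strong predictions is high relative to K, not merely high in absolute terms. If SciPy is available, `scipy.stats.binom.sf(k - 1, K, p)` computes the same tail probability.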