We used four previously described data sets to test our algorithms [17]. The first is a yeast data set containing 69,705 target PSMs and twice that number of decoy PSMs. These data were acquired from a tryptic digest of an unfractionated yeast lysate and analyzed using a four-hour reverse phase separation. Throughout this work, peptides were assigned to spectra using SEQUEST with no enzyme specificity and with no amino acid modifications enabled. The next two data sets were derived from the same yeast lysate but digested with different proteolytic enzymes: elastase and chymotrypsin. These data sets contain 57,860 and 60,217 target PSMs, respectively, and in each case twice that number of decoy PSMs. The final data set was derived from a C. elegans lysate digested with trypsin and processed analogously to the yeast data sets.
Each PSM was represented using the 17 features listed in Table 1. Note that Percolator originally used 20 features. In this work, we removed the three features that exploit protein-level information, because methods that use this type of information are difficult to validate accurately via decoy database search. We also defined 20 additional features for each peptide, also listed in Table 1, corresponding to the counts of each amino acid in the peptide. Adding these features yields a feature vector of length 37.
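
As an illustration of how the 37-dimensional representation could be assembled, the following minimal sketch appends 20 amino acid counts to a 17-element base feature vector. The function names, the alphabetical ordering of the amino acids, and the placeholder base features are assumptions made for this example; the actual features and their order are those given in Table 1.

```python
from collections import Counter

# The 20 standard amino acids as one-letter codes.
# Alphabetical order is an assumption; Table 1 defines the real ordering.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

N_BASE_FEATURES = 17  # PSM features retained after dropping protein-level ones


def amino_acid_counts(peptide):
    """Return 20 counts, one per standard amino acid, for a peptide sequence."""
    counts = Counter(peptide.upper())
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


def extend_features(base_features, peptide):
    """Append amino acid counts to the base PSM features (hypothetical helper)."""
    if len(base_features) != N_BASE_FEATURES:
        raise ValueError(f"expected {N_BASE_FEATURES} base features")
    return list(base_features) + amino_acid_counts(peptide)


if __name__ == "__main__":
    # Placeholder base features; real values would come from the database search.
    base = [0.0] * N_BASE_FEATURES
    vector = extend_features(base, "PEPTIDE")
    print(len(vector))
```

Residues outside the 20 standard codes are ignored by this sketch, which is a simplification chosen for brevity.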