We judiciously added publicly available data to PeptideAtlas for the specific purpose of increasing the number of protein identifications. Our aim was to obtain a large number of new identifications by adding a moderate amount of data. First, we added two large plasma datasets and one large cell line dataset that had recently been contributed. We then looked for promising data in the GPMDB using two strategies: (a) reviewing Datasets of the Week, which tend to be high-quality datasets, and selecting those that were of very high quality, used new MS technology, had low-complexity samples due to a filtering method, or used cell types or tissue types not yet in PeptideAtlas; and (b) using an automated process to select GPMDB datasets containing many higher-confidence identifications for proteins not yet in PeptideAtlas. We also considered all articles published in Molecular and Cellular Proteomics that referenced the Tranche data repository [16] in the main text, and selected from these datasets using the same criteria we used for GPM Datasets of the Week.
We selected a total of 27 datasets and were able to obtain 17 in full or almost in full (four from the authors, two from PRIDE, and 11 from Tranche) and four in large part (from Tranche). The remaining six datasets had been deposited in Tranche but could not be retrieved after multiple attempts, emphasizing the need for a stably funded, publicly accessible repository for raw mass spectrometry data. One of the 17 full datasets was available only in Scaffold (Proteome Software) format and was not usable in our pipeline. Of the 20 fully or partially downloaded datasets, 17 could be processed fully or partially using X!Tandem [17] + K-score [18]. These, along with the two large plasma datasets and the large cell line dataset, were added to the Human PeptideAtlas. All twenty are listed in Table S1, Supporting Information.
Among the added datasets were several that were expected to provide coverage of protein categories shown to be under-represented in PeptideAtlas by Gene Ontology analysis (data not shown), including samples of vitreous humor to increase coverage of proteins of sensory perception, seminal plasma to increase coverage of proteins of the reproductive system, a dataset identifying new integral membrane proteins, and experiments targeting signaling proteins. Other datasets were selected to cover additional sample types not previously included in PeptideAtlas (e.g. mitotic spindle, nucleosome, and colorectal tissue).
These datasets, along with all the datasets we had included in the previous build, were processed through the latest PeptideAtlas build pipeline [19] to produce a final protein set with an FDR close to 1%. Briefly, all datasets were searched against a target-decoy sequence database consisting of the International Protein Index (IPI) database [20] and the cRAP common contaminants (www.thegpm.org/crap), plus one decoy sequence for each target entry. Results were processed using the Trans-Proteomic Pipeline [10]. Identified peptides were mapped to a protein sequence database that included IPI v3.71 [20], Ensembl v67.37 [11], and the 2012_05 release of Swiss-Prot [21, 22], including splice variants and representing 20,244 protein-coding genes. A PSM (peptide-spectrum match) FDR threshold of 0.0002 was applied to each dataset to yield a list of 218,799 distinct identified peptides and a protein-level FDR of 0.8% as computed by Mayu [13]. See Table 1 for a comparison with the previous build.
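To make the role of the decoy sequences concrete, the following minimal sketch shows the simple decoy-counting estimate of PSM-level FDR. It is an illustration only: the pipeline described above relies on the Trans-Proteomic Pipeline and Mayu, whose model-based estimates are more involved, and the `PSM` class, the score scale, and the sample values here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    peptide: str
    score: float    # hypothetical score; higher means a better match
    is_decoy: bool  # True if the spectrum matched a decoy sequence


def decoy_fdr(psms, threshold):
    """Estimate FDR as (#decoy PSMs) / (#target PSMs) at or above threshold."""
    accepted = [p for p in psms if p.score >= threshold]
    decoys = sum(p.is_decoy for p in accepted)
    targets = len(accepted) - decoys
    if targets == 0:
        # No target PSMs accepted: FDR is undefined; report inf if only decoys pass.
        return float("inf") if decoys else 0.0
    return decoys / targets


def threshold_for_fdr(psms, max_fdr):
    """Return the lowest observed score whose estimated FDR is <= max_fdr."""
    for t in sorted({p.score for p in psms}):
        if decoy_fdr(psms, t) <= max_fdr:
            return t
    return None  # no threshold satisfies the requested FDR


# Toy data (made up for illustration, not taken from the atlas build).
SAMPLE_PSMS = [
    PSM("PEPTIDEA", 9.0, False),
    PSM("PEPTIDEB", 8.0, False),
    PSM("PEPTIDEC", 7.0, False),
    PSM("EDITPEPA", 6.0, True),
    PSM("PEPTIDED", 5.0, False),
    PSM("EDITPEPB", 4.0, True),
]
```

With one decoy per target entry, the number of decoy hits above a threshold approximates the number of false target hits, which is why a stringent PSM-level cutoff such as the 0.0002 used here is needed to keep the protein-level FDR near 1% once many datasets are combined.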
A total of 12,629 Swiss-Prot entries were each found to contain at least one identified peptide in either the canonical form or one of the variant forms. (Thirty-six entries identified only by semi-tryptic or non-tryptic peptides are not included in this tally.) These entries formed the list referred to herein as PA-seen, and the remaining 7614 entries formed the list PA-unseen. Note that in some cases two or more proteins in the PA-seen list have identical or overlapping sets of identified peptides. The PA-seen list is not intended to be a parsimonious (minimal-redundancy) protein list, but to contain all Swiss-Prot entries with any peptide evidence in this atlas build.
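The seen/unseen split can be sketched as a simple set partition. This is an illustrative sketch under stated assumptions, not the build pipeline's code: the function name, the accession strings, and the peptide-to-entry mapping are hypothetical, and the mapping is assumed to have already been filtered to the peptides that should count as evidence.

```python
def partition_entries(all_entries, peptide_to_entries):
    """Split entries into (seen, unseen) by whether any peptide maps to them."""
    universe = set(all_entries)
    seen = set()
    for entries in peptide_to_entries.values():
        seen.update(entries)
    seen &= universe          # ignore mappings to entries outside the list
    unseen = universe - seen
    return seen, unseen


# Toy data: P1 and P2 share exactly the same peptides, yet both count as seen,
# because the list is deliberately not reduced to a parsimonious set.
ALL_ENTRIES = {"P1", "P2", "P3", "P4"}
PEPTIDE_TO_ENTRIES = {
    "PEPTIDEA": {"P1", "P2"},
    "PEPTIDEB": {"P1", "P2"},
    "PEPTIDEC": {"P3"},
}
SEEN, UNSEEN = partition_entries(ALL_ENTRIES, PEPTIDE_TO_ENTRIES)
```

Every entry with any peptide evidence lands in the seen set, so indistinguishable proteins are all retained rather than collapsed into one representative.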
2397 distinct peptides, or about 1% of the total, mapped only to a sequence in either IPI or Ensembl, and not to any Swiss-Prot sequence. A parsimonious mapping of these peptides covers a total of 1291 IPI or Ensembl identifiers.
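A parsimonious mapping is an instance of the set-cover problem: find a small set of identifiers that together explain every peptide. The text does not state which algorithm was used, so the sketch below substitutes the standard greedy approximation as an assumption. The function name and identifiers are hypothetical.

```python
def greedy_parsimonious_cover(peptide_to_ids):
    """Greedily pick identifiers until every peptide is explained by one of them."""
    # Invert the mapping: identifier -> set of peptides it contains.
    id_to_peptides = {}
    for peptide, ids in peptide_to_ids.items():
        for identifier in ids:
            id_to_peptides.setdefault(identifier, set()).add(peptide)

    # Only peptides with at least one candidate identifier can be covered.
    uncovered = {pep for pep, ids in peptide_to_ids.items() if ids}
    chosen = []
    while uncovered:
        # Pick the identifier explaining the most uncovered peptides;
        # iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(id_to_peptides),
                   key=lambda i: len(id_to_peptides[i] & uncovered))
        chosen.append(best)
        uncovered -= id_to_peptides[best]
    return chosen


# Toy data with made-up identifiers.
TOY_MAPPING = {
    "pepA": {"IPI_1", "ENSP_1"},
    "pepB": {"IPI_1"},
    "pepC": {"IPI_1", "ENSP_2"},
    "pepD": {"ENSP_2"},
}
COVER = greedy_parsimonious_cover(TOY_MAPPING)
```

In the toy data, `IPI_1` explains three peptides and `ENSP_2` explains the remaining one, so `ENSP_1` is never needed. Greedy selection is not guaranteed to find the smallest possible cover, but it is a common and simple approximation.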
Farrah T., Deutsch E.W., Hoopmann M.R., Hallows J.L., Sun Z., Huang C.Y., & Moritz R.L. (2012). The state of the human proteome in 2012 as viewed through PeptideAtlas. Journal of Proteome Research, 12(1), 162-171.