A data set of eluted ligands was obtained from Pearson et al. (17). In addition, a set of positive CD8 epitopes was downloaded from the IEDB. The epitope set was identified using the search criteria "T cell assays: IFNg", "positive assays only", and "MHC restriction Type: Class I". Only entries with fully typed HLA restriction, peptide length in the range 8–14 amino acids, and an annotated source protein sequence were included. Positive entries with a predicted rank score greater than 10% using NetMHCpan-3.0 were excluded to filter out likely noise (6). For both the T cell epitope and eluted ligand data sets, negative peptides were obtained by extracting all 8–14mer peptides from the source proteins of the eluted ligands and subsequently excluding peptide-MHC combinations found with an exact match in the training data (both binding affinity and eluted ligand data sets). The final eluted ligand data set contained 15,965 positive ligands restricted to 27 different HLA molecules, and the IEDB T cell epitope data set contained 1,251 positive T cell epitopes restricted to 80 HLA molecules.
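The construction of negatives described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the representation of the training data as a set of (peptide, HLA) tuples, and the explicit removal of the positive peptides themselves are all assumptions made for the example.

```python
def enumerate_peptides(protein, min_len=8, max_len=14):
    """Yield every contiguous peptide of length min_len..max_len in a protein."""
    for length in range(min_len, max_len + 1):
        for start in range(len(protein) - length + 1):
            yield protein[start:start + length]


def negative_peptides(protein, hla, positives, training_pairs):
    """Return candidate negatives for one source protein and one HLA molecule.

    positives      -- set of positive peptides (assumed to be excluded as well)
    training_pairs -- set of (peptide, hla) tuples seen in the training data
                      (hypothetical representation of the exact-match filter)
    """
    candidates = set(enumerate_peptides(protein))
    return sorted(
        pep for pep in candidates
        if pep not in positives and (pep, hla) not in training_pairs
    )
```

Note that the exclusion is keyed on the peptide-HLA combination, so a peptide seen in training with one HLA molecule can still serve as a negative for another.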
A Frank value was calculated for each positive peptide-HLA pair as the ratio between the number of peptides with a prediction score higher than that of the positive peptide and the number of peptides contained within the source protein. The Frank value is hence 0 if the positive peptide has the highest prediction value of all peptides within the source protein, and 0.5 if equal numbers of peptides have higher and lower prediction values than the positive peptide.
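As an illustration of this definition, a Frank value could be computed as in the sketch below. The function name and the input layout are assumptions: `scores` is taken to hold the prediction scores of all candidate peptides from the source protein, including the positive peptide at `positive_index`.

```python
def frank(scores, positive_index):
    """Fraction of peptides in the source protein scoring above the positive."""
    positive_score = scores[positive_index]
    # Count only strictly higher scores; ties do not penalize the positive.
    higher = sum(1 for score in scores if score > positive_score)
    return higher / len(scores)


print(frank([0.9, 0.8, 0.5, 0.1], 0))  # positive ranks first -> 0.0
print(frank([0.9, 0.8, 0.5, 0.1], 2))  # two of four peptides score higher -> 0.5
```

Because the positive peptide is counted in the denominator, a positive ranked exactly in the middle gives a value slightly below 0.5, approaching 0.5 as the number of peptides in the protein grows.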
An unfiltered eluted ligand data set was obtained from Bassani-Sternberg et al. (22). This data set consisted of eluted ligand data from 6 cell lines, each with fully typed HLA-A, B and C alleles. A data set was constructed for each cell line, including all 8–13mer ligands as positives and, for each length 8–13, random natural peptides as negatives in a number equal to 5 times the total number of ligands. That is, if a data set contained 5,000 ligands, 5*5,000 = 25,000 random natural peptides of each of the lengths 8, 9, 10, 11, 12, and 13 were added as negatives, arriving at a final data set of 155,000 (5,000 + 6*25,000) peptides.
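The size arithmetic for one cell line can be checked with a short sketch. The function name and parameter names are hypothetical; only the factor of 5 and the length range 8–13 come from the text.

```python
def benchmark_size(n_ligands, lengths=range(8, 14), negatives_per_ligand=5):
    """Total peptides in a per-cell-line data set: positives plus negatives.

    For every peptide length, negatives_per_ligand * n_ligands random natural
    peptides are added, regardless of how many ligands have that length.
    """
    negatives_per_length = negatives_per_ligand * n_ligands
    return n_ligands + len(lengths) * negatives_per_length


print(benchmark_size(5000))  # 5000 + 6 * 25000 = 155000
```

Under this scheme every data set has the same ratio of 30 negatives per positive overall, while the ratio within a given length depends on how many ligands of that length were eluted.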