The dataset is constructed by extracting non-redundant linear B-cell epitopes from IEDB [9], which is frequently updated and contains a large number of linear epitopes. A total of 65,456 linear B-cell epitopes are downloaded from IEDB (version of June 11, 2012). Identical epitopes and those possibly related to T cells are removed. The full-length antigen sequences corresponding to the epitopes are also collected. Epitope sequences of fixed lengths (10, 12, 14, 16, 18, and 20 amino acids) are then derived by trimming long experimentally measured epitopes or by extending short epitopes at both ends with residues taken from the full-length sequences. For a given length, epitope sequences with ≥30% similarity, as measured by BLAST [24], are clustered together and only one sequence per cluster is kept in the dataset. The final dataset for each length contains 4,925 non-redundant epitope sequences. For the negative dataset, the same number of equal-length sub-sequences is extracted from the non-epitopic segments of the corresponding antigen sequences.
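The trimming, extension, and negative-sampling steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `resize_epitope` and `non_epitope_windows`, the 0-based half-open coordinates, the centring of the window on the epitope midpoint, and the clamping at the antigen ends are all assumptions made for illustration, since the text does not specify how residues are distributed between the two ends.

```python
from typing import List, Tuple


def resize_epitope(antigen: str, start: int, end: int, target_len: int) -> str:
    """Return a target_len window centred on the epitope antigen[start:end].

    Coordinates are 0-based with an exclusive end (an assumption of this sketch).
    Long epitopes are trimmed; short ones gain flanking antigen residues.
    """
    if target_len > len(antigen):
        raise ValueError("antigen is shorter than the target length")
    centre = (start + end) // 2
    new_start = centre - target_len // 2
    # Clamp the window so that it stays inside the antigen sequence.
    new_start = max(0, min(new_start, len(antigen) - target_len))
    return antigen[new_start:new_start + target_len]


def non_epitope_windows(
    antigen: str, spans: List[Tuple[int, int]], target_len: int
) -> List[str]:
    """List every target_len window that overlaps no annotated epitope span."""
    windows = []
    for i in range(len(antigen) - target_len + 1):
        j = i + target_len
        # Keep the window only if it lies entirely before or after each span.
        if all(j <= s or i >= e for s, e in spans):
            windows.append(antigen[i:j])
    return windows


# Toy antigen (the alphabet) with a hypothetical 4-residue epitope at 10..14.
antigen = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
extended = resize_epitope(antigen, 10, 14, 10)   # short epitope, extended
trimmed = resize_epitope(antigen, 2, 22, 10)     # long epitope, trimmed
negatives = non_epitope_windows(antigen, [(10, 14)], 10)
```

In a real pipeline, negatives would be sampled from the candidate windows until their count matches the number of positive sequences of the same length. The redundancy filter at 30% similarity would be applied separately with BLAST.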