We extracted small toxins (proteins/peptides) from several databases and studies: ATDB [15], Arachno-Server [19], Conoserver [20], DBETH [16], BTXpred [17], NTXpred [18], and SwissProt [21]. We removed all proteins/peptides having more than 35 residues or containing any non-natural amino acid, which yielded 1805 unique toxic proteins/peptides. Using the same criteria, we also searched the SwissProt database for toxic proteins/peptides with the keyword KW800 (keyword 800 stands for toxin as a molecular function), obtaining a total of 803 toxic proteins shorter than 35 amino acids. Because many toxic peptides obtained from the other databases could also be present in SwissProt, identical toxic proteins/peptides were removed, leaving 303 unique toxic proteins/peptides from SwissProt. These proteins/peptides were treated as toxic peptides, i.e. positive examples.
Although well-annotated or experimentally validated toxic peptides can be extracted, it is difficult to obtain non-toxic peptides. Therefore, to create a negative dataset, we searched UniProt for protein/peptide sequences using the keywords NOT KW800 NOT KW20 (keywords 800 and 20 stand for toxin and allergen as molecular functions) and extracted sequences shorter than 35 amino acids. After removing sequences with non-natural amino acids, two negative datasets were created: the first consists of 3893 sequences from SwissProt (NOT KW800 NOT KW20), and the second consists of 13541 sequences from TrEMBL (keyword NOT KW800 AND KW33090) [21]. For the TrEMBL search, a plant-protein keyword was added as a search criterion because most plants are edible and the probability that a plant protein/peptide is toxic is therefore very low.
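The filtering steps described above (length cutoff, removal of sequences with non-natural amino acids, de-duplication, and removal of sequences already present in another source) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the toy sequences, and the choice to treat the 35-residue cutoff as inclusive are assumptions made only for illustration.

```python
# Illustrative sketch only; names and example sequences are hypothetical.
NATURAL_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 natural amino acids
MAX_LEN = 35  # assumption: sequences of up to 35 residues are kept


def is_valid_peptide(seq, max_len=MAX_LEN):
    """True if seq is non-empty, within the cutoff, and uses only natural residues."""
    seq = seq.strip().upper()
    return 0 < len(seq) <= max_len and set(seq) <= NATURAL_AA


def unique_valid(seqs, max_len=MAX_LEN):
    """Keep valid sequences, dropping exact duplicates while preserving order."""
    seen = set()
    kept = []
    for raw in seqs:
        seq = raw.strip().upper()
        if is_valid_peptide(seq, max_len) and seq not in seen:
            seen.add(seq)
            kept.append(seq)
    return kept


def remove_overlap(candidates, reference):
    """Drop candidates that are identical to any sequence in reference."""
    ref = {s.strip().upper() for s in reference}
    return [s for s in candidates if s.strip().upper() not in ref]


if __name__ == "__main__":
    # Toy sequences, not taken from any of the databases named in the text.
    from_databases = unique_valid(["GCCSDPRCAWRC", "gccsdprcawrc", "ACXB"])
    from_swissprot = unique_valid(["GCCSDPRCAWRC", "KKCIAKDYGRCKW"])
    # Keep only SwissProt entries not already collected from the other sources.
    print(remove_overlap(from_swissprot, from_databases))
```

Exact string matching is used for duplicate removal here because the text speaks of removing identical sequences; a similarity-based redundancy filter would be a different technique.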
The toxic and non-toxic peptides/proteins described above were used to generate various datasets for training, testing, and evaluating our models for predicting the toxicity of peptides (Figure 1). A brief description of these datasets follows: