The benchmark dataset (Supplementary Data S1) used in this work was assembled by extracting antioxidant peptides of 2–30 amino acids from published articles and from the BIOPEP-UWM database [43]. The peptides were either derived from different protein sources (e.g., fish [40] and dairy [41]) or synthetic [42]. Each peptide received a binary label for each of two classes, free radical scavenger (FRS) and chelator: 1 (positive) if its source had measured or indicated the activity, and 0 (negative) otherwise. This extraction yielded 696 antioxidant peptides (685 FRS and 81 chelating, 70 of which have both activities) and 218 experimentally validated non-antioxidant peptides, as seen in Table 1. Furthermore, to reduce homology bias during training, sequences were removed from both the positive and the negative peptides so that no pair had more than 90% identity [44]. All sequence identities in this paper were calculated using the Needleman–Wunsch algorithm [45] with the following parameters: 1 for identical residues, 0 for dissimilar residues, −10 for opening and extending gaps, and 0 for end gaps.
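The identity calculation and the 90% redundancy filter can be illustrated with a minimal sketch. It uses the scoring stated in the text (1 for a match, 0 for a mismatch, −10 per internal gap position, 0 for end gaps). Two points are assumptions not given in the source: identity is normalised by the length of the shorter peptide, and redundant sequences are removed greedily in input order. The function names `identity` and `remove_redundant` are illustrative only.

```python
def identity(a, b, match=1, mismatch=0, gap=-10):
    """Fraction of identical residues in a global alignment with free end gaps.

    Each cell holds (score, identical residues), so the alignment with the best
    score is chosen and ties are broken towards more identical residues.
    ASSUMPTION: the count is divided by the length of the shorter sequence.
    """
    prev = [(0, 0)] * (len(b) + 1)  # first row: leading end gaps cost nothing
    best = prev[-1]
    for i in range(1, len(a) + 1):
        cur = [(0, 0)]              # first column: leading end gaps cost nothing
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            diag = (prev[j - 1][0] + (match if same else mismatch),
                    prev[j - 1][1] + int(same))
            up = (prev[j][0] + gap, prev[j][1])          # internal gap in b
            left = (cur[j - 1][0] + gap, cur[j - 1][1])  # internal gap in a
            cur.append(max(diag, up, left))
        best = max(best, cur[-1])   # last column: trailing end gap is free
        prev = cur
    best = max(best, max(prev))     # last row: trailing end gap is free
    return best[1] / min(len(a), len(b))


def remove_redundant(peptides, threshold=0.90):
    """Greedy filter (ASSUMPTION): keep a peptide only if its identity to
    every peptide kept so far does not exceed the threshold."""
    kept = []
    for peptide in peptides:
        if all(identity(peptide, other) <= threshold for other in kept):
            kept.append(peptide)
    return kept
```

In this sketch the filter would be run separately on the positive and on the negative peptides. A greedy pass depends on input order, so a different ordering may retain a different subset of equal quality.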

Table 1. Overview of the benchmark dataset.

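| Dataset | FRS | CheL | FRS/CheL | Non-AO | Random | Total |
|---|---|---|---|---|---|---|
| AOdb | 615 | 11 | 70 | 218 | 500 | 1414 |
| AOdb < 90% | 606 | 11 | 70 | 217 | 500 | 1404 |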

FRS, CheL, FRS/CheL and Non-AO are all experimentally validated peptides obtained from various papers. Random consists of peptides of 2–30 amino acids derived from the UniProt database [46]. AOdb < 90% gives the number of peptides after removal of sequences so that no pair has more than 90% identity. FRS, free radical scavenger; CheL, chelator; FRS/CheL, both FRS and chelator; Non-AO, non-antioxidant.
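As a consistency check, the counts in Table 1 can be recombined into the totals quoted in the text. The sketch below does this arithmetic. The last two entries reproduce the positive-to-negative ratios reported for the final dataset only if the negatives are taken to be all non-antioxidant peptides (Non-AO plus Random); that reading is inferred from the numbers and is not stated in the source. The helper name `summarize` is illustrative.

```python
COLUMNS = ["FRS", "CheL", "FRS/CheL", "Non-AO", "Random"]

# Counts copied from Table 1.
TABLE = {
    "AOdb": [615, 11, 70, 218, 500],
    "AOdb < 90%": [606, 11, 70, 217, 500],
}


def summarize(row):
    """Derive class totals from one table row."""
    c = dict(zip(COLUMNS, row))
    frs = c["FRS"] + c["FRS/CheL"]             # every peptide with FRS activity
    chelators = c["CheL"] + c["FRS/CheL"]      # every peptide with chelating activity
    antioxidant = c["FRS"] + c["CheL"] + c["FRS/CheL"]
    non_antioxidant = c["Non-AO"] + c["Random"]
    return {
        "total": sum(row),
        "frs": frs,
        "chelators": chelators,
        "antioxidant": antioxidant,
        "non_antioxidant": non_antioxidant,
        # ASSUMPTION: ratios are taken against all non-antioxidant peptides.
        "frs_ratio": round(frs / non_antioxidant, 2),
        "chelator_ratio": round(chelators / non_antioxidant, 2),
    }


for name, row in TABLE.items():
    print(name, summarize(row))
```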

Additionally, 500 random peptides of 2–30 amino acids, with the same length distribution as the positive dataset, were extracted from random proteins in the UniProt database [46]. It was ensured that none of these peptides was identical to any peptide in the positive dataset. This amounted to a final, balanced benchmark dataset of 1404 peptides, consisting of 687 antioxidant peptides (FRS and chelators) and 717 peptides termed non-antioxidant, with positive-to-negative ratios of 0.94 and 0.11 for FRS and chelators, respectively.
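One way to draw such length-matched random peptides is sketched below. The source does not describe the sampling procedure, so everything here is an assumption: lengths are drawn from the empirical lengths of the positive peptides (which matches their distribution in expectation, not exactly), fragments are cut at uniformly random positions, and duplicates are discarded. The function name `sample_length_matched` and the `proteins` input (a list of protein sequences, e.g. read from a UniProt download) are hypothetical.

```python
import random


def sample_length_matched(proteins, positives, n, seed=0):
    """Draw n distinct fragments whose lengths follow the positive set.

    proteins  : list of protein sequences to cut fragments from (hypothetical input)
    positives : list of positive peptides; supplies lengths and the exclusion set
    """
    rng = random.Random(seed)
    excluded = set(positives)
    lengths = [len(p) for p in positives]  # empirical length distribution
    picked = set()
    attempts = 0
    while len(picked) < n and attempts < 100 * n:  # guard against endless loops
        attempts += 1
        length = rng.choice(lengths)
        protein = rng.choice(proteins)
        if len(protein) < length:
            continue
        start = rng.randrange(len(protein) - length + 1)
        fragment = protein[start:start + length]
        if fragment not in excluded:  # never identical to a positive peptide
            picked.add(fragment)
    return sorted(picked)
```

The attempt limit means the function may return fewer than `n` fragments when the protein pool is too small, so a caller would want to check the length of the result.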
To improve generalization and obtain a robust estimate of the model's accuracy on unseen cases, a fivefold nested cross-validation approach was used [29]. The five folds were created so that all contained similar numbers of positives and negatives, and of FRS and chelators. Furthermore, an upper threshold on peptide identity was enforced for any two peptides in different folds. Four partitions were made, with thresholds of 60, 70, 80 and 90% identity between folds, respectively.
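The source does not say how the folds were built. One common way to satisfy an identity threshold between folds, shown here as a sketch and not as the authors' procedure, is to link every pair of peptides whose identity exceeds the threshold, treat each connected group as a unit, and assign whole groups to folds while balancing the labels. The sketch uses a single binary label, whereas the paper balances both FRS and chelators, and `toy_identity` is a placeholder for the Needleman–Wunsch identity. All names are illustrative.

```python
from collections import defaultdict


def toy_identity(a, b):
    """Placeholder similarity: matching positions over the shorter peptide."""
    return sum(x == y for x, y in zip(a, b)) / min(len(a), len(b))


def cluster(peptides, threshold, identity):
    """Single-linkage groups (union-find): link pairs with identity > threshold."""
    parent = list(range(len(peptides)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(peptides)):
        for j in range(i + 1, len(peptides)):
            if identity(peptides[i], peptides[j]) > threshold:
                parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(len(peptides)):
        groups[find(i)].append(i)
    return list(groups.values())


def make_folds(peptides, labels, threshold, k=5, identity=toy_identity):
    """Assign whole groups to folds, largest first, balancing label counts."""
    folds = [[] for _ in range(k)]
    pos, neg = [0] * k, [0] * k
    for group in sorted(cluster(peptides, threshold, identity), key=len, reverse=True):
        p = sum(labels[i] for i in group)
        q = len(group) - p
        counts = pos if p >= q else neg  # balance the group's majority label
        target = min(range(k), key=lambda f: (counts[f], len(folds[f])))
        folds[target].extend(group)
        pos[target] += p
        neg[target] += q
    return folds  # lists of peptide indices, one list per fold
```

Because linked peptides always travel together, no two peptides in different folds can exceed the threshold. The cost is that a low threshold such as 60% may produce a few very large groups, which makes the folds harder to balance.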