From the standardized molecule structures, InChI keys were calculated and used to remove duplicates in the dataset. In the case of multiple LD50 values measured for one compound, the lowest dose value was kept to represent the worst-case toxicity of a compound. Six toxicity classes were defined based on the GHS classification scheme using the LD50 thresholds of 5, 50, 300, 2000 and 5000 mg/kg body weight. Each compound of the dataset was represented using a concatenated fingerprint consisting of the ‘FP2’ and ‘FP4’ fingerprints of Mychem (
In addition to the similarity search, the prediction method takes into account the presence of toxic fragments. All compounds in the database were fragmented using RECAP (20) as well as the in-house method ROTBONDS (21). The occurrence of each distinct fragment in the molecules of the prediction dataset was tested with a substructure search implemented using Open Babel's (19) fast search, with each fragment represented by its SMILES string computed with JChem 6.1.3 (November 2013). To determine fragments over-represented in the most toxic classes, a propensity analysis (22) was performed. Propensity scores (PS) were calculated for every fragment and toxicity class. Toxic fragments were defined as those showing a PS above a threshold of 3 in classes I, II or III, and a PS below 1 in classes IV–VI. Based on these conditions, a total of 1591 and 1580 fragments specific to toxicity classes I–III, generated with the ROTBONDS and RECAP fragmentation methods, respectively, were considered for prediction.
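The class assignment and the fragment selection rule described above can be illustrated with a minimal Python sketch. It is not the original implementation, and it rests on several assumptions. The text does not state how the class boundaries are handled, so upper bounds are treated as inclusive. The propensity score is not defined in the text either; the sketch uses the ratio of a fragment's frequency within a class to the overall frequency of that class, which may differ from the cited propensity analysis (22). The selection rule is read as "PS above 3 in at least one of classes I–III and PS below 1 in all of classes IV–VI". The function names and input dictionaries are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

# LD50 thresholds in mg/kg body weight, as given in the text.
GHS_THRESHOLDS = [5, 50, 300, 2000, 5000]
TOXIC_CLASSES = (1, 2, 3)      # classes I-III
NONTOXIC_CLASSES = (4, 5, 6)   # classes IV-VI


def ghs_class(ld50):
    """Map an LD50 value to a toxicity class 1..6.

    Assumption: upper bounds are inclusive (LD50 <= 5 is class I).
    """
    return bisect_left(GHS_THRESHOLDS, ld50) + 1


def propensity_scores(compound_classes, fragment_hits):
    """Compute a propensity score per fragment and class.

    compound_classes: {compound_id: toxicity class}
    fragment_hits:    {fragment SMILES: set of compound_ids containing it}

    Assumed definition: PS(f, c) = (n_fc / n_f) / (N_c / N), where n_fc is
    the number of class-c compounds containing f, n_f the number of compounds
    containing f, N_c the size of class c and N the dataset size.
    """
    total = len(compound_classes)
    class_sizes = Counter(compound_classes.values())
    scores = {}
    for fragment, hits in fragment_hits.items():
        if not hits:
            continue  # fragment never observed; no score defined
        in_class = Counter(compound_classes[c] for c in hits)
        scores[fragment] = {
            cls: (in_class[cls] / len(hits)) / (size / total)
            for cls, size in class_sizes.items()
        }
    return scores


def is_toxic_fragment(ps_by_class, high=3.0, low=1.0):
    """Apply the selection rule under the reading stated in the lead-in."""
    enriched = any(ps_by_class.get(c, 0.0) > high for c in TOXIC_CLASSES)
    depleted = all(ps_by_class.get(c, 0.0) < low for c in NONTOXIC_CLASSES)
    return enriched and depleted
```

In use, each compound's worst-case LD50 would first be mapped to a class with `ghs_class`, the substructure search would supply the fragment-to-compound hits, and `is_toxic_fragment` would then be applied to each fragment's scores.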