The numerical data on 113 piperidine derivatives (pyridine-substituted piperidines, tertiary alcohol-bearing piperidines, spirocyclic piperidines, and isoxazole-containing piperidines) were taken from the literature [9]. The activity is expressed as pIC50, that is, −log IC50 [9]. The set of compounds is split into (i) active training (≈25%), (ii) passive training (≈25%), (iii) calibration (≈25%), and (iv) validation (≈25%) sets. In our experience with CORAL, distributing the compounds equally over these four sets is likely the most rational strategy.
Each set has a defined task. The active training set is used to build the model: molecular features extracted from the simplified molecular-input line-entry system (SMILES) representation of each structure [28,29,37] in the active training set enter the Monte Carlo optimization, which assigns these features the correlation weights that give the largest target function value on the active training set. The passive training set checks whether the model built on the active training set is satisfactory for SMILES that were not involved in building it. The calibration set should detect when overtraining (overfitting) starts. The validation set makes it possible to assess the predictive potential of the model, since the validation data are not used while the model is built.
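The four-way split described above can be illustrated with a minimal sketch. It assumes a simple seeded random shuffle followed by round-robin assignment; the function name `split_four_ways`, the seed, and the integer compound identifiers are hypothetical, and the sketch is not the splitting procedure implemented in CORAL.

```python
import random

SET_NAMES = ("active_training", "passive_training", "calibration", "validation")

def split_four_ways(items, seed=0):
    """Shuffle items and deal them into four roughly equal subsets.

    Hypothetical helper for illustration only; not part of CORAL.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Round-robin slicing keeps subset sizes within one item of each other.
    return {name: shuffled[i::4] for i, name in enumerate(SET_NAMES)}

# Placeholder identifiers standing in for the 113 compounds (assumption).
compound_ids = list(range(113))
splits = split_four_ways(compound_ids, seed=42)
sizes = {name: len(members) for name, members in splits.items()}
```

With 113 compounds the subsets cannot be exactly equal, so one subset holds one compound more than the others. Repeating the split with different seeds gives different partitions of the same data.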
At the beginning of the optimization, the correlation coefficients between the experimental values of the endpoint and the descriptor increase simultaneously for all sets. The correlation coefficient for the calibration set, however, eventually reaches a maximum; this marks the start of overtraining, and further optimization decreases the correlation coefficient for the calibration set. The optimization should therefore be stopped when overtraining starts. Once the Monte Carlo optimization has been stopped, the validation set is used to assess the model's predictive potential.
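The stopping rule can be sketched as follows, assuming that the calibration-set correlation coefficient is recorded after each optimization epoch. The helper names `pearson_r` and `stopping_epoch`, the `patience` parameter, and the numerical values in `history` are illustrative assumptions; they are neither taken from CORAL nor from the data of this study.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def stopping_epoch(calibration_r, patience=3):
    """Return the epoch with the highest calibration correlation.

    Scanning ends once `patience` epochs pass without improvement, which is
    taken here as the sign that overtraining has started (an assumption).
    """
    best_epoch, best_r = 0, calibration_r[0]
    for epoch, r in enumerate(calibration_r):
        if r > best_r:
            best_epoch, best_r = epoch, r
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

# Made-up calibration curve: rises, peaks, then declines.
history = [0.40, 0.55, 0.66, 0.71, 0.69, 0.65, 0.60]
best = stopping_epoch(history)
```

In this sketch the correlation weights saved at epoch `best` would be the ones kept for the final model, and only then would the validation set be evaluated. The `patience` window is a design choice that guards against stopping at a small fluctuation rather than at the true maximum.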