Predictive polypharmacology profiling was undertaken using Bayesian activity models, based on our previously published approach9. The Bayesian method for polypharmacology profiling was chosen as it provides both good performance on noisy datasets and a high speed of calculation51. High-confidence models were built using ChEMBL (release 1). Activity data were filtered to keep only activity endpoints with IC50, Ki or EC50 values and a ChEMBL confidence score of at least 7 (protein assignment direct or by homologue). A compound was considered active when its mean activity value was below 10 μM. All inactive compounds were assigned to the target ‘none’. Following this procedure, 133,061 compounds with 215,967 activity endpoints remained and were used for model building.

Multiple-category Laplacian-modified naïve Bayesian models were built with ECFP6 representations52 for 784 targets. For each model, the data were split in two for the validation step: compounds were clustered and assigned a cluster number; clusters with odd numbers were assigned to the test set and clusters with even numbers to the training set. Models were built with the training set and used to score the test set; for comparison, the training set was also scored with its own model. Finally, a model was built with all the data and scored against itself; the training set and the whole dataset should provide similar validation statistics. Statistics on the performance of the models are given in Supplementary Table 12. The results for the model containing all 785 targets were very similar to those for the models of the receptor subsets.

Two analyses were used to assess the performance of the different models. The first provides an overall score and does not require a cut-off to distinguish active from inactive compounds. The area under the Receiver Operating Characteristic (ROC) curve (AUC) indicates the ability of a model to prioritize active compounds over inactive compounds; the ROC curve plots the true positive rate against the false positive rate. However, the AUC does not capture early enrichment, which is important in studies such as the present one where only the top-ranking compounds are considered. Therefore the Boltzmann-Enhanced Discrimination of ROC (BEDROC)53 was used, which addresses the early-enrichment issue by weighting compounds retrieved early in the ranking. BEDROC is derived from the Robust Initial Enhancement (RIE), and the Sum of Log of Ranks test (SLR)54 was used as a statistical test of whether a ranking method performs better than random. The percentage of active compounds retrieved in the top 5% of the ranked list (recall at 5%) was also calculated.

The second analysis requires a cut-off to distinguish predicted actives from inactives, since the resulting statistics depend on where in the ranked list the cut-off is placed. For each model, the specificity (true negative rate), sensitivity (true positive rate), false positive rate, false negative rate, precision, F-measure and Matthews Correlation Coefficient (MCC) were calculated at different cut-off values. The cut-off providing the best MCC score was used, as this has been shown to give better performance55 (Supplementary Fig. 12). The quality of the models was also assessed using an internal leave-one-out validation: each compound in turn formed the test set and was scored using the remaining data as the training set, and the area under the ROC curve was then calculated (Supplementary Fig. 13).
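As an illustration of the modelling scheme described above, the following is a minimal Python sketch of a Laplacian-modified naïve Bayesian activity model. It is not the code used in this work: it assumes RDKit is available, uses the Morgan fingerprint at radius 3 as a stand-in for ECFP6, and all function and class names are illustrative.

```python
# Minimal sketch of a Laplacian-modified naive Bayes activity model on
# ECFP6-style features. Feature weights follow the standard Laplacian
# correction: w_f = log[(A_f + 1) / (T_f * P(active) + 1)], where A_f and T_f
# are the active and total counts of training compounds containing feature f
# and P(active) is the base rate of actives. A compound's score for a target
# is the sum of w_f over its features.

from collections import defaultdict
from math import log

from rdkit import Chem
from rdkit.Chem import AllChem


def ecfp6_features(smiles):
    """Sparse Morgan (radius 3, ~ECFP6) feature ids for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    fp = AllChem.GetMorganFingerprint(mol, 3)  # sparse count fingerprint
    return set(fp.GetNonzeroElements())


class LaplacianNaiveBayes:
    """One model per target; the multiple-category profile is obtained by
    training one such model for every target (including the 'none' category)
    and scoring a compound against all of them."""

    def fit(self, feature_sets, labels):
        n_total = len(feature_sets)
        p_active = sum(labels) / n_total
        total_count = defaultdict(int)
        active_count = defaultdict(int)
        for feats, is_active in zip(feature_sets, labels):
            for f in feats:
                total_count[f] += 1
                if is_active:
                    active_count[f] += 1
        # Laplacian-corrected log-likelihood weight per feature
        self.weights = {
            f: log((active_count[f] + 1.0) / (total_count[f] * p_active + 1.0))
            for f in total_count
        }
        return self

    def score(self, feats):
        # Features unseen in training contribute nothing to the score
        return sum(self.weights.get(f, 0.0) for f in feats)
```

After training each target model on the even-numbered clusters, the odd-numbered (test) clusters can be scored and ranked, and AUC and BEDROC computed on that ranking, for example with rdkit.ML.Scoring.Scoring.CalcAUC and CalcBEDROC.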
A cut-off score minimizing the sum of the percentages of misclassified category members and misclassified category non-members was calculated and used to classify compounds in the contingency table.
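The two cut-off rules used here (maximizing the MCC, and minimizing the summed misclassification rates of category members and non-members) can be sketched as follows; this is an illustrative outline assuming scikit-learn and NumPy, not the pipeline used in the study.

```python
# Hedged sketch of cut-off selection for one target model: given Bayesian
# scores and binary activity labels, pick (a) the cut-off maximizing the
# Matthews Correlation Coefficient and (b) the cut-off minimizing the sum of
# the misclassification rates of members (false negative rate) and
# non-members (false positive rate).

import numpy as np
from sklearn.metrics import matthews_corrcoef


def select_cutoffs(scores, labels):
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    best_mcc, mcc_cutoff = -1.0, None
    best_err, err_cutoff = np.inf, None
    for c in np.unique(scores):
        pred = (scores >= c).astype(int)
        mcc = matthews_corrcoef(labels, pred)
        if mcc > best_mcc:
            best_mcc, mcc_cutoff = mcc, c
        fnr = np.mean(pred[labels == 1] == 0)  # misclassified members
        fpr = np.mean(pred[labels == 0] == 1)  # misclassified non-members
        if fnr + fpr < best_err:
            best_err, err_cutoff = fnr + fpr, c
    return mcc_cutoff, err_cutoff
```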
An All Data Model for dopamine receptors only was built using data from a pre-release of ChEMBL (StARLite version 31), with similar numbers of compounds and endpoints. This model was built without filtering on the confidence level of the target assignment, in order to gather as much data as possible, and was used for the initial calculations on the evolution of the isoindole series and the 2,3-dihydro-indol-1-yl series. The quality of the models was assessed using the same procedures as described above. The results from the All Data Models and the High Confidence Models were very similar (e.g. D2 model R2 = 0.998, D4 model R2 = 0.984).
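The agreement between the two model variants can be quantified, for example, by scoring the same compounds with both and computing the squared correlation of the two score vectors; a minimal sketch (reusing the illustrative model class above) is shown below.

```python
# Hedged sketch: compare two trained models (e.g. All Data vs High Confidence
# D2 models) by scoring the same feature sets with each and reporting R^2
# between the two score vectors.

import numpy as np


def model_agreement(model_a, model_b, feature_sets):
    a = np.array([model_a.score(f) for f in feature_sets])
    b = np.array([model_b.score(f) for f in feature_sets])
    r = np.corrcoef(a, b)[0, 1]
    return r ** 2
```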