In one of our studies, the objective of the Agent is to generate molecules that are predicted to be active against a biological target. The dopamine type 2 receptor (DRD2) was chosen as the target, and the corresponding bioactivity data were extracted from ExCAPE-DB [33]. This dataset contains 7218 actives (pIC50 > 5) and 343204 inactives (pIC50 < 5). A subset of 100,000 inactive compounds was randomly selected. To decrease the nearest-neighbour similarity between the training and testing structures [34–36], the actives were grouped into clusters based on their molecular similarity. The similarity measure was the Jaccard index [37], known for binary vectors as the Tanimoto similarity, computed on the RDKit implementation of binary Extended Connectivity Molecular Fingerprints with a diameter of 6 (ECFP6 [38]). The actives were clustered using the Butina clustering algorithm [39] in RDKit with a clustering cutoff of 0.4. In this algorithm, centroid molecules are selected, and every molecule with a similarity higher than 0.4 to a centroid is assigned to that centroid's cluster. The centroids are chosen so as to maximize the number of molecules that are assigned to any cluster. The clusters were sorted by size and iteratively assigned to the test, validation, and training sets (the training set receiving 4 clusters each iteration) to give a distribution of 1/6, 1/6, and 4/6 of the clusters, respectively. The inactive compounds, of which fewer than 0.5% were found to belong to any of the clusters formed by the actives, were split randomly into the three sets using the same ratios.
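The clustering and splitting procedure described above can be illustrated with a minimal, dependency-free sketch. It is not the RDKit implementation used in the study: plain Python sets of "on" bit indices stand in for ECFP6 fingerprints, `butina_like_clusters` is a simplified greedy variant written for illustration, and the per-round assignment order in `split_clusters` (one cluster to test, one to validation, four to training) is an assumption inferred from the stated 1/6, 1/6, 4/6 distribution. All function names are hypothetical.

```python
from itertools import cycle


def tanimoto(a, b):
    """Jaccard (Tanimoto) similarity between two sets of 'on' bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def butina_like_clusters(fps, sim_cutoff=0.4):
    """Greedy centroid clustering in the spirit of Butina (simplified sketch).

    Returns clusters as lists of indices, centroid first, largest cluster first.
    """
    n = len(fps)
    # Neighbour set of each molecule: itself plus everything above the cutoff.
    neighbours = [
        {j for j in range(n) if i == j or tanimoto(fps[i], fps[j]) > sim_cutoff}
        for i in range(n)
    ]
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Choose as centroid the molecule that captures the most unassigned
        # molecules; ties are broken by the lowest index.
        centroid = max(sorted(unassigned),
                       key=lambda i: len(neighbours[i] & unassigned))
        members = neighbours[centroid] & unassigned
        clusters.append([centroid] + sorted(members - {centroid}))
        unassigned -= members
    clusters.sort(key=len, reverse=True)  # stable sort, largest first
    return clusters


def split_clusters(clusters):
    """Deal size-sorted clusters out in rounds of six: 1 test, 1 validation, 4 training.

    The order within each round is an assumption, not taken from the source.
    """
    pattern = ("test", "validation", "training", "training", "training", "training")
    sets = {"test": [], "validation": [], "training": []}
    for cluster, name in zip(clusters, cycle(pattern)):
        sets[name].extend(cluster)
    return sets


# Toy fingerprints (hypothetical data): two groups of similar molecules and a singleton.
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 3, 6}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
    frozenset({20, 21}),
]
clusters = butina_like_clusters(fps, sim_cutoff=0.4)
split = split_clusters(clusters)
```

Because whole clusters are assigned to a single set, no test molecule shares a cluster with a training molecule, which is what lowers the nearest-neighbour similarity between the sets.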
A support vector machine (SVM) classifier with a Gaussian kernel was built in Scikit-learn [40] on the training set as a predictive model for DRD2 activity. The C and Gamma values used in the final model were obtained from a grid search, selecting the combination with the highest ROC-AUC on the validation set.
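A grid search of this kind might look like the following sketch in Scikit-learn. The synthetic data from `make_classification`, the grid values for C and gamma, and the helper name `grid_search_svm` are all assumptions made for illustration; the source does not report the grid that was searched, and the real model was trained on fingerprint features rather than on synthetic ones.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for fingerprint features and activity labels (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def grid_search_svm(X_train, y_train, X_val, y_val, c_grid, gamma_grid):
    """Fit an RBF-kernel SVM for each (C, gamma) pair on the training set and
    keep the pair with the highest ROC-AUC on the held-out validation set."""
    best = None
    for c in c_grid:
        for gamma in gamma_grid:
            model = SVC(kernel="rbf", C=c, gamma=gamma)
            model.fit(X_train, y_train)
            # decision_function gives a continuous score suitable for ROC-AUC.
            auc = roc_auc_score(y_val, model.decision_function(X_val))
            if best is None or auc > best[0]:
                best = (auc, c, gamma, model)
    return best


# Illustrative grid values only.
c_grid = [0.1, 1.0, 10.0]
gamma_grid = [1e-3, 1e-2, 1e-1]
best_auc, best_c, best_gamma, best_model = grid_search_svm(
    X_train, y_train, X_val, y_val, c_grid, gamma_grid
)
```

An explicit loop is used here instead of cross-validated search because the text describes selecting hyperparameters against one fixed validation set.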