Provided training set. The data suggested to participants as a training set for developing and optimizing their models were derived from the ToxCast™ and Tox21 programs (Dix et al. 2007; Huang et al. 2014; Judson et al. 2010). Concentration-response data from a collection of 18 in vitro HTS assays exploring multiple sites in the mammalian ER pathway were generated for 1,812 chemicals (Judson et al. 2015; U.S. EPA 2014c). This chemical library included 45 reference ER agonists and antagonists (including negatives), as well as a wide array of commercial chemicals with known estrogen-like activity (Judson et al. 2015). A mathematical model was developed to integrate the in vitro data and calculate an area under the curve (AUC) score, ranging from 0 to 1, which is roughly proportional to the consensus AC50 value across the active assays (Judson et al. 2015). A given chemical was considered active if its agonist or antagonist score was higher than 0.01. To reduce the number of potential false positives, this threshold can be increased to 0.1.
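The activity call described above can be illustrated with a minimal Python sketch. It only encodes the two thresholds quoted in the text (0.01 and 0.1); the `ERScores` container and the `is_active` function are hypothetical names introduced for illustration and are not part of the ToxCast™ ER model itself.

```python
from dataclasses import dataclass

# Thresholds quoted in the text; the constant names are illustrative.
DEFAULT_THRESHOLD = 0.01   # a score above this counts as active
STRICT_THRESHOLD = 0.1     # optional stricter cutoff to limit false positives


@dataclass(frozen=True)
class ERScores:
    """Hypothetical container for one chemical's AUC scores, each in [0, 1]."""
    agonist: float
    antagonist: float


def is_active(scores: ERScores, threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Return True if the agonist or antagonist score is higher than the threshold."""
    for value in (scores.agonist, scores.antagonist):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"AUC score out of range [0, 1]: {value}")
    return scores.agonist > threshold or scores.antagonist > threshold
```

With these assumptions, a chemical with an agonist score of 0.05 would be called active at the default threshold but inactive at the stricter 0.1 cutoff.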
Prediction set. We identified > 50,000 chemicals [at the level of Chemical Abstracts Service Registry Number (CASRN)] for use in this project as a virtual screening library to be prioritized for further testing and regulatory purposes. This set was intended to include a large fraction of all man-made chemicals to which humans may be exposed. These chemicals were collected from different sources with significant overlap and cover a variety of classes, including consumer products, food additives, and human and veterinary drugs. The following list includes the sources used in this project:
This virtual chemical library underwent stringent chemical structure processing and normalization for use in the QSAR modeling study (see “Chemical Structure Curation”) and was made available for download on the ToxCast™ Data web site under CERAPP data (https://www3.epa.gov/research/COMPTOX/CERAPP_files.html, PredictionSet.zip) (U.S. EPA 2016). It is intended to be used in a large number of other QSAR modeling projects, not just those focused on endocrine-related targets.
Experimental evaluation set. A large volume of estrogen-related experimental data has accumulated in the literature over the past two decades. The information on the estrogenic activity of chemicals was mined and curated to serve as a validation set for predictions of the different models. For this purpose, in vitro experimental data were collected from different overlapping sources, including the U.S. EPA’s HTS assays, online databases, and other data sets used by participants to train models:
The full data set consisted of > 60,000 entries, including binding, agonist, and antagonist information for ~ 15,000 unique chemical structures. For the purpose of this project, this data set was cleaned and made more consistent by removing in vivo data, cytotoxicity information, and all ambiguous entries (missing values, undefined/nonstandard end points, and unclear units). Only the 7,547 chemical structures from the experimental evaluation set that overlapped with the CERAPP prediction set, for a total of 44,641 entries, were kept and made available for download on the U.S. EPA ToxCast™ Data web site (https://www3.epa.gov/research/COMPTOX/CERAPP_files.html, EvaluationSet.zip) (U.S. EPA 2016). The non-CERAPP chemicals were excluded from the evaluation set (see “Chemical Structure Curation” section). Then, all data entries were categorized into three assay classes: (a) binding, (b) reporter gene/transactivation, or (c) cell proliferation. The training set end point to model is the ER model AUC, which parallels the corresponding individual assay AC50 values; therefore, all units for activities in the experimental data set were converted to μM to have approximately equivalent concentration–response values for the evaluation set. Chemicals with cell proliferation assays were considered active if they exceeded an arbitrary threshold of 125% proliferation. For entries where testing concentrations were reported in the assay name field, those values were converted to μM and considered as the AC50 value if the compound was reported as active. All inactive compounds were arbitrarily assigned an AC50 value of 1 M.
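The harmonization rules in the preceding paragraph (conversion to μM, the 125% proliferation cutoff, and the 1 M placeholder for inactives) can be sketched in Python as follows. This is an illustrative reading of the text, not the curation code used in the project: the function names, the set of accepted unit strings, and the choice to raise an error on unrecognized units are all assumptions.

```python
# Conversion factors from molar units to micromolar (standard SI prefixes).
# The set of unit spellings accepted here is an assumption for illustration.
TO_MICROMOLAR = {"M": 1e6, "mM": 1e3, "uM": 1.0, "nM": 1e-3, "pM": 1e-6}

# Inactive compounds are assigned an AC50 of 1 M, i.e. 1e6 uM (from the text).
INACTIVE_AC50_UM = 1e6

# Arbitrary proliferation cutoff quoted in the text, in percent.
PROLIFERATION_THRESHOLD_PCT = 125.0


def to_micromolar(value: float, unit: str) -> float:
    """Convert a concentration to uM; unknown units are rejected, not guessed."""
    # Accept both the Greek mu and the micro sign as spellings of "u".
    key = unit.strip().replace("\u03bc", "u").replace("\u00b5", "u")
    if key not in TO_MICROMOLAR:
        raise ValueError(f"unrecognized concentration unit: {unit!r}")
    return value * TO_MICROMOLAR[key]


def proliferation_is_active(percent: float) -> bool:
    """A cell proliferation result is active if it exceeds 125% proliferation."""
    return percent > PROLIFERATION_THRESHOLD_PCT


def ac50_micromolar(active: bool, value=None, unit=None) -> float:
    """Return the AC50 in uM, using the 1 M placeholder for inactive entries."""
    if not active:
        return INACTIVE_AC50_UM
    if value is None or unit is None:
        raise ValueError("an active entry needs both a value and a unit")
    return to_micromolar(value, unit)
```

Rejecting unknown units, rather than guessing a conversion, mirrors the removal of entries with unclear units described above. An entry reported as active at 250 nM would be stored as 0.25 μM under this sketch.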
Mansouri K., Abdelaziz A., Rybacka A., Roncaglioni A., Tropsha A., Varnek A., Zakharov A., Worth A., Richard A.M., Grulke C.M., Trisciuzzi D., Fourches D., Horvath D., Benfenati E., Muratov E., Wedebye E.B., Grisoni F., Mangiatordi G.F., Incisivo G.M., Hong H., Ng H.W., Tetko I.V., Balabin I., Kancherla J., Shen J., Burton J., Nicklaus M., Cassotti M., Nikolov N.G., Nicolotti O., Andersson P.L., Zang Q., Politi R., Beger R.D., Todeschini R., Huang R., Farag S., Rosenberg S.A., Slavov S., Hu X., & Judson R.S. (2016). CERAPP: Collaborative Estrogen Receptor Activity Prediction Project. Environmental Health Perspectives, 124(7), 1023-1033.