The training set was a subset of CMap consisting of gene-expression data and known DILI status for 190 small molecules (130 of which had been found to cause DILI in patients). The test set consisted of an additional 86 small molecules. The CMap gene-expression data were generated using Affymetrix gene-expression microarrays. In Phase I, we used the Single Channel Array Normalization (SCAN) algorithm [14], a single-sample normalization method, to process the individual CEL files (raw data), which we downloaded from the CMap website (https://portals.broadinstitute.org/cmap/). As part of the normalization process, we used BrainArray annotations to discard faulty probes and to summarize the values at the gene level (using Entrez Gene identifiers) [15]. We wrote custom Python scripts (https://python.org) to summarize the data and execute analytical steps. The scripts we used to normalize and prepare the data are available at https://osf.io/v3qyg/.
For each treatment on each cell line, CMap provides gene-expression data for multiple biological replicates of vehicle-treated cells. For simplicity, we averaged gene-expression values across the multiple vehicle files. We then subtracted these averaged values from the corresponding gene-expression values for the compounds of interest. Finally, we merged the vehicle-adjusted data into two files, one for MCF7 and one for PC3.
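The vehicle adjustment described above can be sketched in a few lines of Python. This is a minimal illustration, not the scripts deposited at OSF: the function name `vehicle_adjust`, the dictionary-based data layout, and the expression values are all hypothetical, and the sketch assumes each profile maps Entrez Gene identifiers to normalized expression values.

```python
from statistics import mean


def vehicle_adjust(treated, vehicles):
    """Subtract the per-gene mean of the vehicle replicates from a treated profile.

    treated:  dict mapping gene ID -> expression value for one compound
    vehicles: list of dicts, one per vehicle replicate, keyed by gene ID
    (Hypothetical layout chosen for illustration only.)
    """
    if not vehicles:
        raise ValueError("at least one vehicle replicate is required")
    # Keep only genes present in the treated profile and in every replicate.
    shared = set(treated).intersection(*(set(v) for v in vehicles))
    adjusted = {}
    for gene in sorted(shared):
        baseline = mean(v[gene] for v in vehicles)  # average across vehicle files
        adjusted[gene] = treated[gene] - baseline   # vehicle-adjusted value
    return adjusted


# Made-up values for two Entrez Gene IDs and two vehicle replicates.
treated = {"1017": 2.0, "7157": 0.25}
vehicles = [{"1017": 1.0, "7157": 0.5}, {"1017": 2.0, "7157": 0.0}]
adjusted = vehicle_adjust(treated, vehicles)
print(adjusted)
```

In this toy example the vehicle means are 1.5 and 0.25, so the adjusted values are 0.5 and 0.0. Restricting to shared genes is a design choice of the sketch; it avoids silently treating a missing vehicle measurement as zero.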
The SCAN algorithm is designed for precision-medicine workflows in which biological samples may arrive serially and thus may need to be processed one sample at a time [14]. This approach provides logistical advantages and ensures that the data distribution of each sample is similar, but it does not attempt to adjust for systematic differences that may be observed across samples. Therefore, during Phase II, we generated an alternative version of the data, which we normalized using the FARMS algorithm [16], a multi-sample normalization method. This enabled us to evaluate whether the single-sample nature of the SCAN algorithm may have negatively affected classification accuracy in Phase I. Irrespective of the normalization method, batch effects may bias a machine-learning analysis, and the CMap data were indeed processed in many batches. Therefore, for both the SCAN-normalized and the FARMS-normalized data, we created an additional version of the expression data by adjusting for batch effects using the ComBat algorithm [17].
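To make the idea of a batch adjustment concrete, the sketch below re-centers each gene so that every batch shares the overall mean. This is a deliberately simplified, location-only stand-in and is not the ComBat algorithm, which additionally models batch-specific scale and uses empirical Bayes shrinkage of the batch parameters. The function name `center_by_batch`, the data layout, and the values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def center_by_batch(profiles, batches):
    """Location-only batch adjustment (a simplification, NOT ComBat).

    profiles: dict mapping sample ID -> dict of gene ID -> expression value
    batches:  dict mapping sample ID -> batch label
    Assumes every sample reports the same set of genes.
    """
    members_by_batch = defaultdict(list)
    for sample in profiles:
        members_by_batch[batches[sample]].append(sample)

    genes = sorted(next(iter(profiles.values())))
    # Overall per-gene mean across all samples, restored after centering.
    grand_mean = {g: mean(profiles[s][g] for s in profiles) for g in genes}

    adjusted = {s: {} for s in profiles}
    for members in members_by_batch.values():
        for g in genes:
            batch_mean = mean(profiles[s][g] for s in members)
            for s in members:
                # Remove the batch-specific offset, keep the overall level.
                adjusted[s][g] = profiles[s][g] - batch_mean + grand_mean[g]
    return adjusted


# Made-up data: batch B sits 4 units above batch A for gene "g".
profiles = {"s1": {"g": 1.0}, "s2": {"g": 3.0}, "s3": {"g": 5.0}, "s4": {"g": 7.0}}
batches = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
centered = center_by_batch(profiles, batches)
print(centered)
```

After centering, both batches have a mean of 4.0 for gene "g", while the within-batch differences between samples are preserved.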