For each treatment on each cell line, CMap provides gene-expression data for multiple biological replicates of vehicle-treated cells. For simplicity, we averaged gene-expression values across the multiple vehicle files. We then subtracted these values from the corresponding gene expression values for the compounds of interest. Finally, we merged the vehicle-adjusted data into separate files for MCF7 and PC3, respectively.
The SCAN algorithm is designed for precision-medicine workflows in which biological samples may arrive serially and thus may need to be processed one sample at a time [14 (link)]. This approach provides logistical advantages and ensures that the data distribution of each sample is similar, but it does not attempt to adjust for systematic differences that may be observed across samples. Therefore, during Phase II, we generated an alternative version of the data, which we normalized using the FARMS algorithm [16 (link)]—a multi-sample normalization method. This enabled us to evaluate whether the single-sample nature of the SCAN algorithm may have negatively affected classification accuracy in Phase I. Irrespective of normalization method, it is possible that batch effects can bias a machine-learning analysis. Indeed, the CMap data were processed in many batches. Therefore, for SCAN and FARMS, we created an additional version of the expression data by adjusting for batch effects using the ComBat algorithm [17 (link)].