We applied the proposed ComBat-seq approach to an RNA-seq dataset from a perturbation experiment using primary breast tissue, which aimed to profile the activity levels of growth factor receptor network (GFRN) pathways in relation to breast cancer progression (13,14). We took a subset of the experiments, consisting of three batches. In each batch, the expression of a specific GFRN oncogene was induced by transfection to activate the downstream pathway signals (a different oncogene/pathway in each batch). Controls were transfected with a vector that expresses green fluorescent protein (GFP), and GFP controls were present in all batches. More specifically, batch 1 contains five replicates of cells overexpressing HER2 and 12 replicates of GFP controls (GEO accession GSE83083); batch 2 contains six replicates each of cells overexpressing EGFR and the corresponding controls (GEO accession GSE59765); batch 3 consists of nine replicates each of cells overexpressing wild-type KRAS and GFP controls (GEO accession GSE83083).
Note that this is a challenging study design for batch effect adjustment: the control samples are balanced across batches, while each of the three kinds of treated cells, with different levels of biological signal, is completely nested within a single batch. A favorable adjustment would pool the control samples from the three batches while keeping all treated cells separated from the controls and from each other.
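The nesting described above can be checked directly from the replicate counts given in the text. The following is a minimal sketch; the `design` dictionary and the helper `batches_per_condition` are illustrative names introduced here, not part of the original analysis.

```python
# Replicate counts per (batch, condition), as stated in the text.
design = {
    (1, "GFP"): 12, (1, "HER2"): 5,
    (2, "GFP"): 6, (2, "EGFR"): 6,
    (3, "GFP"): 9, (3, "KRAS"): 9,
}


def batches_per_condition(design):
    """Map each condition to the set of batches in which it appears."""
    out = {}
    for (batch, condition), n_replicates in design.items():
        if n_replicates > 0:
            out.setdefault(condition, set()).add(batch)
    return out


cond_batches = batches_per_condition(design)

# Conditions seen in exactly one batch are fully nested within that batch.
nested = sorted(c for c, b in cond_batches.items() if len(b) == 1)
# Conditions seen in several batches can anchor the batch adjustment.
shared = sorted(c for c, b in cond_batches.items() if len(b) > 1)
total_samples = sum(design.values())

print("nested within one batch:", nested)
print("shared across batches:", shared)
print("total samples:", total_samples)
```

Only the GFP controls span all three batches, so they are the only samples from which the batch effect can be separated from the treatment effect.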
We combined the three batches and performed batch correction. Among the batch correction methods considered, only three output adjusted data: RUV-seq, the original ComBat applied to log-transformed and normalized data, and ComBat-seq. We applied these methods to address the batch effects in the pathway signature dataset. We compared ComBat-seq with the other methods, both qualitatively through principal component analysis (PCA) and quantitatively through the variation explained by condition and by batch.
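One simple way to quantify the variation explained by a factor is a per-gene one-way ANOVA R², computed separately for batch and for condition. The sketch below uses this marginal R² as an assumption; the text does not specify the exact model used, and the function name `r_squared` and the toy values are hypothetical.

```python
import numpy as np


def r_squared(y, labels):
    """Fraction of the variance in y explained by a categorical factor
    (between-group sum of squares over total sum of squares)."""
    y = np.asarray(y, dtype=float)
    labels = np.asarray(labels)
    grand_mean = y.mean()
    total_ss = np.sum((y - grand_mean) ** 2)
    if total_ss == 0:
        return 0.0  # a constant gene has no variation to explain
    between_ss = 0.0
    for group in np.unique(labels):
        members = y[labels == group]
        between_ss += members.size * (members.mean() - grand_mean) ** 2
    return float(between_ss / total_ss)


# Toy layout mirroring the study design: controls in every batch,
# each treatment nested in one batch (two samples per batch for brevity).
batch = np.array([1, 1, 2, 2, 3, 3])
condition = np.array(["GFP", "HER2", "GFP", "EGFR", "GFP", "KRAS"])

# A toy gene (log scale) whose expression differs only by batch.
gene = np.array([0.0, 0.0, 1.0, 1.0, 2.0, 2.0])

r2_batch = r_squared(gene, batch)
r2_condition = r_squared(gene, condition)
print(r2_batch, r2_condition)
```

In this toy gene, batch explains all of the variation, yet condition also appears to explain a share of it, because the treatments are nested within batches. This is why the two quantities need to be read together when comparing adjustment methods.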
The ‘one-step’ approach and SVA-seq were not included in the PCA because they do not generate adjusted data after batch correction. For RUV-seq, unlike in the simulation studies, we do not know which genes are appropriate negative controls. Therefore, we used the RUVs method, which is more robust to the choice of negative control genes than RUVg (3). We identified the least differentially expressed (DE) genes within each batch for the three activated pathways (FDR > 0.95), and took the genes that overlapped across pathways as the negative controls.
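The negative-control selection described above amounts to a threshold followed by a set intersection. The following is a minimal sketch, assuming the per-batch differential expression results are available as gene-to-FDR mappings; the gene names, FDR values, and function names are hypothetical placeholders.

```python
def least_de_genes(fdr_by_gene, threshold=0.95):
    """Genes whose FDR exceeds the threshold, i.e. the least DE genes."""
    return {gene for gene, fdr in fdr_by_gene.items() if fdr > threshold}


def negative_controls(fdr_tables, threshold=0.95):
    """Genes that are least DE in every batch (intersection across batches)."""
    gene_sets = [least_de_genes(t, threshold) for t in fdr_tables.values()]
    if not gene_sets:
        return []
    return sorted(set.intersection(*gene_sets))


# Hypothetical FDR values from a treated-vs-GFP comparison within each batch.
fdr_tables = {
    "HER2": {"geneA": 0.99, "geneB": 0.97, "geneC": 0.01, "geneD": 0.96},
    "EGFR": {"geneA": 0.98, "geneB": 0.40, "geneC": 0.02, "geneD": 0.99},
    "KRAS": {"geneA": 0.96, "geneB": 0.99, "geneC": 0.97, "geneD": 0.97},
}

controls = negative_controls(fdr_tables)
print(controls)
```

Taking the intersection is conservative: a gene is kept only if it shows no evidence of differential expression under any of the three pathway activations.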