β values, or the proportion of CpGs at a given site that were methylated, were calculated for every sample at each CpG site. Observations with detection p-values > 0.05 were set to missing, and any CpG site with missing data was omitted from the analysis. The ComBat function (8), included in the R package SVA (23), was used to adjust for the chip on which samples were run. This method employs a parametric empirical Bayes framework to adjust data for batch effects and is particularly robust for small sample sizes. To determine whether ComBat effectively removed batch effects, principal component analysis (PCA) was used to identify the top five principal components (PCs) in the pre- and post-ComBat β values for Runs One and Two. The association between each PC and 1) the chip on which samples were run and 2) “low” or “high” As group was then assessed using linear regression. In addition, Spearman’s correlation coefficients between β values in Runs One and Two were calculated for each CpG site, pre- and post-ComBat; we also calculated pre-ComBat, between-run correlations stratified by Run One chip.
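As an illustration of the site filtering and between-run correlation steps described above, the following minimal Python sketch may be helpful. It is not the code used in this study, which was run in R; the function names (`drop_failed_sites`, `average_ranks`, `spearman`) and the dictionary layout (CpG site mapped to a list of per-sample values) are assumptions made for this example only.

```python
from math import sqrt


def drop_failed_sites(betas, det_p, threshold=0.05):
    """Keep only CpG sites where every sample has detection p <= threshold.

    betas and det_p are hypothetical dicts: site name -> list of per-sample values.
    A site with any failed observation is dropped entirely.
    """
    return {
        site: values
        for site, values in betas.items()
        if all(p <= threshold for p in det_p[site])
    }


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (invented for illustration): two sites, four samples per run.
run_one = {"site_a": [0.10, 0.40, 0.80, 0.60], "site_b": [0.20, 0.30, 0.50, 0.70]}
run_two = {"site_a": [0.15, 0.35, 0.90, 0.70], "site_b": [0.25, 0.35, 0.45, 0.75]}
det_p = {"site_a": [0.01, 0.02, 0.01, 0.03], "site_b": [0.01, 0.20, 0.01, 0.01]}

kept = drop_failed_sites(run_one, det_p)
between_run = {site: spearman(kept[site], run_two[site]) for site in kept}
```

Here `site_b` is removed because one sample has a detection p-value above 0.05, and the between-run correlation is computed only for the remaining site. In practice the same calculation would be repeated on pre- and post-ComBat β values.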
To investigate whether As exposure status was associated with differential DNA methylation in Runs One or Two, β values were transformed into M values, log2(β/(1−β)), prior to statistical analysis (24). SVA was used to calculate F-statistics and their associated p-values by comparing two nested linear models. The larger model contained As group, coded as a factor variable, as well as any other variables of interest, while the smaller model contained only an intercept and the non-As variables of interest. Q-values were then calculated, also in SVA. In this way, the number of differentially methylated sites (q < 0.05) in each of Runs One and Two was determined, both before and after ComBat was used to adjust for the chips on which samples were run. A clustering analysis was performed on the Run One data, using pre-ComBat β values and the heatmap function in R. The top 100 differentially methylated sites, ranked by q-value, were included in the analysis. Color-coding for both exposure group and chip was implemented. The same process was performed for Run Two data, after adjusting for chip, using ComBat, and for two potential confounders: betelnut use and land ownership.
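The M-value transformation and the nested-model comparison can be sketched as follows. This is a simplified illustration in Python rather than the SVA routine used in the study: it assumes no additional covariates, so the smaller model reduces to an intercept only and the larger model to an intercept plus As group. The function names `m_value` and `nested_f_statistic` are hypothetical.

```python
from math import log2


def m_value(beta):
    """Transform a β value (strictly between 0 and 1) into an M value."""
    return log2(beta / (1.0 - beta))


def nested_f_statistic(m_values, groups):
    """F statistic comparing an intercept-only model with intercept + group.

    Simplifying assumption: there are no other covariates, so the residual
    sum of squares of each model can be computed from means alone.
    """
    n = len(m_values)

    # Smaller model: every sample is fitted by the grand mean.
    grand_mean = sum(m_values) / n
    rss_small = sum((m - grand_mean) ** 2 for m in m_values)

    # Larger model: every sample is fitted by the mean of its own group.
    levels = sorted(set(groups))
    rss_large = 0.0
    for level in levels:
        members = [m for m, g in zip(m_values, groups) if g == level]
        group_mean = sum(members) / len(members)
        rss_large += sum((m - group_mean) ** 2 for m in members)

    df_numerator = len(levels) - 1    # extra parameters in the larger model
    df_denominator = n - len(levels)  # residual degrees of freedom
    return ((rss_small - rss_large) / df_numerator) / (rss_large / df_denominator)


# Toy data (invented for illustration): one CpG site, six samples.
betas = [0.20, 0.25, 0.30, 0.60, 0.65, 0.70]
as_group = ["low", "low", "low", "high", "high", "high"]

site_m_values = [m_value(b) for b in betas]
f_stat = nested_f_statistic(site_m_values, as_group)
```

The corresponding p-value would be obtained from an F distribution with the numerator and denominator degrees of freedom shown above, and q-values would then be computed across all CpG sites; both steps are omitted here. With additional covariates, each model would instead be fitted by least squares and the same ratio formed from the two residual sums of squares.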