All statistical analyses were performed in the R programming language. Gene expression data were processed and normalized with the Bioconductor affy package, using the Robust Multichip Average (RMA) method[5] for single-channel Affymetrix chips. All 22,283 probe sets, summarized by the RMA measure, were used in the class comparison analyses.
Average linkage hierarchical clustering of samples was based on one minus Pearson correlation as the dissimilarity metric.
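As an illustration of the clustering described above, the following is a minimal sketch in Python (the study itself used R). The function names and the three toy sample profiles are hypothetical and are not taken from the study; the code only shows how one minus the Pearson correlation can serve as the dissimilarity in average-linkage agglomerative clustering.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_linkage(profiles):
    """Cluster samples; return the merges as (cluster_a, cluster_b, distance).

    profiles: dict mapping a sample name to its list of expression values.
    """
    # Pairwise dissimilarity between samples: 1 - Pearson correlation.
    d = {frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
         for a, b in combinations(profiles, 2)}

    def dist(c1, c2):
        # Average linkage: mean dissimilarity over all cross-cluster pairs.
        total = sum(d[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters with the smallest average dissimilarity.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: dist(*p))
        merges.append((c1, c2, dist(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical toy data: S1 and S2 rise together, S3 runs the opposite way.
toy_profiles = {
    "S1": [1.0, 2.0, 3.0, 4.0],
    "S2": [2.0, 4.0, 6.0, 8.5],
    "S3": [4.0, 3.0, 2.0, 1.0],
}
merges = average_linkage(toy_profiles)
```

In this toy example the two positively correlated samples are merged first, and the anti-correlated sample joins last at a dissimilarity close to 2, the maximum possible value of this metric.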
An ANOVA adjusting for sex was used to test whether genes were differentially expressed between smoking groups (C/N and F/N), between tumor tissue and non-tumor tissue (T/NT), or by pack years of cigarette smoking. Further analyses that adjusted for tumor grade, or that excluded the 6 subjects with emphysema or chronic bronchitis or the 3 subjects who received chemotherapy prior to the study, gave essentially unaltered results. For analyses including paired tissues (T/NT tissue samples from the same subjects), a linear mixed effects model was used to account for intra-person correlation.
To limit false positive findings, genes were considered statistically significant if their p-values were below the stringent threshold of 0.001. Under the null hypothesis of no difference in expression profiles, and with 22,283 probe sets analyzed, the expected number of false positive findings arising by chance is 22,283 × 0.001 ≈ 22.3, i.e. ≤23. We used the Benjamini-Hochberg[2] procedure to calculate the False Discovery Rate (FDR). We further restricted significant genes to those showing at least a 1.5-fold ratio of geometric means of expression between the two groups. Gene selection based on p<0.001 (two-sided) and fold-change >1.5 is referred to as the “stringent criteria”.
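The selection steps above can be sketched in Python as follows (the study itself used R). This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, expression values are assumed to be on the log2 scale that RMA produces, and treating a ratio below 1/1.5 (down-regulation) as meeting the fold-change cutoff is an assumption the text does not spell out.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def fold_change(log2_group_a, log2_group_b):
    """Ratio of geometric means, computed from log2-scale expression values.

    The difference of mean log2 values equals the log2 of the ratio of
    geometric means on the original scale.
    """
    mean_a = sum(log2_group_a) / len(log2_group_a)
    mean_b = sum(log2_group_b) / len(log2_group_b)
    return 2.0 ** (mean_a - mean_b)


def passes_stringent_criteria(p_value, fc, alpha=0.001, min_fc=1.5):
    """Apply p < alpha and fold-change > min_fc in either direction.

    Assumption: down-regulation (fc < 1/min_fc) also counts as a change.
    """
    return p_value < alpha and max(fc, 1.0 / fc) > min_fc


# Expected number of false positives under the global null hypothesis.
n_probe_sets = 22283
expected_false_positives = n_probe_sets * 0.001  # about 22.3

# Hypothetical p-values, for illustration only.
example_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

With four tests, each sorted p-value is scaled by 4 divided by its rank, and a running minimum taken from the largest p-value downward keeps the adjusted values monotone.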
The Cox Proportional Hazards model[6] was used to estimate the effect of gene expression changes in C/N on survival from lung cancer in smokers. Of the 74 subjects included in this study (all stages), 34 (22 smokers) were alive and 40 (32 smokers) were deceased as of May 2007. Among the deceased subjects, 36 died of lung cancer. The remaining 4 (2 smokers) died of other cancers and were censored at the time of death in the analysis. The time from lung cancer to death or date of last follow-up ranged from 28 days to 5.0 years for the deceased subjects, and from 3.7 to 5.7 years for the subjects alive in May 2007. The relative risk associated with gene expression was defined as the hazard ratio for a one standard deviation change in expression. Analyses were adjusted for stage, sex, and smoking. Age was similarly distributed across the groups and was not adjusted for in the analysis.
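To make the hazard ratio definition concrete, here is a minimal Python sketch (the study itself used R). It assumes a Cox coefficient has already been estimated per unit of expression; the function name, the coefficient, and the expression values are hypothetical and are not results from the study. Rescaling the coefficient by the standard deviation is equivalent to fitting the model on standardized expression.

```python
from math import exp
from statistics import stdev


def hazard_ratio_per_sd(beta_per_unit, expression_values):
    """Hazard ratio for a one standard deviation increase in expression.

    beta_per_unit: Cox log-hazard coefficient per unit of expression
                   (assumed to be estimated elsewhere).
    expression_values: the expression values of that gene across subjects.
    """
    sd = stdev(expression_values)  # sample standard deviation
    return exp(beta_per_unit * sd)


# Hypothetical inputs, for illustration only.
example_hr = hazard_ratio_per_sd(0.4, [6.0, 7.0, 8.0, 9.0, 10.0])
```

A coefficient of zero gives a hazard ratio of exactly 1, meaning no association between expression and the hazard.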