We curated data from eight publicly available microarray studies of lung cancer (Supplementary Table S1). For each study, we used the provided annotation of each sample's cancer subtype, along with any provided information on patient sex, age and smoking status. For GSE11969, smoking history was converted from the Brinkman index (number of cigarettes per day multiplied by number of years of smoking) to either ‘current’ (if the Brinkman index was greater than zero) or ‘never’ (otherwise).

For visualizing the samples using t-SNE, we used samples corresponding to cancer subtypes that were present in at least two studies. For our meta-analysis, we used only the samples in each data set that were histologically defined as adenocarcinoma (AD), squamous cell carcinoma (SQ), small cell lung carcinoma (SCLC), or carcinoid (CAR). The discovery data sets were selected to include a variety of microarray platforms and to have a sufficient number of samples of SCLC and CAR. Because the Bhattacharjee data set is very heavily biased toward AD, but also has samples from the other subtypes, we included only 60 of its AD samples. Altogether, the merged discovery data comprised 639 samples and the 7200 genes that were present in all five discovery data sets.

We used α = 0.9 for the elastic net penalty, where α = 0 corresponds to ridge (L2-norm penalty) and α = 1 corresponds to lasso (L1-norm penalty). Lower values of α led to a classifier with more genes, but with identical performance. We always set glmnet's intercept option to true, although setting it to false did not appreciably affect the results. When calculating accuracy, the predicted class for each sample was taken to be the class with the highest probability.
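Several of the steps described above are simple enough to sketch in a few lines. The following Python is an illustration only, not the code used in the study (which relied on glmnet). All function names are hypothetical. The penalty function assumes the standard glmnet parameterization, λ[(1 − α)/2 · ‖β‖₂² + α · ‖β‖₁], and the Brinkman conversion assumes the index is available as a number; missing values are not handled.

```python
from typing import Dict, Iterable, Sequence, Set


def smoking_status(brinkman_index: float) -> str:
    """Map a Brinkman index to 'current' (> 0) or 'never' (otherwise)."""
    return "current" if brinkman_index > 0 else "never"


def common_genes(gene_sets: Iterable[Set[str]]) -> Set[str]:
    """Return the genes that are present in every data set."""
    sets = list(gene_sets)
    return set.intersection(*sets) if sets else set()


def elastic_net_penalty(beta: Sequence[float], alpha: float, lam: float = 1.0) -> float:
    """Assumed glmnet-style penalty.

    alpha = 0 gives a pure ridge (L2) penalty, alpha = 1 a pure lasso (L1) penalty.
    """
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * ((1.0 - alpha) / 2.0 * l2_squared + alpha * l1)


def predict_class(probabilities: Dict[str, float]) -> str:
    """Pick the class with the highest predicted probability."""
    return max(probabilities, key=probabilities.get)


def accuracy(true_labels: Sequence[str], probability_rows: Sequence[Dict[str, float]]) -> float:
    """Fraction of samples whose highest-probability class matches the true label."""
    predicted = [predict_class(row) for row in probability_rows]
    correct = sum(t == p for t, p in zip(true_labels, predicted))
    return correct / len(true_labels)
```

For example, `elastic_net_penalty(beta, alpha=0.9)` weights the L1 term by 0.9 and the halved squared L2 term by 0.1, which is why α = 0.9 behaves mostly like lasso while retaining a small ridge component.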