In RNA-Seq analysis, the q value is a P value adjusted to control the false discovery rate (FDR). A P value cutoff of 0.05 implies that 5% of all tests are expected to be false positives. An FDR-adjusted P value cutoff of 0.05 implies that 5% of the tests called statistically significant are expected to be false positives. FDR is therefore the more stringent criterion, and we relied mainly on FDR to identify differentially expressed genes (DEGs). To define DEGs, we used a stringent statistical threshold of fold change (FC) ≥2 and FDR<0.05, which generated lists small enough for us to manually curate and classify each DEG in each cell type into non-redundant functional categories. Using this threshold, we identified a consensus of 853 DEGs upregulated in basal cells and 940 DEGs upregulated in luminal cells (Supplementary Data 1). Notably, to avoid the misunderstanding that genes absent from the ‘stringent’ lists are not DEGs, we also listed genes that passed a looser but still statistically significant cutoff (that is, FC≥2 and P<0.05) in Supplementary Data 1. This latter cutoff yielded more DEGs in the basal (n=1,432) and luminal (n=1,548) cell populations (Supplementary Data 1). For example, FGFR3 (Fig. 3a) and some Pol I complex subunits (Fig. 4e; for example, POLR1B (P=0.006, FDR=0.069), POLR1C (P=0.006, FDR=0.069), NIP7 (P=0.005, FDR=0.060) and ESF1 (P=0.006, FDR=0.063)) were not in the list with FDR<0.05, but were in the list with P<0.05. For Fig. 3a, we chose FGFR3 (P=0.006, FDR=0.07) for demonstration because it was more abundant than the other differentially expressed FGFRs (for example, mean FPKM in basal cells: FGFR3=11 versus FGFR4=1), although its FDR was slightly above the stringent cutoff of 0.05. To obtain more reliable and manageable results, we mainly used the shorter, stringent DEG lists for bioinformatics analysis.
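The two-tier cutoff described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in this study: the text does not state which FDR procedure was applied, so the Benjamini-Hochberg adjustment in `bh_adjust` is an assumption, and the function names, the tier labels and the FC values passed in the usage note are hypothetical.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted P values (q values) in input order.

    Assumption: the source does not name its FDR procedure; BH is used here
    only because it is a common choice.
    """
    m = len(pvalues)
    # Indices of the P values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that the adjusted values
    # never increase as the raw P value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def classify_deg(fold_change, p_value, fdr, min_fc=2.0, alpha=0.05):
    """Assign a gene to the stringent list, the relaxed list, or neither.

    fold_change is taken as the ratio in the direction of upregulation
    (an assumption made for this sketch).
    """
    if fold_change < min_fc:
        return "not DEG"
    if fdr < alpha:
        return "stringent"  # FC >= 2 and FDR < 0.05
    if p_value < alpha:
        return "relaxed"    # FC >= 2 and P < 0.05 only
    return "not DEG"
```

With the P and FDR values reported above for POLR1B (P=0.006, FDR=0.069) and a hypothetical FC of 2.5, `classify_deg(2.5, 0.006, 0.069)` returns `"relaxed"`, matching its presence in the P<0.05 list only.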
For Fig. 1i, we identified the top 50 putative marker genes specific for each lineage, as inferred from the transcriptomes, based on both relative differential expression (FC) and absolute expression levels (normalized read counts). To increase the confidence of this selection, we restricted the search to genes in the stringent DEG lists. Genes showing high RNA expression (normalized read counts>300) in both cell types, regardless of the FC, were excluded because of the high probability of protein expression in both cell types. Likewise, genes showing a high FC between the two cell types but minimal RNA expression in either cell type (that is, normalized read counts<300, indicating a lower probability of robust protein expression) were also eliminated. Note that the cutoff of 300 normalized read counts (quite high) is arbitrary and was chosen to increase the reliability of this selection. Using these criteria, we identified >100 genes unique to each cell type, and the top 50 are shown in Fig. 1i. Notably, FGFR3 is not in the top 50, but we included it in Fig. 1i owing to the experimental data and for the reasons discussed above.
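The marker selection can be illustrated with a minimal sketch. It rests on one reading of the criteria above: a gene is kept for a lineage when its normalized read count reaches the cutoff in that lineage and stays below it in the other. Ranking the survivors by FC is also an assumption, since the text does not state how the top 50 were ordered. The `Gene` record, the function name and the input genes are hypothetical, and the input is assumed to be already restricted to the stringent DEG list.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    basal: float        # normalized read count in basal cells
    luminal: float      # normalized read count in luminal cells
    fold_change: float  # ratio of the higher to the lower expression


def select_markers(genes, lineage, cutoff=300.0, top_n=50):
    """Return up to top_n putative marker gene names for one lineage.

    Assumes `genes` already contains only stringent DEGs
    (FC >= 2 and FDR < 0.05).
    """
    kept = []
    for g in genes:
        if lineage == "basal":
            own, other = g.basal, g.luminal
        else:
            own, other = g.luminal, g.basal
        # Drops genes that are high in both cell types, genes that are low
        # in both, and genes that are high only in the other lineage.
        if own >= cutoff and other < cutoff:
            kept.append(g)
    # Assumed ordering: largest fold change first.
    kept.sort(key=lambda g: g.fold_change, reverse=True)
    return [g.name for g in kept[:top_n]]
```

Under this reading, a gene with counts of 1,200 and 500 is dropped despite its FC of 2.4, because it is abundant in both cell types, and a gene with counts of 40 and 5 is dropped despite its FC of 8, because it is scarce in both.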