Data for the Atlas are selected from the ArrayExpress Archive according to the criteria outlined earlier. Because we currently use only microarray data, our first consideration is whether the array annotation is sufficient to map the array design elements to existing gene identifiers. We use two routes for this mapping: preferentially, we map array probe sequences to Ensembl genomes (15); otherwise, we attempt to map the design element annotation identifiers to gene annotation in the UniProt database (16). Where re-annotation fails, experiments performed on such arrays cannot be included in the Atlas. The array re-annotation pipeline will be released as a software package, described and published separately (Sarkans et al., in preparation).
Experiments in the ArrayExpress Archive are annotated as ‘suitable for Atlas’ when they are performed on well-annotated arrays, have high MIAME scores (2,17), meet the EF/EFV annotation and sufficient-replication criteria (as well as some other technical criteria not described here) and include normalized data. When all basic criteria are satisfied, experiment selection for the Atlas is guided by the quality of annotation, the use of standard platforms and large sample sizes, without preference for any biological condition. Recently, we started to produce themed Atlas data releases, e.g. species oriented or addressing a specific research domain, or by curating user-requested studies. Experiments selected for the Atlas are then exported from the Archive. The submitter's normalized data are used; we do not perform any renormalization. Prior to loading into the Atlas, annotations are harmonized, experimental descriptions are checked for consistency and non-standard terms are standardized. Mappings to EFO are added where the required term is present in the ontology. If a term is not in EFO, we examine source ontologies and provide a term name, a definition and mappings to external ontologies. The term is then placed in the EFO hierarchy, which is optimized for the Atlas visualization.
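The basic eligibility check described above can be pictured as a simple conjunction of criteria. The following Python sketch is illustrative only: the record fields, the numeric thresholds (`MIN_MIAME_SCORE`, `MIN_REPLICATES`) and the accessions are hypothetical, since the text does not give numeric cut-offs or describe the Archive's internal data model.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """Hypothetical record; field names are assumptions, not the Archive schema."""
    accession: str
    array_reannotated: bool    # design elements mapped to gene identifiers
    miame_score: int           # assumed integer scale, higher is better
    has_ef_annotation: bool    # EF/EFV annotation present
    min_replicates: int        # smallest replicate count across EFVs
    has_normalized_data: bool  # submitter-normalized data available


# Assumed thresholds for illustration; the source does not state them.
MIN_MIAME_SCORE = 4
MIN_REPLICATES = 3


def suitable_for_atlas(exp: Experiment) -> bool:
    """Return True only if every basic criterion is satisfied."""
    return (
        exp.array_reannotated
        and exp.miame_score >= MIN_MIAME_SCORE
        and exp.has_ef_annotation
        and exp.min_replicates >= MIN_REPLICATES
        and exp.has_normalized_data
    )


# Made-up example records.
experiments = [
    Experiment("E-EXAMPLE-1", True, 5, True, 3, True),
    Experiment("E-EXAMPLE-2", False, 5, True, 4, True),  # array not re-annotated
    Experiment("E-EXAMPLE-3", True, 5, True, 2, True),   # too few replicates
]

selected = [e.accession for e in experiments if suitable_for_atlas(e)]
print(selected)
```

Only the first record passes all checks. Ranking among eligible experiments (annotation quality, platform, sample size) would be a separate step and is not modelled here.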
Once data are loaded, the statistical computations described in the previous section are performed: for each new experiment, a P-value is computed for each gene under each EF and EFV.
Currently, the Atlas contains data from nine species. Table 1 shows the number of assays and the number of studies (experiments) included from each. Together, the experiments included in the Atlas have more than 40 different EFs, covering over 4500 different EFVs. Table 2 lists the most frequently studied EFs (at least 50 experiments per factor), with the number of EFVs and studies for each.

Table 1. Number of studies and assays for each species in the Atlas

Species                      Assays    Studies
Homo sapiens                 13 703    410
Mus musculus                 7539      373
Rattus norvegicus            4858      133
Arabidopsis thaliana         1607      88
Saccharomyces cerevisiae     813       43
Drosophila melanogaster      790       40
Schizosaccharomyces pombe    458       19
Danio rerio                  214       13
Caenorhabditis elegans       166       5
Total                        30 148    1124
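As a quick consistency check on Table 1, the per-species counts can be summed and compared with the reported totals. This is a minimal sketch with the values transcribed from the table; the variable names are ours.

```python
# (assays, studies) per species, transcribed from Table 1.
table1 = {
    "Homo sapiens": (13703, 410),
    "Mus musculus": (7539, 373),
    "Rattus norvegicus": (4858, 133),
    "Arabidopsis thaliana": (1607, 88),
    "Saccharomyces cerevisiae": (813, 43),
    "Drosophila melanogaster": (790, 40),
    "Schizosaccharomyces pombe": (458, 19),
    "Danio rerio": (214, 13),
    "Caenorhabditis elegans": (166, 5),
}

total_assays = sum(assays for assays, _ in table1.values())
total_studies = sum(studies for _, studies in table1.values())

# Both sums should match the 'Total' row of the table.
assert total_assays == 30148
assert total_studies == 1124

# Share of all assays contributed by the three mammalian species.
mammals = ("Homo sapiens", "Mus musculus", "Rattus norvegicus")
mammal_share = sum(table1[s][0] for s in mammals) / total_assays
print(f"{mammal_share:.1%} of assays are from human, mouse and rat")
```

The sums agree with the reported totals, and the three mammalian species together account for the large majority of assays.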

Table 2. Most frequently used EFs and the number of EFVs and studies for each factor

EFs                   EFVs    Studies
Genotype              389     211
Compound treatment    425     196
Disease state         214     137
Organism part         267     98
Cell type             164     61
Growth condition      122     61
Strain or line        227     51

The method used in the Gene Expression Atlas analytics allows us to examine trends in differential gene expression across all Atlas data. Figure 5A shows the distribution of the proportion of differentially expressed genes across all experiments. In approximately 400 experiments (out of over 1000), fewer than 10% of all genes show differential expression; the mean proportion of genes differentially expressed in an experiment, according to our FDR criteria, is 25%. Further, when we examine the number of differentially expressed genes per factor (Figure 5B), we observe that the numbers are highest for the factors ‘observation’, ‘histology’, ‘cell line’, ‘generation’ and ‘organism part’. It appears that, broadly and across species, transcriptional activity is strongly driven by its context: by tissue (‘histology’, ‘organism part’ and, by extension, ‘cell line’), followed by developmental stage and then cell type, whereas the main extrinsic drivers of transcriptional activity, such as xenobiotic responses (‘compound treatment’) and disease states, contribute to differential expression to a smaller extent. We also observe that the number of differentially expressed genes is largely independent of the number of EFVs (the median factor value count is around 3 EFVs).
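To make the quantity plotted in Figure 5A concrete, the sketch below computes the proportion of differentially expressed genes in one experiment from per-gene P-values. It is an illustration under stated assumptions, not the Atlas implementation: the Benjamini-Hochberg adjustment and the 0.05 threshold are stand-ins for "our FDR criteria", which are defined in an earlier section, and the P-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def proportion_de(pvalues, fdr=0.05):
    """Fraction of genes called differentially expressed at the given FDR.

    The threshold of 0.05 is an assumption made for this example.
    """
    adjusted = benjamini_hochberg(pvalues)
    return sum(q <= fdr for q in adjusted) / len(adjusted)


# Made-up per-gene P-values for a single experiment.
pvalues = [0.001, 0.008, 0.039, 0.041, 0.042,
           0.060, 0.074, 0.205, 0.212, 0.216]

print(proportion_de(pvalues))
```

In this toy example only the two smallest P-values survive adjustment, so two of the ten genes are called differentially expressed. Repeating the calculation per experiment would yield the kind of distribution summarized in Figure 5A.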

Figure 5. Distributions of differentially expressed genes over (A) experiments and (B) EFs. Error bars in (B) mark the 25% and 75% quantiles of the differentially expressed gene count for each EF.