We create sets of promoter sequences for each of our key species, E.coli, S.cerevisiae and H.sapiens. Then, for each key species, we identify the orthologous genes in each of three related species and construct three additional sets of promoter sequences. Critically, in the related-species promoter sets, we label each promoter with the name of its orthologous gene in the key species. This allows us to use the GO map for the key species when we compute the association scores for the related species. Our related species for E.coli (K12) are E.coli (CFT073), Salmonella typhimurium and Shigella flexneri 2a. Our S.cerevisiae related species are S.paradoxus, S.mikatae and S.bayanus. For H.sapiens, our related species are Mus musculus, Canis familiaris and Equus caballus.
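The relabeling step can be illustrated with a minimal sketch. It assumes that the promoters and the ortholog map are held in plain dictionaries; the function name `relabel_promoters` and all gene identifiers and sequences below are hypothetical and are not taken from our pipeline.

```python
def relabel_promoters(promoters, ortholog_to_key):
    """Rename related-species promoters by their key-species ortholog.

    promoters:       dict, related-species gene ID -> promoter sequence
    ortholog_to_key: dict, related-species gene ID -> key-species gene name
    Promoters whose gene has no mapped ortholog are dropped.
    """
    relabeled = {}
    for gene_id, seq in promoters.items():
        key_name = ortholog_to_key.get(gene_id)
        if key_name is not None:
            relabeled[key_name] = seq
    return relabeled


# Hypothetical example data (IDs and sequences are made up).
spar_promoters = {
    "spar_0001": "ACGTTTGA",
    "spar_0002": "GGCATTAC",
    "spar_0003": "TTTTAAAA",  # no ortholog in the map below
}
spar_to_scer = {"spar_0001": "YAL001C", "spar_0002": "YAL002W"}

relabeled = relabel_promoters(spar_promoters, spar_to_scer)
```

Because the relabeled set is keyed by key-species gene names, a GO map built for the key species can be looked up directly for the related-species promoters.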
Our definition of a promoter depends on the key species. For S.cerevisiae and H.sapiens, we define the promoter to be the upstream region [relative to the transcription start site (TSS) of a gene]. Because prokaryotes organize their genes into transcriptional units and operons that are transcribed together, for E.coli we define the promoter to be the sequence upstream of an operon, rather than of an individual gene. We take operon information for E.coli K12 from RegulonDB v6.2 (Gama-Castro et al., 2008).
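As an illustration of the operon-level definition, the following sketch computes the coordinates of the region upstream of an operon. It is written under several assumptions that are not part of the text above: coordinates are 1-based and inclusive, all genes of an operon lie on one strand, and the start of the first transcribed gene stands in for the operon start. The names `Gene` and `operon_promoter_interval` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int   # 1-based, inclusive; start < end on either strand
    end: int
    strand: str  # "+" or "-"


def operon_promoter_interval(genes, length):
    """Return (start, end, strand) of the region upstream of an operon.

    Assumes every gene in the operon is on the same strand. The interval
    is clipped at position 1 on the plus strand; clipping at the end of
    the chromosome on the minus strand is left to the caller.
    """
    strand = genes[0].strand
    if any(g.strand != strand for g in genes):
        raise ValueError("all genes in an operon must share a strand")
    if strand == "+":
        first = min(g.start for g in genes)  # start of first transcribed gene
        return (max(1, first - length), first - 1, "+")
    last = max(g.end for g in genes)         # first transcribed gene on "-"
    return (last + 1, last + length, "-")


# Hypothetical two-gene operon on the plus strand.
operon = [Gene("geneA", 1000, 2000, "+"), Gene("geneB", 2010, 3000, "+")]
interval = operon_promoter_interval(operon, 200)
```

Only one promoter is produced per operon, so downstream genes such as `geneB` in this example do not receive a promoter of their own.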
To identify orthologous genes in species related to E.coli, we use the Enterobacter Genome Browser (http://engene.fli-leibniz.de/) to search for best pairwise BLAST hits to E.coli K12 genes. For simplicity, we assume that operons are conserved across these closely related species, i.e. that each operon contains the same genes in the same order. To identify orthologous genes in S.cerevisiae relatives, we use the mappings from Kellis et al. (2003). To identify genes orthologous to H.sapiens genes in related species, we use one-to-one ortholog gene maps obtained from BioMart (Smedley et al., 2009).
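The idea of a best pairwise hit can be sketched as follows. This is an illustration only and does not reproduce the procedure used by the genome browser: it assumes that hits are available as (query, subject, bit score) tuples and that the best hit is the one with the highest bit score. The function name `best_hits` and the gene identifiers are hypothetical.

```python
def best_hits(hits):
    """Pick, for each query gene, the subject gene with the highest score.

    hits: iterable of (query, subject, bitscore) tuples.
    Returns a dict mapping query -> best-scoring subject. On ties the
    first hit seen is kept.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


# Hypothetical hits of key-species genes against one related genome.
hits = [
    ("b0001", "c0001", 95.0),
    ("b0001", "c0417", 40.2),
    ("b0002", "c0003", 210.0),
]
orthologs = best_hits(hits)
```

A stricter variant would require the hit to be reciprocal (best in both directions), which is closer in spirit to the one-to-one maps used for H.sapiens.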
To create the promoter sequence sets for E.coli, S.cerevisiae and their related species, we use the RSAT sequence extraction tool (Thomas-Chollier et al., 2008). We vary the size of the upstream region, and also whether it is allowed to overlap upstream open reading frames (ORFs). We refer to the promoters truncated at the nearest upstream ORF as the ‘intergenic’ set, and to the promoters that may overlap upstream ORFs as the ‘full’ set. For H.sapiens and related species, we define the promoter to be the 1000 bp upstream of the TSS, and extract these sequences using BioMart (Smedley et al., 2009).
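The difference between the ‘full’ and ‘intergenic’ sets can be made concrete with a small sketch. It is not the RSAT implementation; it assumes a plus-strand gene, 1-based inclusive coordinates and a known end position for the nearest upstream ORF. The function name `upstream_interval` and its parameters are hypothetical.

```python
def upstream_interval(gene_start, length, upstream_orf_end=None,
                      allow_overlap=True):
    """Return (start, end) of the upstream region of a plus-strand gene.

    With allow_overlap=True the region is always `length` bp long (clipped
    at position 1), as in the 'full' set. With allow_overlap=False it is
    truncated so that it begins just after the nearest upstream ORF, as in
    the 'intergenic' set.
    """
    start = max(1, gene_start - length)
    end = gene_start - 1
    if not allow_overlap and upstream_orf_end is not None:
        start = max(start, upstream_orf_end + 1)
    return (start, end)


# Hypothetical gene starting at 1000 with an upstream ORF ending at 800.
full = upstream_interval(1000, 500, upstream_orf_end=800)
intergenic = upstream_interval(1000, 500, upstream_orf_end=800,
                               allow_overlap=False)
```

In this example the full promoter spans 500 bp, whereas the intergenic promoter is shortened to the 199 bp that lie between the upstream ORF and the gene. When the upstream ORF is farther away than the requested length, the two definitions give the same interval.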