A preliminary set of orthologs was defined by identifying unique pairwise reciprocal best hits with at least 80% similarity (∼85% identity) in amino acid sequence and less than 20% difference in protein length. Orthology was analysed for every pair of E. coli/Shigella genomes. The core genome, consisting of the genes found in all strains of the species, was defined as the intersection of the pairwise lists.
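The reciprocal-best-hit filter described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Hit` record, the function names, the choice to measure length difference relative to the longer protein, and the choice to discard tied best hits (one reading of "unique") are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One protein alignment hit (hypothetical record layout)."""
    query: str
    subject: str
    similarity: float  # percent amino acid similarity
    query_len: int
    subject_len: int


def passes_thresholds(hit, min_similarity=80.0, max_len_diff=0.20):
    # Assumption: length difference is taken relative to the longer protein.
    longer = max(hit.query_len, hit.subject_len)
    len_diff = abs(hit.query_len - hit.subject_len) / longer
    return hit.similarity >= min_similarity and len_diff < max_len_diff


def best_hits(hits):
    """Keep the single most similar hit per query; tied queries are dropped."""
    best = {}
    tied = set()
    for h in hits:
        cur = best.get(h.query)
        if cur is None or h.similarity > cur.similarity:
            best[h.query] = h
            tied.discard(h.query)
        elif h.similarity == cur.similarity and h.subject != cur.subject:
            tied.add(h.query)
    return {q: h for q, h in best.items() if q not in tied}


def reciprocal_best_hits(hits_ab, hits_ba):
    """Return the set of (gene_in_A, gene_in_B) reciprocal best-hit pairs."""
    best_ab = best_hits([h for h in hits_ab if passes_thresholds(h)])
    best_ba = best_hits([h for h in hits_ba if passes_thresholds(h)])
    pairs = set()
    for a, h in best_ab.items():
        back = best_ba.get(h.subject)
        if back is not None and back.subject == a:
            pairs.add((a, h.subject))
    return pairs


# Toy data: a2/b2 is rejected because the protein lengths differ by 50%.
hits_ab = [
    Hit("a1", "b1", 95.0, 300, 310),
    Hit("a1", "b2", 85.0, 300, 305),
    Hit("a2", "b2", 90.0, 200, 100),
]
hits_ba = [
    Hit("b1", "a1", 95.0, 310, 300),
    Hit("b2", "a1", 85.0, 305, 300),
]
rbh_pairs = reciprocal_best_hits(hits_ab, hits_ba)
```

In this toy example only the a1/b1 pair survives: b2 points back to a1, but a1's best hit is b1, so the relationship is not reciprocal.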
For every pair of genomes, this list of persistent orthologs was then supplemented by taking conservation of gene order into account. Because (i) few rearrangements are observed at these short evolutionary distances and (ii) horizontal gene transfer is frequent, genes outside conserved blocks of synteny are likely to be xenologs or paralogs. Hence, to determine positional orthology, we combined the homology analysis (protein sequence similarity ≥80%, ≤20% difference in protein length) with the classification of these genes as either syntenic or nonsyntenic. As before, the analysis was made for every pair of E. coli/Shigella genomes. The definitive list of orthologs of the pan-genome was then defined as the union of the pairwise lists.
A syntenic block was defined as a set of consecutive pairs of genes in the core genome. Blocks of conserved gene order were obtained by comparing the positions of bidirectional best-hit pairs in the core genome, adopting a window size of one gap.
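One possible reading of this block definition is sketched below in Python. It is an illustration under stated assumptions, not the authors' procedure. Each ortholog pair is represented by the rank of the gene in each genome, "a window size of one gap" is interpreted as allowing at most one intervening gene between neighbouring pairs, and chromosome circularity is ignored. The function name and the `max_gap` parameter are hypothetical.

```python
def syntenic_blocks(pairs, max_gap=1):
    """Group ortholog pairs into blocks of conserved gene order.

    pairs: iterable of (pos_a, pos_b) gene ranks in genomes A and B.
    Assumption: neighbouring pairs belong to the same block when their ranks
    differ by at most max_gap + 1 in both genomes (either direction in B,
    so inverted segments still form blocks).
    """
    blocks = []
    current = []
    for pair in sorted(pairs):
        if current:
            prev = current[-1]
            step_a = pair[0] - prev[0]
            step_b = abs(pair[1] - prev[1])
            if 1 <= step_a <= max_gap + 1 and 1 <= step_b <= max_gap + 1:
                current.append(pair)
                continue
            blocks.append(current)
        current = [pair]
    if current:
        blocks.append(current)
    return blocks


# Toy data: three runs, the last one a singleton.
pairs = [(0, 0), (1, 1), (3, 2), (4, 10), (5, 9), (8, 20)]
blocks = syntenic_blocks(pairs)

# Pairs in blocks of two or more are treated as syntenic, singletons as nonsyntenic.
syntenic = {p for block in blocks if len(block) >= 2 for p in block}
nonsyntenic = {p for block in blocks if len(block) == 1 for p in block}
```

The sketch does not check that orientation stays consistent within a block. A stricter version would fix the direction in genome B after the second pair of each block.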
These lists were also used to compute gene accumulation curves in R, which describe the number of new genes and of genes in common as comparative genomes are added (Figure 1). The procedure was repeated 1000 times, randomly permuting the order in which genomes were added, to obtain medians and quartiles.
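The permutation procedure can be sketched as follows. The authors used R. This Python version is only an illustration, and it assumes each genome has already been reduced to a set of gene-family identifiers derived from the ortholog lists. The function name, the `seed` parameter, and the input layout are hypothetical.

```python
import random
import statistics


def accumulation_curves(genomes, n_permutations=1000, seed=0):
    """Return per-step (Q1, median, Q3) for new genes and for genes in common.

    genomes: dict mapping genome name -> set of gene-family identifiers.
    """
    rng = random.Random(seed)
    names = list(genomes)
    new_counts = [[] for _ in names]
    common_counts = [[] for _ in names]

    for _ in range(n_permutations):
        order = names[:]
        rng.shuffle(order)  # random genome insertion order
        seen = set()        # families found so far
        shared = None       # families present in every genome added so far
        for k, name in enumerate(order):
            genes = genomes[name]
            new_counts[k].append(len(genes - seen))
            seen |= genes
            shared = set(genes) if shared is None else shared & genes
            common_counts[k].append(len(shared))

    def summarise(samples):
        # statistics.quantiles with n=4 returns the three quartile cut points.
        q1, median, q3 = statistics.quantiles(samples, n=4)
        return (q1, median, q3)

    return (
        [summarise(s) for s in new_counts],
        [summarise(s) for s in common_counts],
    )


# Toy data: two families shared by all genomes, one private family each.
toy_genomes = {"g1": {1, 2, 3}, "g2": {1, 2, 4}, "g3": {1, 2, 5}}
new_curve, common_curve = accumulation_curves(toy_genomes, n_permutations=50, seed=1)
```

In this toy case the curves are the same for every insertion order: the first genome contributes three new families, each later genome contributes one, and the number of families in common falls from three to two.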