We began computation of putative panorthologs for each set of genomes using NCBI BLASTP (release 2.2.16) to analyze all genes in all genomes for sequence similarity. We kept for later processing all BLAST hits within an E-value threshold of 1. These hits include each gene's self hit. We stored the E-value, bit score and alignment length for each hit. When running BLASTP, we used default parameters except for setting the E-value threshold and for setting the maximum number of hits to keep.
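As an illustration of the bookkeeping described above, the following minimal sketch collects the E-value, bit score and alignment length for each hit. It assumes the BLAST results were written in the standard 12-column tab-delimited format, which the text does not state. The function name `parse_blast_tabular` and the choice to keep only the highest-scoring alignment per gene pair are likewise assumptions made for the example.

```python
def parse_blast_tabular(lines, max_evalue=1.0):
    """Collect (E-value, bit score, alignment length) per (query, subject) pair.

    Assumes 12-column tab-delimited BLAST output (an assumption, not stated
    in the text): query, subject, % identity, alignment length, mismatches,
    gap opens, q.start, q.end, s.start, s.end, E-value, bit score.
    """
    hits = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        aln_len = int(fields[3])
        evalue = float(fields[10])
        bit_score = float(fields[11])
        if evalue > max_evalue:
            continue  # outside the E-value threshold
        key = (query, subject)
        # Assumption: when a pair has several alignments, keep the best one.
        if key not in hits or bit_score > hits[key][1]:
            hits[key] = (evalue, bit_score, aln_len)
    return hits
```

Self hits such as (A, A) are retained by this sketch, since they are needed later to scale the bit scores.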
We next identified homologs as those gene pairs that had BLAST hits in both directions meeting a given scaled bit score threshold. We scaled each bit score by the bit score of the query gene's self hit. That is, scaledBitScore(A->B) = bitScore(A->B)/bitScore(A->A). This method has been used previously to identify conserved homologs among bacterial genomes and has been shown to be more stringent than criteria based solely on reciprocal best matches using E-values [17].
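The reciprocal test can be sketched as follows. This is an illustrative example rather than the code used in the study: the names `scaled_bit_score` and `find_homologs` are hypothetical, the input is assumed to be a dictionary mapping (query, subject) pairs to bit scores, and a hit is assumed to pass when its scaled score is greater than or equal to the threshold.

```python
def scaled_bit_score(bit_scores, query, subject):
    """Return bitScore(query->subject) / bitScore(query->query), or None."""
    self_score = bit_scores.get((query, query))
    hit_score = bit_scores.get((query, subject))
    if not self_score or hit_score is None:
        return None  # missing self hit or missing hit in this direction
    return hit_score / self_score


def find_homologs(bit_scores, threshold):
    """Return gene pairs whose hits pass the threshold in both directions."""
    homologs = set()
    for (a, b) in bit_scores:
        if a == b:
            continue  # self hits are used only for scaling
        forward = scaled_bit_score(bit_scores, a, b)
        reverse = scaled_bit_score(bit_scores, b, a)
        if forward is None or reverse is None:
            continue  # no hit in one of the two directions
        # Assumption: "meeting the threshold" means >= threshold.
        if forward >= threshold and reverse >= threshold:
            homologs.add(tuple(sorted((a, b))))
    return homologs
```

Because each direction is scaled by its own query's self hit, the two scaled scores for a pair can differ even when the raw bit scores are equal, so both directions must be checked separately.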
We then formed homolog families by placing two genes in the same family whenever they had been identified as homologs. Note that not every pair of genes in a family needs to have been identified as homologs. For example, if A and B are homologs, and B and C are homologs, then A and C will be in the same family even if A and C have not been identified as homologs. Finally, we identified the putative panorthologs as the genes from homolog families with exactly one gene from each genome. For each set of genomes, we computed the putative panorthologs while varying the scaled bit score threshold from 0.1 to 0.9 in increments of 0.1, and kept the largest set of panorthologs found.
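Family formation as described is a transitive grouping of homolog pairs, i.e. finding connected components. A minimal sketch using a union-find structure is shown below. The function names, the `genome_of` mapping, and the `homologs_at` callback (which returns the homolog pairs for a given threshold) are hypothetical names introduced for illustration, and breaking ties in favor of the lowest threshold is an assumption.

```python
from collections import defaultdict


def build_families(genes, homolog_pairs):
    """Group genes into families: connected components of the homolog graph."""
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for a, b in homolog_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two families

    families = defaultdict(set)
    for g in genes:
        families[find(g)].add(g)
    return list(families.values())


def panortholog_families(families, genome_of, genomes):
    """Keep families with exactly one gene from each genome."""
    genomes = set(genomes)
    return [
        fam for fam in families
        if len(fam) == len(genomes) and {genome_of[g] for g in fam} == genomes
    ]


def best_threshold(genes, genome_of, genomes, homologs_at):
    """Sweep thresholds 0.1..0.9 and keep the largest panortholog set.

    Assumption: ties are resolved in favor of the lowest threshold.
    """
    best = (None, [])
    for i in range(1, 10):
        threshold = i / 10
        families = build_families(genes, homologs_at(threshold))
        found = panortholog_families(families, genome_of, genomes)
        if len(found) > len(best[1]):
            best = (threshold, found)
    return best
```

In this sketch a family containing two genes from one genome and none from another is rejected, because both the family size and the set of represented genomes must match the genome set.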
The following scaled bit score thresholds were used for genome sets A–E depicted in Fig. 1, each followed by the number of putative panorthologs identified at that threshold: group A: 0.7, 4141; group B: 0.7, 3758; group C: 0.4, 2203; group D: 0.3, 902; group E: 0.2, 581. To produce groups D and E, the five Bordetella genomes were first analyzed by this method (0.5, 1592), as were the five Xanthomonas genomes (0.5, 2450). The intersections of these Bordetella and Xanthomonas panortholog sets with groups B and C were used to produce groups D and E, respectively.