We have designed this approach with whole proteome analyses in mind, so as a proof-of-concept, we applied the approach to the study of host adaptation in
Salmonella enterica, using
S. Enteritidis and
S. Gallinarum as a specific example. For a summary of the workflow of this portion of the analysis, see
Supplementary Figure 1. Genomes for
Salmonella Enteritidis str. P125109 (AM933172.1) and Salmonella Gallinarum str. 287/91 (AM933173.1) were retrieved. Custom gene models were constructed by searching each
S. Enteritidis protein coding gene against the Uniref90 database. Profiles were built using sequences showing 40% or greater sequence identity to the original query protein.
In order to assess whether models built from a small number of sequences performed well, we tested the performance of models using proteins from the humsavar database (
http://www.uniprot.org/docs/humsavar), which catalogs human polymorphisms and disease variants for a wide range of proteins which have different levels of representation in Uniprot90, and therefore result in profile models built from varying numbers of sequences. We built custom models for each protein in the database, then separated the models into classes based on both number of sequences and effective sequence number. We then computed AUC values for predictions made on variant data from each class. Results from this test can be found in
Supplementary Table S2. We saw no classes where custom models performed worse than Pfam models, so we did not filter models built from few sequences. We did, however, use Pfam models for those proteins with no homologs in Uniref90 rather than build a model from the query sequence, as we felt this could bias results towards the reference species.
To assess the quality of our analysis, we wished to compare our results to those of Nuccio and Bäumler (2014) (
link), so we used the ortholog calls provided in their supplementary material. Genes represented by a custom model were scored using the appropriate model (
n = 3154), all others were scored using Pfam domain models. hmmsearch was used for scoring. If hits to multiple Pfam models occurred, any overlapping hits were competed based on
E-value. Orthologs with incompatible Pfam domain architectures (
n = 32) were excluded from scoring, but counted as hypothetically attenuated coding sequences (HACs) in
Figure 2A if they involved a loss of domain in one serovar. We anticipate that the variance of delta-bitscores will increase with evolutionary distance, so rather than establish a fixed scoring cutoff for HACs, we identify loss-of-function mutations using an empirical distribution. We set a score threshold at which 2.5% of genes on the least dispersed side of the distribution would be called as HACs. If the two bacteria show an equal rate of protein function loss, this would result in 5% of orthologous proteins being called as HACs, however if one proteome shows a greater degree of functional degradation, this will result in a greater proportion of orthologs being classified as HACs.
Functional classification was performed using pathways from the KEGG database (Kanehisa
et al., 2016 (
link)). We grouped genes into four categories: those present in a pathway but with no ortholog in the other serovar; genes identified as hypothetically disrupted coding sequences (HDCs) by Nuccio and Bäumler (2014) (
link); genes identified by our DBS method as HACs, but not as HDCs; and finally genes not identified as non-functional by either method. dN/dS values were calculated using PAML (Yang, 1997 (
link)), and for the comparison of DBS and dN/dS, genes were filtered for those with dN > 0 and dS > 0.0001. Correlations between measures were computed using a Spearman’s rho statistic (R package cor).
In our investigation of multiple
S. enterica isolates, for all orthologous groups with a gene present in
S. Enteritidis, scores were collated, and if individual scores were significantly different to the median score for all isolates, we identified these proteins as HACs. Significant difference was calculated in a similar way to the pairwise comparisons, with the score corresponding to the most extreme 2.5% of delta-bitscores on the least dispersed side of an empirical distribution being used as the cutoff.