Output files from the Illumina MiSeq were first run through FastQC (Andrews et al. 2018) to check read quality. The paired-end reads were merged using PEAR (Stamatakis et al. 2014) set to a minimum assembly length of 150 base pairs, allowing for high quality scores at both ends of the sequence. Adapters were trimmed from the ends of the antibiotic resistance genes' coding sequences using Trimmomatic (Bolger et al. 2014). Enrich2 (Rubin et al. 2017) was used to count the frequency of each allele for use in calculating selection coefficients and associated statistical measures. We set Enrich2 to filter out any reads containing bases with a quality score below 20, bases marked as N, or mutations at more than one codon.
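The three read-level filters can be illustrated with a short Python sketch. This is not Enrich2 code or its API; the function names, the requirement that a read span the full coding sequence in frame, and the use of a per-base list of Phred scores are all assumptions made for illustration.

```python
def codon_differences(read, wild_type):
    """Return the indices of codons at which the read differs from wild type."""
    return [
        i // 3
        for i in range(0, len(wild_type), 3)
        if read[i:i + 3] != wild_type[i:i + 3]
    ]


def passes_filter(read, qualities, wild_type, min_quality=20):
    """Apply the three filters described in the text to a single merged read.

    Hypothetical helper: `qualities` is a list of per-base Phred scores.
    """
    # Assumption: only full-length, in-frame reads are compared to wild type.
    if len(read) != len(wild_type) or len(qualities) != len(read):
        return False
    # Filter 1: any base with a quality score below the threshold.
    if any(q < min_quality for q in qualities):
        return False
    # Filter 2: any base called as N.
    if "N" in read.upper():
        return False
    # Filter 3: mutations at more than one codon.
    return len(codon_differences(read.upper(), wild_type.upper())) <= 1
```

For example, with a hypothetical three-codon wild-type sequence `ATGGCTAAA`, a read `ATGGGTAAA` with all qualities at 30 is kept (one mutated codon), whereas `TTGGGTAAA` is discarded (two mutated codons).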
Fitness of an allele ($w_i$) was calculated from the enrichment of the synonyms of the wild-type gene, the enrichment of allele $i$, and the fold increase in the number of cells during the growth competition experiment ($r$) as described by Equation 1. We used the frequency of wild-type synonymous alleles as the reference instead of the frequency of wild type because wild-type synonyms occurred more frequently in the library and wild-type sequencing counts are more prone to being affected by the artifact of PCR template jumping during the preparation of barcoded amplicons for deep sequencing. Detailed derivations of the following equations (Equations 1–6) can be found in our previous work (Mehlhoff et al. 2020).
We calculate the variance in the fitness as
where the frequency of an allele ($f_i$) is calculated from the counts of that allele ($c_i$) and the total sequencing counts ($c_T$).
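As a minimal sketch of the frequency calculation, the Python below divides each allele's count by the total count, and pools the counts of wild-type synonyms to obtain the reference frequency. The function names and the example allele labels are hypothetical, and the sketch assumes the total is the sum over all alleles that passed filtering.

```python
def allele_frequencies(counts):
    """Return the frequency of each allele: its count over the total count."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sequencing counts")
    return {allele: c / total for allele, c in counts.items()}


def pooled_frequency(counts, alleles):
    """Return the combined frequency of a group of alleles.

    Hypothetical helper for pooling wild-type synonymous alleles
    into a single reference frequency.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sequencing counts")
    return sum(counts[a] for a in alleles) / total


# Hypothetical counts for two wild-type synonyms and one missense allele.
counts = {"wt_syn_1": 30, "wt_syn_2": 20, "A42G": 50}
freqs = allele_frequencies(counts)
reference = pooled_frequency(counts, ["wt_syn_1", "wt_syn_2"])
```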
From the variance in fitness, we calculated a 99% confidence interval. Additionally, we calculated a P-value using a two-tailed test. Details of the Z-score and P-value equations are available in Mehlhoff et al. (2020).
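The following Python sketch shows one common way to obtain a confidence interval and a two-tailed P-value from a fitness estimate and its variance. It assumes a normal approximation and a null hypothesis of wild-type fitness equal to 1; both are assumptions made here for illustration, and the exact Z-score definition should be taken from Mehlhoff et al. (2020).

```python
from math import erfc, sqrt
from statistics import NormalDist


def confidence_interval(fitness, variance, level=0.99):
    """Return (lower, upper) bounds assuming normally distributed error."""
    # Critical value for a two-sided interval, about 2.576 for 99%.
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z_crit * sqrt(variance)
    return fitness - half_width, fitness + half_width


def two_tailed_p(fitness, variance, null_fitness=1.0):
    """Return the two-tailed P-value for a difference from null_fitness.

    Assumption: the null value of 1.0 represents wild-type fitness.
    """
    z = (fitness - null_fitness) / sqrt(variance)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and stays accurate for large |z|.
    return erfc(abs(z) / sqrt(2))
```

With a hypothetical fitness of 1.196 and variance of 0.01, the Z-score is about 1.96 and the two-tailed P-value is about 0.05, so this allele would not meet either significance threshold used in this work.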
To correct for multiple testing (Storey and Tibshirani 2003), we estimated the number of false positives that would be included at P < 0.01 and P < 0.001 significance in our DMS datasets as described previously (Mehlhoff et al. 2020). For TEM-1, we estimated that our data would contain approximately 55.0 false positives on average at P < 0.01 significance and 5.6 false positives on average at P < 0.001 significance for a single replica (Mehlhoff et al. 2020). The corresponding values are 44.1 and 4.3 (CAT-I), 52.8 and 5.3 (NDM-1), and 33.8 and 3.4 (aadB) at P < 0.01 and P < 0.001 significance, respectively. We chose to report the frequency of mutations having fitness effects that met the P-value criteria in both replica experiments to limit the occurrence of false positives.
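The reasoning behind the two-replica criterion can be sketched in Python. The first function is a simplified estimate of the expected number of false positives (number of tests times the fraction of true nulls times the significance threshold); it is an assumption for illustration and not necessarily the procedure used in Mehlhoff et al. (2020). The second function keeps only mutations that meet the threshold in both replica experiments. All names and example values are hypothetical.

```python
def expected_false_positives(n_tests, alpha, null_fraction=1.0):
    """Simplified estimate: tests under the null, times the threshold.

    Assumption: null_fraction is an estimate of the proportion of
    mutations with no true fitness effect (1.0 is the conservative choice).
    """
    return n_tests * null_fraction * alpha


def significant_in_both(p_rep1, p_rep2, alpha):
    """Return mutations whose P-value is below alpha in both replicas."""
    shared = p_rep1.keys() & p_rep2.keys()
    return sorted(m for m in shared if p_rep1[m] < alpha and p_rep2[m] < alpha)


# Hypothetical P-values for three mutations in two replica experiments.
rep1 = {"A": 0.0005, "B": 0.02, "C": 0.004}
rep2 = {"A": 0.003, "B": 0.001, "C": 0.2}
hits = significant_in_both(rep1, rep2, alpha=0.01)
```

Here only mutation `A` is retained. Under the additional assumption that replicas are independent, a mutation with no true effect passes both tests with probability `alpha ** 2`, which is why requiring agreement sharply reduces the expected number of false positives.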
Mehlhoff, J. D., & Ostermeier, M. (2023). Genes Vary Greatly in Their Propensity for Collateral Fitness Effects of Mutations. Molecular Biology and Evolution, 40(3), msad038.