BUSTED[S] is a straightforward extension of BUSTED (Murrell et al. 2015 (link)). The nucleotide substitution process is modeled using the standard finite state continuous time Markov process approach of Muse and Gaut (1994) (link), with entries of the instantaneous rate matrix Q corresponding to substitutions between sense codons i and j denoted as
The θij (with ) are parameters governing nucleotide substitution biases. For example, and because we incorporate the standard nucleotide GTR model there are five identifiable θij parameters: , and , with . The position-specific equilibrium frequency of the target nucleotide of a substitution is ; for example, it is for the second-position change associated with . The and the stationary frequencies of codons under this model are estimated using the CF3 × 4 procedure (Kosakovsky Pond et al. 2010 (link)), adding nine parameters to the model. The ratio of nonsynonymous to synonymous substitution rates for site s along branch b is ωbs, and this ratio is modeled using a 3-bin general discrete distribution (GDD) with five estimated hyperparameters: , and . The procedure for efficient computation of the phylogenetic likelihood function for these models was described in Kosakovsky Pond et al. (2011) (link). The quantity αs is a site-specific synonymous substitution rate (no branch-to-branch variation is modeled) drawn from a separate 3-bin GDD. The mean of this distribution is constrained equal to one to maintain statistical identifiability, resulting in four estimated hyperparameters: , and , with c chosen to ensure that . Typical implementations, including ours, allow the number of α and ω rate categories to be separately adjusted by the user, for example, to minimize AICc or to optimize some other measure of model fit. The default setting of three categories generally provides a good balance between fit and performance when using this GDD approach for modeling. Our HyPhy implementation of BUSTED[S] will warn the user if there is evidence of model overfitting, such as the appearance of rate categories with very similar estimated rate values or very low frequencies.
The BUSTED[S] procedure for identifying positive selection is the likelihood ratio test comparing the full model described above to the constrained model formed when ω3 is set equal to 1 (i.e., no positively selected sites). Critical values of the test are derived from a mixture distribution of and . Note that this asymptotic statistic differs from the 3-component mixture used by Murrell et al. (2015) (link); the simulation studies performed in the current study suggest that this less conservative mixture is sufficient to maintain nominal Type I errors. Both BUSTED[S] and BUSTED analyses in the current work use the same mixture test statistic. BUSTED[S] reduces to BUSTED by setting , that is, by placing all the mass of the synonymous rate heterogeneity distribution at α = 1. The method is implemented as a part of HyPhy (version 2.5.1 or later). BUSTED[S] is available for free public use on the Datamonkey webserver (Weaver et al. 2018 (link)) at