We adapted the existing BUSTED test of positive selection (Murrell et al. 2015 (link)) to account for the presence of SRV and call the new method BUSTED[S]. To explore the generality of our findings about FPRs in the presence of SRV we also investigated a second existing test of selection, the M1a versus M2a comparison from Wong et al. (2004) (link), modified slightly to employ MG94 substitution models.
BUSTED[S] is a straightforward extension of BUSTED (Murrell et al. 2015 (link)). The nucleotide substitution process is modeled using the standard finite state continuous time Markov process approach of Muse and Gaut (1994) (link), with entries of the instantaneous rate matrix Q corresponding to substitutions between sense codons i and j denoted as
qij={αsθijπjp1-step synonymous change,αsωbsθijπjp1-step nonsynonymous change,0otherwise.
The θij (with θij=θji ) are parameters governing nucleotide substitution biases. For example, θACT,AGT=θCG and because we incorporate the standard nucleotide GTR model there are five identifiable θij parameters: θAC,θAT,θCG,θCT , and θGT , with θAG1 . The position-specific equilibrium frequency of the target nucleotide of a substitution is πjp ; for example, it is πG2 for the second-position change associated with qACT,AGT . The πjp and the stationary frequencies of codons under this model are estimated using the CF3 × 4 procedure (Kosakovsky Pond et al. 2010 (link)), adding nine parameters to the model. The ratio of nonsynonymous to synonymous substitution rates for site s along branch b is ωbs, and this ratio is modeled using a 3-bin general discrete distribution (GDD) with five estimated hyperparameters: 0ω1ω21ω3,p1=P(ωbs=ω1) , and p2=P(ωbs=ω2) . The procedure for efficient computation of the phylogenetic likelihood function for these models was described in Kosakovsky Pond et al. (2011) (link). The quantity αs is a site-specific synonymous substitution rate (no branch-to-branch variation is modeled) drawn from a separate 3-bin GDD. The mean of this distribution is constrained equal to one to maintain statistical identifiability, resulting in four estimated hyperparameters: 0cα1<α2=ccα3,f1=P(αs=α1) , and f2=P(αs=α2) , with c chosen to ensure that E{αs}=1 . Typical implementations, including ours, allow the number of α and ω rate categories to be separately adjusted by the user, for example, to minimize AICc or to optimize some other measure of model fit. The default setting of three categories generally provides a good balance between fit and performance when using this GDD approach for modeling. Our HyPhy implementation of BUSTED[S] will warn the user if there is evidence of model overfitting, such as the appearance of rate categories with very similar estimated rate values or very low frequencies.
The BUSTED[S] procedure for identifying positive selection is the likelihood ratio test comparing the full model described above to the constrained model formed when ω3 is set equal to 1 (i.e., no positively selected sites). Critical values of the test are derived from a 50:50 mixture distribution of χ02 and χ22 . Note that this asymptotic statistic differs from the 3-component mixture used by Murrell et al. (2015) (link); the simulation studies performed in the current study suggest that this less conservative mixture is sufficient to maintain nominal Type I errors. Both BUSTED[S] and BUSTED analyses in the current work use the same 50:50 mixture test statistic. BUSTED[S] reduces to BUSTED by setting αs=1 , that is, by placing all the mass of the synonymous rate heterogeneity distribution at α = 1. The method is implemented as a part of HyPhy (version 2.5.1 or later). BUSTED[S] is available for free public use on the Datamonkey webserver (Weaver et al. 2018 (link)) at https://www.datamonkey.org/BUSTED (last accessed February 24, 2020).
Free full text: Click here