To test the efficiency of AliGROOVE, we designed two sets of nucleotide and amino acid sequence data using 4-taxon and 6-taxon trees (Figure 1). The topology of the 4-taxon setup (setup A, Figure 1a) contained two long branches of unrelated taxa (branch lengths BL2 = 0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5) under three different branch length conditions for the other two short terminal branches (BL3 = 0.1, 0.12, 0.14 and RB = 0.1) and two different lengths of the short internal branch (BL1 = 0.01, 0.02). The 6-taxon setup (setup B, Figure 1b) contained two long internal branches (BL2 = 0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5) separated by a short internal branch (BL1 = 0.01), while the lengths of the terminal branches were kept constant (BL3 = 0.01 and RB = 0.1).
For both test setups, 100 alignments were generated for each step of BL2 branch elongation. Sequence length was set to 250,000 character state positions for each alignment of setup A and to 50,000 character state positions for setup B to reduce the calculation time. All alignments were generated with INDELible v.1.03 [24]. Nucleotide sequence data were simulated under the Jukes-Cantor (JC) model of sequence evolution and amino acid sequence data under the BLOSUM62 substitution model. All data were simulated with among-site rate variation (ASRV), using a mixed-distribution model with a shape parameter α = 1.0 and a proportion of invariant sites ρinv = 0.3. ASRV was modelled using a continuous Γ-rate distribution; indel events were not simulated.
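The size of the simulation design described above can be tallied with a short sketch. It assumes that the branch length factors of setup A were fully crossed (every BL1 value combined with every BL3 and BL2 value), which the text implies but does not state outright. The names `setup_a`, `setup_b` and `total_alignments` are illustrative and do not come from the original pipeline.

```python
from itertools import product

# Branch length values taken from the text.
BL2_STEPS = [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5]
REPLICATES = 100  # alignments per branch length combination

# Setup A (4 taxa). Assumption: BL1, BL3 and BL2 are fully crossed.
setup_a = [
    {"BL1": bl1, "BL2": bl2, "BL3": bl3, "RB": 0.1}
    for bl1, bl3, bl2 in product([0.01, 0.02], [0.1, 0.12, 0.14], BL2_STEPS)
]

# Setup B (6 taxa): only BL2 varies.
setup_b = [
    {"BL1": 0.01, "BL2": bl2, "BL3": 0.01, "RB": 0.1}
    for bl2 in BL2_STEPS
]


def total_alignments(grid, replicates=REPLICATES):
    """Number of simulated alignments for a grid of branch length combinations."""
    return len(grid) * replicates


if __name__ == "__main__":
    print("setup A:", len(setup_a), "combinations,", total_alignments(setup_a), "alignments")
    print("setup B:", len(setup_b), "combinations,", total_alignments(setup_b), "alignments")
```

Under the fully crossed assumption, setup A comprises 48 branch length combinations (4,800 alignments per data type) and setup B comprises 8 combinations (800 alignments per data type).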
Trees were inferred from the simulated data with PhyML_3.0_linux64 [25, 26]. We analyzed the data with a mixed-distribution model (JC + Γ + I) and the correct parameter values (α = 1.0, ρinv = 0.3). The only deviation from the simulation model was the categorization of the gamma distribution: the data were simulated under a continuous Γ distribution, whereas the analyses used four relative substitution rate categories (c = 4). Tree topologies and branch lengths were optimized. Maximum Likelihood analyses were performed and evaluated with a Perl pipeline. For each branch-length combination, we generated 100 data replicates and recorded the frequencies of correct and incorrect tree reconstructions using correct alignments and nearly correct substitution models (Figures 2, 3, Additional files 1, 2, 3).
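To illustrate what the four-category approximation of the Γ distribution means in practice, the following sketch computes the relative rates of c = 4 equal-probability categories for α = 1.0. It assumes that each category is represented by its mean rate; the text does not say whether the analyses used category means or medians. For α = 1.0 with mean 1, the Γ distribution is the unit exponential distribution, so the computation has a closed form and needs only the standard library. The function names are illustrative, and the sketch ignores the proportion of invariant sites.

```python
import math


def exp_quantile(p):
    """Quantile of the unit exponential distribution (Gamma with alpha = beta = 1)."""
    return math.inf if p >= 1.0 else -math.log(1.0 - p)


def upper_partial_mean(x):
    """Integral of t * exp(-t) from x to infinity, which equals (x + 1) * exp(-x)."""
    return 0.0 if math.isinf(x) else (x + 1.0) * math.exp(-x)


def discrete_gamma_rates_alpha1(k=4):
    """Mean relative rate of each of k equal-probability categories for alpha = 1.0."""
    bounds = [exp_quantile(i / k) for i in range(k + 1)]
    # Each category has probability 1/k, so its conditional mean is k times the
    # partial mean between its lower and upper boundary.
    return [
        k * (upper_partial_mean(lo) - upper_partial_mean(hi))
        for lo, hi in zip(bounds, bounds[1:])
    ]


if __name__ == "__main__":
    for i, rate in enumerate(discrete_gamma_rates_alpha1(4), start=1):
        print(f"category {i}: relative rate {rate:.4f}")
```

The four rates average to 1, so the discretization preserves the mean rate of the continuous distribution while replacing its shape with four point masses. This is the sense in which the substitution model used for inference was only nearly correct.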