Data.—To construct a simulated dataset, we first reconstructed gene trees for 1099 cyanobacterial gene families with 10 or more genes in any of the 36 cyanobacteria present in version 5 of the HOGENOM database (Penel et al. 2009). Families with more than 150 genes were not considered. For each family, amino acid sequences were extracted from the database and aligned using MUSCLE (v3.8.31) (Edgar 2004) with default parameters. The multiple alignment was subsequently cleaned using GBLOCKS (v0.91b) (Talavera and Castresana 2007) with the options:
Cleaned alignments are available from the Dryad data repository at
Reconstructing “real” trees.—For each cleaned alignment, an MCMC sample was obtained with PhyloBayes (v3.2e) (Lartillot et al. 2009) under an LG+Γ4+I substitution model (Le and Gascuel 2008), with a burn-in of 1000 samples followed by at least 3000 samples. Following this step, gene families were separated into two datasets: (i) dataset I, composed of 342 universal single-copy families with exactly one copy in each of the 36 cyanobacteria, and (ii) dataset II, which includes dataset I and is composed of 1099 families, each with at least 10 genes in any of the 36 cyanobacterial genomes considered. For the 342 single-copy universal gene families of dataset I, 10 000 trees were sampled.
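The family selection and the split into datasets I and II can be illustrated with a minimal Python sketch. It is not the pipeline used here: the function name `classify_family`, the input format (a mapping from genome name to copy number), and the reading of the size thresholds as limits on the total number of genes per family are assumptions made for illustration.

```python
def classify_family(copies_per_genome, genomes, min_genes=10, max_genes=150):
    """Return the set of dataset labels a gene family would belong to.

    copies_per_genome: dict mapping genome name -> number of gene copies
                       (hypothetical input format).
    genomes:           the genomes under consideration (36 in the text).
    Assumption: the 10 and 150 thresholds apply to the total gene count.
    """
    total = sum(copies_per_genome.get(g, 0) for g in genomes)
    if total < min_genes or total > max_genes:
        return set()  # family not considered at all

    labels = {"II"}  # dataset II contains every retained family
    # Dataset I: exactly one copy in each genome (universal, single-copy).
    if all(copies_per_genome.get(g, 0) == 1 for g in genomes):
        labels.add("I")
    return labels


# Usage with 36 hypothetical genome names.
genomes = [f"genome_{i:02d}" for i in range(36)]
single_copy = {g: 1 for g in genomes}
multi_copy = dict(single_copy, genome_00=3)
print(classify_family(single_copy, genomes))
print(classify_family(multi_copy, genomes))
```

Because dataset II includes dataset I, the sketch returns a set of labels rather than a single label.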
For each family, we used the species tree shown in
For each sample of trees obtained with ALEsample, we computed the majority consensus tree. A fully resolved “real” tree was then obtained for each gene family by finding the tree that maximized the conditional clade probabilities (CCPs) estimated from the same sample. For both real and simulated alignments, sequence-only trees were also inferred with PhyML (version 20110526) (Guindon and Gascuel 2003) under the LG+Γ4+I model with the options:
“Real” gene trees are available from the Dryad data repository at
Sequence simulation.—To simulate amino acid sequences, we used bppseqgen (v1.1.0) (Dutheil and Boussau 2008), keeping the branch lengths and alignment sizes and using the COMPLEX model, corresponding to an LG model with site rate variation described by a gamma distribution with α = 0.1 and 10% invariant sites.
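To give a feel for how heterogeneous site rates are under these settings, the following Python sketch draws per-site rates from a gamma distribution with invariant sites. It does not reproduce bppseqgen: the function name `draw_site_rates` is hypothetical, a continuous gamma is used instead of any discretized version, and the scaling of the gamma to mean 1 is an assumption.

```python
import random


def draw_site_rates(n_sites, alpha=0.1, p_invariant=0.10, seed=None):
    """Draw one relative substitution rate per alignment site.

    With probability p_invariant a site is invariant (rate 0); otherwise
    its rate is drawn from a gamma distribution with shape alpha.
    Assumption: scale = 1/alpha, so variable sites have mean rate 1.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_sites):
        if rng.random() < p_invariant:
            rates.append(0.0)  # invariant site
        else:
            # random.gammavariate takes (shape, scale).
            rates.append(rng.gammavariate(alpha, 1.0 / alpha))
    return rates


rates = draw_site_rates(20000, seed=1)
print(sum(r == 0.0 for r in rates) / len(rates))  # fraction of invariant sites
```

With a shape parameter as small as 0.1, most variable sites evolve very slowly while a few evolve very fast, which is what makes this setting challenging for a model without rate variation.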
Simulated alignments are available from the Dryad data repository at
Inference for simulated data.—For each simulated alignment, an MCMC sample was obtained using PhyloBayes (v3.2e) under a SIMPLE model corresponding to a Poisson model (Felsenstein 1981) with no rate variation.
We sampled 10 000 trees after a burn-in of 1000 samples, with a sample taken every 10 iterations. For the simulated sequences corresponding to the 342 single-copy universal gene families of dataset I, we also sampled trees using the COMPLEX model corresponding to an LG+Γ4+I substitution model, sampling 3000 trees after a burn-in of 1000 samples.
For each family, we used ALEsample to sample DTL rates and reconciled gene trees (at least 5000 reconciled trees per family), and ALEml to find the ML DTL rates and the corresponding ML reconciled gene tree.
Distances to the “real” tree for gene trees of dataset I (
Inference of numbers of DTL events.—The number of DTL events for joint trees was inferred with ALEml from a sample of trees obtained under the SIMPLE model. The number of DTL events for sequence trees was inferred with ALEml from fixed PhyML trees (based on the LG+Γ4+I substitution model).
ML reconciled trees are available from the Dryad data repository at
Statistical support.—Statistical support of bipartitions was calculated from samples of gene trees obtained either using PhyloBayes, for the sequence-only case, or using ALEsample in the joint case. The support of each observed bipartition was estimated as the fraction of all trees in which it was present.
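The support calculation described above can be sketched in Python as follows. This is an illustration rather than the code used here: trees are represented as nested tuples of leaf names (an assumed format), the function names are hypothetical, and trees are treated as unrooted, so each bipartition is stored as the side that does not contain an arbitrary reference leaf.

```python
from collections import Counter


def leaf_set(tree):
    """Return the set of leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions of a tree, one frozenset each."""
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)  # reference leaf used to pick a canonical side
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Store the side of the split that does not contain the reference.
        side = all_leaves - clade if ref in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def bipartition_support(tree_sample):
    """Fraction of sampled trees in which each observed bipartition occurs."""
    counts = Counter()
    for tree in tree_sample:
        counts.update(bipartitions(tree))
    return {split: n / len(tree_sample) for split, n in counts.items()}


sample = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
]
for split, support in bipartition_support(sample).items():
    print(sorted(split), round(support, 3))
```

The same function applies to either kind of sample, whether the trees come from PhyloBayes (sequence-only case) or from ALEsample (joint case).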