Simulating and Reconstructing Cyanobacterial Gene Histories

To validate our approach, we simulated sequences using tree topologies, branch lengths, and alignment sizes based on 1099 gene families from 36 cyanobacterial genomes available in the HOGENOM database (Penel et al. 2009). As described in detail in Appendix 1 and illustrated in Figure 2a, to generate the set of simulated alignments we first reconstructed reconciled gene trees that maximize the joint likelihood and then used these gene trees to simulate amino acid sequences. To emulate the complexity of real data relative to available models of sequence evolution, we simulated sequences under a complex model, an LG model (Le and Gascuel 2008) with across-site rate variation and invariant sites, and attempted to reconstruct their history under a simple model, a Poisson model (Felsenstein 1981) with no rate variation.
Data.—To construct a simulated dataset, we first reconstructed gene trees for 1099 cyanobacterial gene families with 10 or more genes in any of the 36 cyanobacteria present in version 5 of the HOGENOM database (Penel et al. 2009). Families with more than 150 genes were not considered. For each family, amino acid sequences were extracted from the database and aligned using MUSCLE (v3.8.31) (Edgar 2004) with default parameters. The multiple alignment was subsequently cleaned using GBLOCKS (v0.91b) (Talavera and Castresana 2007) with the options:

Cleaned alignments are available from the Dryad data repository at http://datadryad.org, doi:10.5061/dryad.pv6df.
Reconstructing “real” trees.—For each cleaned alignment, an MCMC sample was obtained with PhyloBayes (v3.2e) (Lartillot et al. 2009) under an LG+Γ4+I substitution model (Le and Gascuel 2008), with a burn-in of 1000 samples followed by at least 3000 samples. Gene families were then separated into two datasets: (i) dataset I, composed of 342 universal single-copy families with exactly one copy in each of the 36 cyanobacteria, and (ii) dataset II, which includes dataset I and is composed of all 1099 families, each with at least 10 genes in any of the 36 cyanobacterial genomes considered. For the 342 single-copy universal gene families of dataset I, 10 000 trees were sampled.
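The family selection and the split into the two datasets can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the input format (a mapping from family name to the genome label of each gene) and the function names are hypothetical, and the size bounds are read here as applying to the total number of genes in a family, which is an assumption about the wording above.

```python
from collections import Counter

N_GENOMES = 36    # cyanobacterial genomes considered in the text
MIN_GENES = 10    # lower family-size bound from the text
MAX_GENES = 150   # families larger than this were not considered


def is_universal_single_copy(genome_labels, n_genomes=N_GENOMES):
    """True if each of the n_genomes genomes carries exactly one gene copy."""
    counts = Counter(genome_labels)
    return len(counts) == n_genomes and all(c == 1 for c in counts.values())


def split_datasets(families):
    """Return (dataset_I, dataset_II) as sorted lists of family names.

    `families` maps a family name to a list with one genome label per gene
    (hypothetical input format, assumed for this sketch).
    """
    dataset_ii = sorted(
        name for name, genes in families.items()
        if MIN_GENES <= len(genes) <= MAX_GENES
    )
    # Dataset I is the subset of dataset II that is universal and single copy.
    dataset_i = [n for n in dataset_ii if is_universal_single_copy(families[n])]
    return dataset_i, dataset_ii
```

Under these assumptions, dataset I is always a subset of dataset II, which matches the description that dataset II includes dataset I.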
For each family, we used the species tree shown in Figure A.4. We ran ALEsample to sample duplication, transfer, and loss (DTL) rates together with reconciled gene trees (at least 5000 reconciled trees per family), and ALEml to find the ML DTL rates and the corresponding ML reconciled gene tree.
For each ALEsample sample, we computed the majority consensus tree. A fully resolved “real” tree for each gene family was then obtained from the same sample by finding the tree that maximized the conditional clade probabilities (CCPs) estimated from it. For both real and simulated alignments, sequence-only trees were also inferred with PhyML (version 20110526) (Guindon and Gascuel 2003) under the LG+Γ4+I model with the options:

“Real” gene trees are available from the Dryad data repository at http://datadryad.org, doi:10.5061/dryad.pv6df.
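The idea of picking the fully resolved tree that maximizes CCPs over a sample can be sketched with a small dynamic program. This is a simplified illustration under stated assumptions, not the ALE implementation: trees are rooted and binary, encoded as nested 2-tuples of leaf names, and all function names are hypothetical. The conditional probability of a split given its parent clade is estimated as the number of sampled trees showing that split divided by the number showing the clade, and the probability of a tree is the product of these terms over its internal nodes.

```python
from collections import Counter
from functools import lru_cache


def clade_of(tree):
    """Leaf set of a rooted binary tree given as nested 2-tuples of names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return clade_of(tree[0]) | clade_of(tree[1])


def count_splits(trees):
    """Count how often each clade, and each split of a clade, is observed."""
    clade_counts, split_counts = Counter(), Counter()

    def visit(tree):
        if isinstance(tree, str):
            return frozenset([tree])
        a, b = visit(tree[0]), visit(tree[1])
        clade = a | b
        clade_counts[clade] += 1
        split_counts[(clade, frozenset([a, b]))] += 1
        return clade

    for t in trees:
        visit(t)
    return clade_counts, split_counts


def max_ccp_tree(trees):
    """Return (probability, tree) of the tree with the highest CCP product."""
    clade_counts, split_counts = count_splits(trees)
    splits_by_clade = {}
    for (clade, split), n in split_counts.items():
        # Conditional probability of this split given its parent clade.
        splits_by_clade.setdefault(clade, []).append(
            (split, n / clade_counts[clade]))

    @lru_cache(maxsize=None)
    def best(clade):
        if len(clade) == 1:
            return 1.0, next(iter(clade))
        best_p, best_tree = 0.0, None
        for split, p in splits_by_clade[clade]:
            a, b = sorted(split, key=sorted)  # deterministic child order
            pa, ta = best(a)
            pb, tb = best(b)
            if p * pa * pb > best_p:
                best_p, best_tree = p * pa * pb, (ta, tb)
        return best_p, best_tree

    return best(clade_of(trees[0]))
```

Because the recursion combines splits observed in different sampled trees, the returned tree does not have to appear in the sample itself, which is what makes the CCP approach useful for small samples.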
Sequence simulation.—To simulate amino acid sequences, we used bppseqgen (v1.1.0) (Dutheil and Boussau 2008), keeping the branch lengths and alignment sizes and using the COMPLEX model, which corresponds to an LG model with site rate variation described by a gamma distribution with α = 0.1 and 10% invariant sites.
Simulated alignments are available from the Dryad data repository at http://datadryad.org, doi:10.5061/dryad.pv6df.
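To give a feel for the site-rate heterogeneity implied by the COMPLEX model (α = 0.1 with 10% invariant sites), the sketch below draws per-site relative rates with the Python standard library. It only illustrates the rate distribution and does not reproduce bppseqgen: the function name is hypothetical, and a continuous gamma distribution with mean 1 is assumed, since the text does not state whether a continuous or discretized gamma was used for simulation.

```python
import random


def sample_site_rates(n_sites, alpha=0.1, p_invariant=0.10, seed=None):
    """Draw one relative substitution rate per site.

    With probability p_invariant a site is invariant (rate 0.0); otherwise
    its rate is gamma distributed with shape alpha and mean 1.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_sites):
        if rng.random() < p_invariant:
            rates.append(0.0)
        else:
            # gammavariate(shape, scale): scale = 1/alpha gives a mean of 1.
            rates.append(rng.gammavariate(alpha, 1.0 / alpha))
    return rates


if __name__ == "__main__":
    rates = sample_site_rates(1000, seed=1)
    slow = sum(r < 0.01 for r in rates) / len(rates)
    print(f"fraction of sites with rate < 0.01: {slow:.2f}")
```

With a shape parameter as small as 0.1, most variable sites evolve very slowly while a few evolve very fast, which is the kind of heterogeneity a Poisson model without rate variation cannot capture.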
Inference for simulated data.—For each simulated alignment, an MCMC sample was obtained using PhyloBayes (v3.2e) using a SIMPLE model corresponding to a Poisson model (Felsenstein 1981) with no rate variation.
We sampled 10 000 trees after a burn-in of 1000 samples, with a sample taken every 10 iterations. For the simulated sequences corresponding to the 342 single-copy universal gene families of dataset I, we also sampled trees using the COMPLEX model corresponding to an LG+Γ4+I substitution model, sampling 3000 trees after a burn-in of 1000 samples.
For each family, we ran ALEsample to sample DTL rates together with reconciled gene trees (at least 5000 reconciled trees per family), and ALEml to find the ML DTL rates and the corresponding ML reconciled gene tree.
Distances to the “real” tree for gene trees of dataset I (Fig. 2b) were computed using majority consensus trees calculated from the sequence-only PhyloBayes samples and from the joint ALEsample samples, under both the SIMPLE and the COMPLEX model. The same procedure was used for the simulated sequences corresponding to dataset II (Fig. A.1a) under the SIMPLE model. For the COMPLEX model, joint trees were not computed and PhyML trees were used as the sequence-only trees.
Inference of numbers of DTL events.—For joint trees, the number of DTL events was inferred with ALEml from a sample of trees obtained under the SIMPLE model. For sequence trees, it was inferred with ALEml from fixed PhyML trees (based on the LG+Γ4+I substitution model).
ML reconciled trees are available from the Dryad data repository at http://datadryad.org, doi:10.5061/dryad.pv6df.
Statistical support.—Statistical support of bipartitions was calculated from samples of gene trees obtained either using PhyloBayes, for the sequence-only case, or using ALEsample in the joint case. The support of each observed bipartition was estimated as the fraction of all trees in which it was present.
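The support calculation described above, the fraction of sampled trees containing a given bipartition, can be sketched as follows. This is a minimal illustration rather than the code used in the study: trees are assumed to be given as nested tuples of leaf names over the same leaf set, and the function names are hypothetical. Each bipartition is stored as the side that does not contain a fixed reference leaf, so that the same split is recognized regardless of how a tree happens to be rooted.

```python
from collections import Counter


def bipartitions(tree):
    """Non-trivial bipartitions of a tree given as nested tuples of names."""
    clades = []

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(leaves(child) for child in node))
        clades.append(clade)
        return clade

    all_leaves = leaves(tree)
    reference = min(all_leaves)  # fixed leaf used to canonicalize each split
    result = set()
    for clade in clades:
        side = all_leaves - clade if reference in clade else clade
        # Skip trivial splits that separate zero or one leaf from the rest.
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result


def bipartition_support(trees):
    """Map each observed bipartition to the fraction of trees containing it."""
    counts = Counter()
    for tree in trees:
        counts.update(bipartitions(tree))
    return {split: n / len(trees) for split, n in counts.items()}
```

The same function applies to either kind of sample, whether the trees come from a sequence-only run or from a joint run, as long as they are parsed into this nested-tuple form first.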


Szöllősi G.J., Rosikiewicz W., Boussau B., Tannier E, & Daubin V. (2013). Efficient Exploration of the Space of Reconciled Gene Trees. Systematic Biology, 62(6), 901-912.

Publication 2013



Other organizations : Université Claude Bernard Lyon 1, Centre National pour la Recherche Scientifique et Technique (CNRST), Centre National de la Recherche Scientifique, Eötvös Loránd University, Laboratoire de Biométrie et Biologie Evolutive, Adam Mickiewicz University in Poznań, University of California, Berkeley, Centre de Recherche en Informatique


Variable analysis

independent variables

None explicitly mentioned

dependent variables

Distances to the "real" tree for gene trees of dataset I and dataset II
Number of DTL events for joint trees and sequence trees

control variables

Branch lengths and alignment sizes used in sequence simulation
Species tree used to sample reconciled gene trees
Substitution models used for sequence simulation (LG+Γ4+I) and inference (Poisson model, LG+Γ4+I)

controls

Positive control: None mentioned
Negative control: None mentioned
