We built upon a previous phylogenomic dataset [39] to select a curated set of 258 orthologous markers for deuterostomes. Alignments were complemented with sequences from the National Center for Biotechnology Information (NCBI) databases using a multiple best reciprocal hit approach implemented in the newly designed Forty-Two software [40]. Because 454 DNA sequence reads are prone to sequencing errors that typically disrupt the reading frame when translated into amino acids, alignments were verified by eye using the program ED from the MUST package [41]. Ambiguously aligned regions were excluded from each individual protein alignment using Gblocks with medium default parameters [42], followed by a few manual refinements using NET from the MUST package, because this automated approach is sometimes too conservative. This manual refinement step restored only 418 amino acid sites (i.e. 0.6% of the total alignment length). Potential environmental contamination and cross-contamination between our samples were also addressed at the alignment stage by performing Basic Local Alignment Search Tool (BLAST) searches of each sequence against a taxon-rich reference database maintained for each curated gene alignment, and were further sought by visual examination of each individual gene phylogeny.
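The reciprocal best hit criterion mentioned above can be illustrated with a minimal sketch. It is not the Forty-Two implementation, whose multiple best reciprocal hit procedure is more elaborate. The score tables, sequence identifiers, and function names below are hypothetical and stand in for parsed BLAST results.

```python
from typing import Dict, Optional

# Hypothetical score tables: query ID -> {subject ID: bit score}.
# In practice these would be parsed from BLAST output (an assumption here).
Hits = Dict[str, Dict[str, float]]


def best_hit(hits: Hits, query: str) -> Optional[str]:
    """Return the highest-scoring subject for a query, or None if it has no hits."""
    subjects = hits.get(query)
    if not subjects:
        return None
    return max(subjects, key=lambda s: subjects[s])


def reciprocal_best_hits(forward: Hits, reverse: Hits) -> Dict[str, str]:
    """Pair a query with a subject only when each is the other's best hit."""
    pairs: Dict[str, str] = {}
    for query in forward:
        subject = best_hit(forward, query)
        if subject is not None and best_hit(reverse, subject) == query:
            pairs[query] = subject
    return pairs


# Toy data: reference sequences searched against candidates, and the reverse.
forward = {
    "refA": {"cand1": 250.0, "cand2": 90.0},
    "refB": {"cand2": 120.0},
}
reverse = {
    "cand1": {"refA": 245.0},
    "cand2": {"refA": 95.0, "refB": 80.0},
}

print(reciprocal_best_hits(forward, reverse))
```

In this toy case only `refA` and `cand1` are retained as a pair: `cand2` is the best hit of `refB`, but its own best hit is `refA`, so the match is not reciprocal and would not be added to the alignment.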
The concatenation of the resulting 258 amino acid alignments was constructed with SCaFoS [43] by defining 63 deuterostomian operational taxonomic units (OTUs) representing all major lineages. The taxon sampling included 18 tunicates, 34 vertebrates, and one cephalochordate, with seven echinoderms, two hemichordates, and one xenoturbellid as more distant outgroups. When several sequences were available for a given OTU, the slowest evolving one was selected by SCaFoS, according to maximum likelihood distances computed by TREE-PUZZLE [44] under a WAG+F model. The percentage of missing data per taxon was reduced by creating some chimeric sequences from closely related species (i.e. Eptatretus burgeri/Myxine glutinosa, Petromyzon marinus/Lethenteron japonicum, Callorhinchus milii/C. callorynchus, Latimeria menadoensis/L. chalumnae, Rana chensinensis/R. catesbeiana, Alligator sinensis/A. mississippiensis, Chrysemys picta/Emys orbicularis/Trachemys scripta, Patiria miniata/P. pectinifera/Solaster stimpsonii, Apostichopus japonicus/Parastichopus parvimensis, Ophionotus victoriae/Amphiura filiformis) and by retaining only proteins with at most 15 missing OTUs. The tunicate Microcosmus squamiger was excluded at this stage because of a high percentage of missing data resulting from the low number of contigs obtained in the assembly. The final alignment comprised 258 proteins and 63 taxa for 66,593 unambiguously aligned amino acid sites with 20% missing amino acid data.
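The gene filtering and concatenation steps can be sketched as follows. This is a toy illustration under stated assumptions, not the SCaFoS procedure: the function names, the three-OTU dataset, and the threshold of one missing OTU are hypothetical (the study used at most 15 missing OTUs out of 63), and counting gap characters as missing data is an assumption made for the example.

```python
from typing import Dict, List

# Hypothetical toy input: gene name -> {OTU name: aligned protein sequence}.
Alignment = Dict[str, str]


def filter_genes(
    genes: Dict[str, Alignment], otus: List[str], max_missing: int
) -> Dict[str, Alignment]:
    """Keep genes for which at most `max_missing` OTUs have no sequence."""
    kept: Dict[str, Alignment] = {}
    for name, aln in genes.items():
        missing = sum(1 for otu in otus if otu not in aln)
        if missing <= max_missing:
            kept[name] = aln
    return kept


def concatenate(genes: Dict[str, Alignment], otus: List[str]) -> Dict[str, str]:
    """Concatenate gene alignments in name order, padding absent OTUs with '?'."""
    matrix = {otu: "" for otu in otus}
    for name in sorted(genes):
        aln = genes[name]
        length = len(next(iter(aln.values())))  # all rows of a gene share one length
        for otu in otus:
            matrix[otu] += aln.get(otu, "?" * length)
    return matrix


def missing_fraction(matrix: Dict[str, str]) -> float:
    """Fraction of cells that are '-', '?' or 'X' (assumed to mean missing)."""
    total = sum(len(seq) for seq in matrix.values())
    missing = sum(seq.count(c) for seq in matrix.values() for c in "-?X")
    return missing / total


otus = ["A", "B", "C"]
genes = {
    "g1": {"A": "MKV", "B": "MKI", "C": "M-V"},
    "g2": {"A": "LL"},                # two OTUs missing: dropped below
    "g3": {"A": "GA", "B": "GS"},     # one OTU missing: kept
}

kept = filter_genes(genes, otus, max_missing=1)
matrix = concatenate(kept, otus)
print(sorted(kept), matrix, missing_fraction(matrix))
```

With this toy input, `g2` is discarded, the concatenated row for OTU `C` becomes `M-V??`, and 3 of the 15 cells count as missing.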