Optimizing Target Enrichment Efficiency in Phylogenomic Studies

Target enrichment is typically conducted on multiple samples that have been pooled during bait hybridization and sequencing. HybPiper maps reads against the target genes for each sample separately. This differs from several other target enrichment analysis pipelines (Straub et al., 2011; Bi et al., 2012; Faircloth, 2015), which typically begin with de novo assembly for each sample and then attempt to match contigs to target loci. In HybPiper, reads are first sorted based on whether they map to a target locus. We explored two methods for aligning reads to the targets: (1) BLASTX (Camacho et al., 2009), which uses peptide sequences as a reference, and (2) BWA (Li and Durbin, 2009), which uses nucleotide sequences. In principle, the BLASTX approach should be more tolerant of substitutions between the target sequence and sample reads, because alignments are conducted at the peptide level and may detect similarity between more distant sequences than BWA can. The BWA approach may result in fewer overall reads mapping to a distantly related target sequence, but it is several times faster than the BLASTX method.
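As a rough illustration of why peptide-level comparison can be more tolerant of substitutions, the sketch below compares a hypothetical target and read that differ only at third codon positions. The sequences, the reduced codon table, and the helper names are invented for this example; it is not how BLASTX or BWA score alignments, only a demonstration that nucleotide identity can drop while the encoded peptide stays the same.

```python
# Subset of the standard genetic code, just large enough for this example.
CODON_TABLE = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
    "AAA": "K", "AAG": "K",
    "GAT": "D", "GAC": "D",
}


def translate(nucleotides):
    """Translate an in-frame nucleotide string codon by codon."""
    usable = len(nucleotides) - len(nucleotides) % 3
    return "".join(CODON_TABLE[nucleotides[i:i + 3]] for i in range(0, usable, 3))


def percent_identity(a, b):
    """Percentage of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Hypothetical sequences: every codon differs at its third position only.
target = "GCTCTTGGTAAAGAT"
read = "GCCCTCGGAAAGGAC"

nt_identity = percent_identity(target, read)
aa_identity = percent_identity(translate(target), translate(read))

print(f"nucleotide identity: {nt_identity:.1f}%")
print(f"peptide identity:    {aa_identity:.1f}%")
```

In this toy case a third of the nucleotide positions differ, yet both sequences encode the same five residues, which is the kind of divergence a peptide reference can absorb.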
HybPiper sorts reads into separate directories for each gene using Biopython (Cock et al., 2009) to efficiently parse the FASTA format. In our tests of the BLASTX method, we set an E-value threshold of 1 × 10⁻⁵ to accept alignments, but the user can change this. For the BWA method, all alignable reads are sorted into each gene directory using a Python wrapper around SAMtools (Li et al., 2009). We calculate the enrichment efficiency as the percentage of trimmed, filtered reads that were sorted into a gene directory.
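The sorting and efficiency calculation described above can be sketched as follows. This is a minimal illustration, not HybPiper's own code: it assumes BLAST tabular output (12 tab-separated columns, with the E-value in the eleventh), assumes target names of the form `Taxon-gene`, treats a hit as accepted when its E-value is at or below the cutoff, and uses invented read and gene names. The function names are hypothetical.

```python
from collections import defaultdict


def sort_reads_by_gene(blast_lines, evalue_cutoff=1e-5):
    """Return a mapping of gene name -> set of read IDs with an accepted hit.

    Assumes BLAST tabular lines: qseqid, sseqid, ..., evalue (column 11), bitscore.
    """
    reads_by_gene = defaultdict(set)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        read_id, target_id, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= evalue_cutoff:
            # Assumed naming convention: "Taxon-gene"; keep the gene part.
            gene = target_id.split("-")[-1]
            reads_by_gene[gene].add(read_id)
    return reads_by_gene


def enrichment_efficiency(reads_by_gene, total_reads):
    """Percentage of all trimmed, filtered reads sorted into at least one gene."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    on_target = set().union(*reads_by_gene.values())
    return 100.0 * len(on_target) / total_reads


# Hypothetical hits for a sample with four trimmed, filtered reads in total.
hits = [
    "read1\tArtocarpus-gene001\t95.0\t40\t2\t0\t1\t120\t10\t49\t1e-20\t85.0",
    "read2\tArtocarpus-gene001\t60.0\t30\t12\t0\t1\t90\t5\t34\t0.5\t25.0",
    "read3\tMorus-gene002\t88.0\t42\t5\t0\t1\t126\t1\t42\t3e-12\t70.0",
    "read1\tMorus-gene002\t70.0\t35\t10\t0\t1\t105\t1\t35\t1e-6\t40.0",
]

sorted_reads = sort_reads_by_gene(hits)
efficiency = enrichment_efficiency(sorted_reads, total_reads=4)
print(f"enrichment efficiency: {efficiency:.1f}%")
```

A read that hits more than one gene is placed in each gene's set but counted once toward the on-target total, so the percentage cannot exceed 100.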
For the Artocarpus reads, an average of 71.9% of reads were on target (range 64.4–79.9%), based on the BLASTX method. Enrichment efficiency was lower for some of the outgroup samples, which ranged from just 5.0% for Antiaropsis K. Schum. to 71.6% for Ficus L. To address whether the presence of duplicate reads affects our estimate of enrichment efficiency, we removed paired duplicate reads using SuperDeduper (http://dstreett.github.io/Super-Deduper/). Most samples had between 6% and 18% duplicate read pairs, and a similar percentage of the duplicate read pairs mapped to the target loci (Appendix S1). One outlier was Ficus, which had 34% duplicate reads, 42% of which mapped to targets. After adjusting for duplicate reads, our estimates of enrichment efficiency were reduced by about 4% on average (Table 1). Removing duplicate reads did not affect the extraction of exon sequences in HybPiper for this data set.
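The text does not spell out how the duplicate adjustment was computed. One plausible reading, shown here purely as an assumption, is to subtract duplicate reads from both the on-target count and the total before recomputing the percentage. The counts below are invented and are not taken from Table 1 or Appendix S1.

```python
def efficiency(on_target, total):
    """Enrichment efficiency as a percentage of reads that are on target."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * on_target / total


def deduplicated_efficiency(total, on_target, dup_total, dup_on_target):
    """Efficiency after removing duplicates from numerator and denominator.

    Assumed adjustment; the source does not state the exact formula.
    """
    return efficiency(on_target - dup_on_target, total - dup_total)


# Hypothetical sample: 1000 reads, 700 on target, 120 duplicates (100 on target).
raw = efficiency(700, 1000)
adjusted = deduplicated_efficiency(1000, 700, 120, 100)

print(f"raw efficiency:      {raw:.1f}%")
print(f"adjusted efficiency: {adjusted:.1f}%")
```

Under this formula the adjusted value falls only when duplicates are on target more often than reads overall, as in the invented counts here.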
The phylogenetic distance to Artocarpus did not seem related to percent enrichment. However, the two outgroup samples that were pooled in a hybridization with Artocarpus in the first sequencing run had much lower enrichment efficiency than ingroup samples (Table 1). This suggests that multiplexing at the hybridization stage should be nonrandom, and only libraries of taxa that are relatively equidistant from the taxa used to design the bait sequences should be pooled. This strategy has been previously recommended in other studies (McGee et al., 2016).


Johnson M.G., Gardner E.M., Liu Y., Medina R., Goffinet B., Shaw A.J., Zerega N.J., & Wickett N.J. (2016). HybPiper: Extracting coding sequence and introns for phylogenetics from high-throughput sequencing reads using target enrichment. Applications in Plant Sciences, 4(7), apps.1600016.

Publication 2016

Artocarpus, Exon, Ficus, Gene, Hybridization, Maps, Nucleotide sequences, Peptide, Python

Corresponding Organization : Brooklyn Botanic Garden

Other organizations : Northwestern University, University of Connecticut, Duke University


Variable analysis

independent variables

Two methods for aligning reads to the targets: (1) BLASTX, which uses peptide sequences as a reference, and (2) BWA, which uses nucleotide sequences.

dependent variables

Enrichment efficiency, calculated as the percentage of trimmed, filtered reads that were sorted into a gene directory.

control variables

E-value threshold of 1 × 10⁻⁵ to accept BLASTX alignments (can be changed by the user).
Removing paired duplicate reads using SuperDeduper.
