Target enrichment is typically conducted on multiple samples that have been pooled during bait hybridization and sequencing. HybPiper maps reads against the target genes for each sample separately. This procedure differs from that of several other target enrichment analysis pipelines (Straub et al., 2011; Bi et al., 2012; Faircloth, 2015), which typically begin with de novo assembly for each sample and then attempt to match contigs to target loci. In HybPiper, reads are first sorted based on whether they map to a target locus. We explored two methods for aligning reads to the targets: (1) BLASTX (Camacho et al., 2009), which uses peptide sequences as a reference, and (2) BWA (Li and Durbin, 2009), which uses nucleotide sequences. In principle, the BLASTX approach should be more forgiving of substitutions between the target sequence and sample reads, because alignments are conducted at the peptide level and may detect similarity between more distant sequences than BWA. The BWA approach may result in fewer overall reads mapping to a distantly related target sequence, but is several times faster than the BLASTX method.
HybPiper sorts reads into separate directories for each gene using Biopython (Cock et al., 2009) to efficiently parse the FASTA format. In our tests of the BLASTX method, we set an E-value threshold of 1 × 10⁻⁵ to accept alignments, but the user can change this. For the BWA method, all alignable reads are sorted into each gene directory using a Python wrapper around SAMtools (Li et al., 2009). We calculate the enrichment efficiency as the percentage of trimmed, filtered reads that were sorted into a gene directory.
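The sorting and efficiency calculation described above can be illustrated with a minimal sketch. It is not HybPiper's own code: the function names are hypothetical, the input is assumed to be BLAST tabular output with the default 12 columns (E-value in column 11), target identifiers are assumed to end in "-GeneName", and hits are accepted when the E-value is at or below the threshold.

```python
from collections import defaultdict


def sort_reads_by_gene(blast_lines, evalue_threshold=1e-5):
    """Group read IDs by gene for hits passing the E-value threshold.

    Assumes default BLAST tabular columns: read ID in column 1,
    target ID in column 2, E-value in column 11.
    """
    reads_by_gene = defaultdict(set)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        read_id, target_id, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= evalue_threshold:
            # Assumption: target IDs look like "Taxon-GeneName".
            gene = target_id.split("-")[-1]
            reads_by_gene[gene].add(read_id)
    return dict(reads_by_gene)


def enrichment_efficiency(reads_by_gene, total_reads):
    """Percentage of trimmed, filtered reads sorted into any gene."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    on_target = set().union(*reads_by_gene.values())
    return 100.0 * len(on_target) / total_reads


# Hypothetical hits for a sample with four trimmed, filtered reads.
example_hits = [
    "read1\tArtocarpus-gene001\t95.0\t30\t1\t0\t1\t90\t10\t39\t1e-12\t70.1",
    "read1\tMorus-gene001\t90.0\t30\t3\t0\t1\t90\t10\t39\t1e-10\t64.3",
    "read2\tArtocarpus-gene002\t88.0\t25\t3\t0\t4\t78\t50\t74\t3e-08\t55.0",
    "read3\tArtocarpus-gene001\t60.0\t20\t8\t0\t1\t60\t80\t99\t0.5\t25.0",
]
sorted_reads = sort_reads_by_gene(example_hits)
efficiency = enrichment_efficiency(sorted_reads, total_reads=4)
```

In this toy input, read3 is rejected by the E-value threshold and a fourth read has no hit at all, so two of four reads count as on target. A read hitting several references for the same gene is counted once.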
For the Artocarpus reads, an average of 71.9% of reads were on target (range 64.4–79.9%), based on the BLASTX method. Enrichment efficiency was lower for some of the outgroup samples, which ranged from just 5.0% for Antiaropsis K. Schum. to 71.6% for Ficus L. To address whether the presence of duplicate reads affects our estimate of enrichment efficiency, we removed paired duplicate reads using SuperDeduper (http://dstreett.github.io/Super-Deduper/). Most samples had between 6% and 18% duplicate read pairs, and a similar percentage of the duplicate read pairs mapped to the target loci (Appendix S1). One outlier was Ficus, which had 34% duplicate reads, 42% of which mapped to targets. After adjusting for duplicate reads, our estimates of enrichment efficiency were reduced by about 4% on average (Table 1). Removing duplicate reads did not affect the extraction of exon sequences in HybPiper for this data set.
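One way to see how duplicate removal can shift an enrichment estimate is the short sketch below. It is an illustration under stated assumptions rather than the calculation used for Table 1: the function names and all counts are hypothetical, and the duplicate counts are taken to be the numbers of reads discarded by deduplication.

```python
def percent_on_target(on_target_reads, total_reads):
    """Enrichment efficiency as a percentage of reads on target."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * on_target_reads / total_reads


def efficiency_before_and_after_dedup(total, on_target,
                                      dup_removed, dup_removed_on_target):
    """Return (before, after, change) in percentage points.

    dup_removed is the number of reads discarded as duplicates;
    dup_removed_on_target is how many of those were on target.
    """
    before = percent_on_target(on_target, total)
    after = percent_on_target(on_target - dup_removed_on_target,
                              total - dup_removed)
    return before, after, after - before


# Hypothetical counts, not taken from the study.
before, after, change = efficiency_before_and_after_dedup(
    total=1000, on_target=700, dup_removed=200, dup_removed_on_target=180
)
```

With these invented counts, the estimate falls because the discarded duplicates are on target more often than the remaining reads; if duplicates were on target at the same rate as other reads, the estimate would not change.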
The phylogenetic distance to Artocarpus did not seem related to percent enrichment. However, the two outgroup samples that were pooled in a hybridization with Artocarpus in the first sequencing run had much lower enrichment efficiency than ingroup samples (Table 1). This suggests that multiplexing at the hybridization stage should be nonrandom, and only libraries of taxa that are relatively equidistant from the taxa used to design the bait sequences should be pooled. This strategy has been previously recommended in other studies (McGee et al., 2016).