To generate reference sequences for repetitive element families from the Fritillaria genomes, we performed graph‐based clustering of unique nuclear 454 reads using the repeatexplorer pipeline via galaxy (Novák et al., 2010, 2013). Clustering was performed separately for F. affinis and F. imperialis to create a reference set of repeat families for each. Initial runs of repeatexplorer revealed that the number of F. affinis reads that could be clustered was limited by the presence of a relatively high‐abundance tandem repeat (corresponding to the FriSAT1 repeat identified by Ambrožová et al., 2011). The number of reads that can be analysed simultaneously by repeatexplorer is governed by the number of similarity hits produced, as all read overlaps are loaded into the computer memory during the graph‐based clustering step (Novák et al., 2013). Consequently, the maximum number of reads that can be processed does not differ greatly between, for example, 200 and 400 bp reads (it is recommended that reads of the same length are used), allowing coverage to be increased by analysing longer reads. Therefore, to maximize the genome coverage for F. affinis, clustering was performed on 400 bp reads; custom Perl scripts were used to trim reads of > 400 bp from the 3′ end and to remove any reads of < 400 bp.
For F. affinis, all 400 bp reads were input into repeatexplorer, allowing it to randomly subsample the data set to the maximum number of reads that could be processed (830 674 of the 1 056 953 available 400 bp reads were used). A random sample of 400 bp reads (842 670) from F. imperialis was taken using the sequence sampling tool (v1.0.0) in repeatexplorer to create a data set providing the same level of genome coverage (0.74%) as for F. affinis. The clustering pipeline was run with ≥ 220 bp overlap for clustering and ≥ 160 bp overlap for assembly.
All clusters containing ≥ 0.01% of the input reads were examined manually to identify clusters that required merging (i.e. where there was evidence that a single repeat family had been split over multiple clusters). Clusters were merged if they met the following criteria: they formed connected components with a significant number of similarity hits between the clusters (e.g. in a pair of clusters, 5% of the reads in the smaller cluster had Blast hits to reads in the larger cluster); they were of the same repeat type (e.g. Copia LTR retrotransposons); and they would be merged in a logical position (e.g. for repetitive elements containing conserved domains, clusters were only merged if it would result in the conserved domains being joined in the correct order). The reclustering pipeline was run using ≥ 160 bp overlap for assembly, and the merged clusters were examined manually to verify that all domains were in the correct orientation.
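As an illustration of the first merging criterion only, the following Python sketch screens pairs of clusters for the proportion of reads in the smaller cluster that have similarity hits to reads in the larger cluster. It is not the procedure used in repeatexplorer or in this study: the function names, the input structures (a mapping of cluster name to read identifiers, and a list of read pairs with a similarity hit) and the default 5% threshold taken from the example above are all assumptions made for illustration. The remaining criteria (same repeat type, logical domain order) were assessed manually and are not modelled.

```python
def linked_fraction(smaller, larger, hits):
    """Fraction of reads in `smaller` with at least one hit to a read in `larger`.

    `hits` is an iterable of (read_a, read_b) pairs; direction is ignored.
    """
    smaller, larger = set(smaller), set(larger)
    linked = set()
    for read_a, read_b in hits:
        if read_a in smaller and read_b in larger:
            linked.add(read_a)
        elif read_b in smaller and read_a in larger:
            linked.add(read_b)
    return len(linked) / len(smaller) if smaller else 0.0


def merge_candidates(clusters, hits, min_fraction=0.05):
    """Return pairs of cluster names that pass the similarity-hit screen.

    `clusters` maps a (hypothetical) cluster name to its read identifiers.
    The 0.05 default mirrors the 5% example in the text; it is an assumption,
    not a fixed rule.
    """
    hits = list(hits)  # allow the hits to be scanned once per cluster pair
    names = sorted(clusters)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            # Identify which cluster of the pair holds fewer reads.
            small, large = sorted((first, second), key=lambda n: len(clusters[n]))
            fraction = linked_fraction(clusters[small], clusters[large], hits)
            if fraction >= min_fraction:
                pairs.append((first, second))
    return pairs
```

Pairs returned by `merge_candidates` would still need manual inspection against the other two criteria before any merge.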
Clusters were annotated in repeatexplorer according to hits from Blast searches to the repeatmasker Viridiplantae database and to a database of conserved domains; where a substantial number of reads matched the same repeat type (e.g. 20% of reads in the cluster matching a Gypsy LTR retrotransposon) these annotations were retained. For clusters that were not annotated in repeatexplorer (i.e. no significant Blast hits), or where only very few reads had a Blast hit or separate reads matched different repeat types (i.e. inconsistent Blast hits), contigs were searched against GenBank using Blastn and Blastx (Altschul et al., 1997) and submitted to Tandem Repeats Finder (Benson, 1999).
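The decision of whether to retain a repeatexplorer annotation can be sketched as a simple majority test. The Python example below is an assumption-laden illustration, not the authors' procedure: the function name, the input mapping of read identifier to repeat type, and the 20% default threshold (taken from the example in the text, where "substantial" is not defined more precisely) are all hypothetical.

```python
from collections import Counter


def consensus_annotation(read_hits, n_reads, min_fraction=0.20):
    """Return (repeat_type, fraction) for the most frequent annotation in a cluster.

    `read_hits` maps read id -> repeat type of its Blast hit; reads without a
    hit are simply absent. `n_reads` is the total number of reads in the cluster.
    repeat_type is None when no type reaches `min_fraction`, which marks the
    cluster for follow-up searches (the 0.20 default is an assumption).
    """
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    counts = Counter(read_hits.values())
    if not counts:
        return None, 0.0
    repeat_type, n_matching = counts.most_common(1)[0]
    fraction = n_matching / n_reads
    if fraction >= min_fraction:
        return repeat_type, fraction
    return None, fraction
```

A cluster for which the function returns `None` corresponds to the cases described above (no hits, very few hits or inconsistent hits), for which contigs were examined further.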
To calculate the proportion of the genome (genome proportion, GP) composed of each repeat family (i.e. cluster), we conducted Blast searches of all unique nuclear reads (Table S4) against databases of the contigs from the clustering analysis. GP was calculated for all clusters containing ≥ 0.05% of the reads input into repeatexplorer (Tables S5, S6; we refer to these as the ‘top’ repeat families); we used ≥ 0.05% of reads as a cut‐off because these clusters contain > 165 kb of data, which is sufficient to provide several‐fold coverage for most known repetitive elements (e.g. see http://gydb.org), and they can therefore be expected to represent complete elements. Contigs from all clusters were used to create separate custom Blast databases for F. affinis and F. imperialis using the makeblastdb tool in Blast+ (v2.2.24+; Camacho et al., 2009). The unique nuclear read data sets from each of the 10 species sequenced (Table S4) were searched against each database using megablast in the Blastn tool in Blast+ (v2.2.24+). To capture the maximum number of hits, searches were conducted with a relaxed E‐value cut‐off of 100 and no filter for low‐complexity sequence (further increases to the E‐value cut‐off did not result in additional hits); a single hit was recorded for each read. Blast results were then filtered using a custom Perl script to retain only those hits in which ≥ 55% of the query read matched one of the contigs, with ≥ 90% similarity between the query and subject in the matching portion.
We calculated the GP from the filtered Blast hits using a custom Perl script. For each contig, the number of bases of the query sequence participating in the top high‐scoring pair for each Blast hit was summed to give the total number of bp representing each contig in the data sets of unique nuclear reads. For each cluster, the number of bp for all of its contigs was summed and expressed as a percentage of the total data set size (i.e. total number of bp in the set of unique nuclear reads; Table S4) to give the value for GP. The genomic abundance of each cluster in Mb was calculated as follows: (total Mb of cluster in data set × genome size in Mb) / data set size in Mb. GP and Mb estimates for the top clusters in F. affinis and F. imperialis are shown in Tables S5 and S6.
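The filtering and GP calculation described above can be summarised in a short Python sketch. This is not the custom Perl script used in the study; it is a minimal illustration under stated assumptions. Each hit is assumed to be a dictionary holding one top high‐scoring pair per read, with keys named after the Blast+ tabular output fields `sseqid`, `pident`, `qstart`, `qend` and `qlen`; the contig‐to‐cluster mapping and all function names are hypothetical. Bases are summed directly per cluster, which gives the same total as summing per contig and then per cluster.

```python
from collections import defaultdict


def aligned_query_bp(hit):
    """Number of query bases in the high-scoring pair (inclusive coordinates)."""
    return abs(hit["qend"] - hit["qstart"]) + 1


def passes_filter(hit, min_query_cov=55.0, min_identity=90.0):
    """Keep hits with >= 55% of the read aligned at >= 90% identity."""
    query_cov = 100.0 * aligned_query_bp(hit) / hit["qlen"]
    return query_cov >= min_query_cov and hit["pident"] >= min_identity


def genome_proportions(hits, contig_to_cluster, dataset_bp):
    """Return {cluster: GP in %} from filtered hits.

    `contig_to_cluster` is a hypothetical mapping of contig id -> cluster id;
    `dataset_bp` is the total bp in the set of unique nuclear reads.
    """
    bp_per_cluster = defaultdict(int)
    for hit in hits:
        if passes_filter(hit):
            cluster = contig_to_cluster[hit["sseqid"]]
            bp_per_cluster[cluster] += aligned_query_bp(hit)
    return {c: 100.0 * bp / dataset_bp for c, bp in bp_per_cluster.items()}


def abundance_mb(cluster_mb_in_dataset, genome_size_mb, dataset_mb):
    """Genomic abundance in Mb, scaled from the sampled data set to the genome."""
    return cluster_mb_in_dataset * genome_size_mb / dataset_mb
```

Because the abundance formula is a simple rescaling, it is equivalent to multiplying GP (expressed as a fraction) by the genome size.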