All assemblies were evaluated using metrics generated by the Transrate program (v1.0.3) [53]. For each assembly, a Transrate score was calculated from the Trimmomatic quality-trimmed reads, prior to digital normalization. The score is the geometric mean of all contig scores multiplied by the proportion of input reads providing positive support for the assembly [50]. Comparative metrics were calculated with Transrate for each MMETSP sample between the DIB and NCGR assemblies using the Conditional Reciprocal Best Basic Local Alignment Search Tool hits (CRBB) algorithm [54]. For the forward comparison, the NCGR assembly was used as the reference and each DIB re-assembly as the query. For the reverse comparison, each DIB re-assembly was used as the reference and the NCGR assembly as the query.
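As an illustration of how the assembly score combines its two components, the following minimal Python sketch computes a geometric mean of contig scores and multiplies it by the fraction of supporting reads. It is not Transrate's implementation. The function name `assembly_score` and its parameters are hypothetical, and the contig scores and read counts are assumed to have been obtained elsewhere.

```python
import math


def assembly_score(contig_scores, reads_supporting, reads_total):
    """Geometric mean of contig scores times the proportion of supporting reads.

    Hypothetical helper for illustration only; inputs are assumed to be
    per-contig scores in [0, 1] and raw read counts.
    """
    if not contig_scores:
        raise ValueError("at least one contig score is required")
    if reads_total <= 0:
        raise ValueError("reads_total must be positive")
    # A single zero score drives the geometric mean to zero.
    if any(score <= 0 for score in contig_scores):
        return 0.0
    # Average in log space to avoid underflow with many contigs.
    mean_log = sum(math.log(score) for score in contig_scores) / len(contig_scores)
    geometric_mean = math.exp(mean_log)
    return geometric_mean * (reads_supporting / reads_total)
```

Because the geometric mean is taken over all contigs, a few very poorly supported contigs lower the score more than they would under an arithmetic mean.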
Benchmarking Universal Single-Copy Orthologs (BUSCO) software (version 3) was used to search the open reading frames (ORFs) in the assemblies against a database of 215 orthologous genes specific to protists and a database of 303 genes specific to Eukaryota. BUSCO scores are frequently used as one measure of assembly completeness [55].
To assess the occurrence of fixed-length words in the assemblies, the number of unique 25-mers in each assembly was estimated using the HyperLogLog (HLL) cardinality estimator built into the khmer software package [56]. We used the HLL function to digest each assembly and count the number of distinct fixed-length substrings of DNA (k-mers).
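The quantity being estimated can be made concrete with a small Python sketch that counts distinct k-mers exactly using a set. This is a stand-in for illustration, not the khmer HLL estimator: HLL approximates the same count in a small, fixed amount of memory, whereas the exact set grows with the number of distinct k-mers. The function name `count_distinct_kmers` is hypothetical, and the choices to skip k-mers containing non-ACGT characters and to treat the two strands separately are assumptions.

```python
def count_distinct_kmers(sequences, k=25):
    """Exact count of distinct k-mers across an iterable of DNA sequences.

    Illustrative only: k-mers with characters other than A, C, G, T are
    skipped, and reverse complements are NOT collapsed (both assumptions).
    """
    valid_bases = set("ACGT")
    seen = set()
    for sequence in sequences:
        sequence = sequence.upper()
        # A sequence shorter than k yields an empty range and no k-mers.
        for start in range(len(sequence) - k + 1):
            kmer = sequence[start:start + k]
            if set(kmer) <= valid_bases:
                seen.add(kmer)
    return len(seen)
```

For example, the contig `ACGTACGT` with `k=4` contains five 4-mer positions but only four distinct 4-mers, because `ACGT` occurs twice.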
Unique gene names were compared from a random subset of 296 samples using the dammit annotation pipeline [49]. If a gene name was annotated in NCGR but not in DIB, it was considered a gene uniquely annotated in NCGR. Unique gene names were normalized to the total number of annotated genes in each assembly.
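The comparison of gene names amounts to a set difference followed by a normalization, which can be sketched in Python as follows. The function name `normalized_unique_names` and the example gene names are hypothetical, and the sketch assumes that gene names have already been extracted from the annotation output as plain strings. The denominator is passed in explicitly because the source does not specify how the total number of annotated genes was counted.

```python
def normalized_unique_names(names_in_a, names_in_b, total_annotated_a):
    """Fraction of assembly A's annotated genes whose names are absent from B.

    Hypothetical helper: `total_annotated_a` is supplied by the caller, since
    the exact counting rule for annotated genes is an assumption here.
    """
    if total_annotated_a <= 0:
        raise ValueError("total_annotated_a must be positive")
    # Names present in A but not in B are unique to A.
    unique_to_a = set(names_in_a) - set(names_in_b)
    return len(unique_to_a) / total_annotated_a


# Made-up example names, for illustration only.
ncgr_names = ["geneA", "geneB", "geneC", "geneD"]
dib_names = ["geneB", "geneC", "geneE"]
unique_in_ncgr = normalized_unique_names(ncgr_names, dib_names, len(ncgr_names))
unique_in_dib = normalized_unique_names(dib_names, ncgr_names, len(dib_names))
```

Normalizing by each assembly's own total keeps the two directions comparable when one assembly has many more annotated genes than the other.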
A Tukey’s honestly significant difference (HSD) post-hoc range test of multiple pairwise comparisons was used in conjunction with an analysis of variance to measure differences between distributions of data from the top eight most-represented phyla (Bacillariophyta, Dinophyta, Ochrophyta, Haptophyta, Ciliophora, Chlorophyta, Cryptophyta, and Others) using the agricolae package version 1.2-8 in R version 3.4.2 (2017-09-28). Margins sharing a letter in the group label are not significantly different at the 5% level (refer to Fig.