To benchmark ViralRecall on non-NCLDV viral sequences, we used a database of 879 non-NCLDV dsDNA genomes downloaded from NCBI. These genomes were selected because they are reference dsDNA viruses listed in the ICTV Virus Metadata Resource [39] and their genomes were available in NCBI RefSeq. Because the GVOG and Pfam normalization scores had been generated using reference Caudovirales genomes in NCBI, we did not use genomes from this group that were present in the Virus Metadata Resource. Instead, we used a set of 336 previously reported jumbo bacteriophages (Caudovirales) [40]. Additionally, we did not include any Lavidaviridae (virophages) in this set, because these viruses parasitize giant viruses and, in some cases, may exchange genes with them [41]. To generate pseudocontigs for benchmarking, we used the gt-shredder command in genometools [42].
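The idea behind pseudocontig generation can be illustrated with a minimal Python sketch. This is not the genometools implementation: the function name `shred`, the uniform length distribution, and the length bounds are assumptions chosen for illustration, and the fragment sizes used in the benchmark are not stated here.

```python
import random


def shred(sequence, min_len, max_len, seed=0):
    """Cut a sequence into consecutive, non-overlapping fragments.

    Fragment lengths are drawn uniformly from [min_len, max_len]
    (an assumption made for this sketch). A trailing piece shorter
    than min_len is discarded.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    fragments = []
    start = 0
    # Continue while at least one minimum-length fragment still fits.
    while len(sequence) - start >= min_len:
        length = rng.randint(min_len, max_len)
        # Slicing truncates at the sequence end, so the final fragment
        # may be shorter than the drawn length but never below min_len.
        fragment = sequence[start:start + length]
        fragments.append(fragment)
        start += len(fragment)
    return fragments


# Hypothetical toy genome of 10,000 bp
pseudocontigs = shred("ACGT" * 2500, min_len=500, max_len=1500)
print(len(pseudocontigs), "pseudocontigs generated")
```

Each pseudocontig would then be passed to ViralRecall as if it were an independently assembled contig.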
To benchmark ViralRecall on NCLDV sequences, we used a set of 1548 genomes from the NCLDV database described above. This set included all genomes except those used in the construction of GVOGs, because those would not provide an unbiased assessment of the sensitivity of ViralRecall in detecting NCLDV sequences. For benchmarking, ViralRecall was run with default parameters, except that the -c flag was used to generate mean contig-level scores.
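As a rough illustration of what a mean contig-level score represents, the sketch below averages per-ORF scores by contig. It assumes that each ORF carries a single numeric score and that the contig-level value is their arithmetic mean; the function name `mean_contig_scores` and the input layout are hypothetical and do not reproduce ViralRecall's internal code.

```python
from collections import defaultdict


def mean_contig_scores(orf_scores):
    """Average per-ORF scores for each contig.

    orf_scores: iterable of (contig_id, score) pairs, one per ORF
    (a hypothetical input layout for this sketch).
    Returns a dict mapping contig_id to the mean score of its ORFs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for contig_id, score in orf_scores:
        totals[contig_id] += score
        counts[contig_id] += 1
    return {contig_id: totals[contig_id] / counts[contig_id]
            for contig_id in totals}


# Hypothetical scores: positive values suggest an NCLDV-like signal
example = [("contig_1", 2.0), ("contig_1", 4.0), ("contig_2", -1.0)]
print(mean_contig_scores(example))
```

Summarizing at the contig level gives one value per sequence, which makes it straightforward to compare NCLDV and non-NCLDV pseudocontigs in a benchmark.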
For illustrative purposes, we selected the following five NCLDV genomes from diverse families and provide the results generated by ViralRecall (shown in Figure 3): Acanthamoeba castellanii Medusavirus [43], Emiliania huxleyi virus 86 [7], Pithovirus sibericum [44], M. separata entomopoxvirus [45], and Hyperionvirus [46]. We also selected the genomes of four non-NCLDV dsDNA viruses for this purpose: the jumbo bacteriophages with the highest mean score, lowest mean score, and longest length of those tested (FFC_PHAGE_43_1208, M01_PHAGE_56_67, and LP_PHAGE_CIR-CU-CL_32_18, respectively), as well as human herpesvirus 3 strain Dumas [47]. Lastly, we show the profiles for Yaravirus [48], a virus of A. castellanii with unclear evolutionary provenance, and the Sputnik virophage [41]. None of the viral genomes shown here were used in the construction of the GVOG database or for score normalization, and they therefore provide an unbiased assessment of ViralRecall results. For manual inspection of proteins encoded in contigs derived from suspected contamination, we performed homology searches against RefSeq v. 93 using BLASTP+ [49].
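The selection of the three jumbo bacteriophages by highest mean score, lowest mean score, and longest length can be sketched as follows. The record type `GenomeSummary`, the function `pick_representatives`, and the example values are hypothetical and serve only to make the selection criteria explicit.

```python
from typing import NamedTuple


class GenomeSummary(NamedTuple):
    name: str
    mean_score: float  # mean contig-level score for the genome
    length: int        # genome length in bp


def pick_representatives(genomes):
    """Return the genomes with the highest mean score, the lowest
    mean score, and the longest length, in that order."""
    highest = max(genomes, key=lambda g: g.mean_score)
    lowest = min(genomes, key=lambda g: g.mean_score)
    longest = max(genomes, key=lambda g: g.length)
    return highest, lowest, longest


# Hypothetical summaries; these are not values from the benchmark
phages = [
    GenomeSummary("phage_a", 1.0, 300_000),
    GenomeSummary("phage_b", -5.0, 250_000),
    GenomeSummary("phage_c", -2.0, 600_000),
]
for genome in pick_representatives(phages):
    print(genome.name)
```

Choosing the extremes of the score distribution, rather than arbitrary examples, shows the range of profiles that non-NCLDV genomes can produce.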
To illustrate how ViralRecall can be used to identify NCLDV signatures in eukaryotic genomes, we analyzed the Hydra vulgaris, Bigelowiella natans, and Asterochloris glomerata genomes. Previous studies have already established NCLDV signatures in these genomes [19,50,51], and our results therefore provide independent verification.