We downloaded 735,106 contigs >5 kb from IMG/VR 2.0 (ref. 25), after excluding viral genomes from cultivated isolates and proviruses identified from microbial genomes. We also downloaded 488,131 contigs that were >5 kb or circular from the GOV 2.0 dataset (ref. 6) (datacommons.cyverse.org/browse/iplant/home/shared/iVirus/GOV2.0). These contigs were used as input to CheckV to estimate completeness, identify host–virus boundaries and predict closed genomes. When running the completeness module, we excluded perfect matches (100% AAI and 100% AF) to prevent any DTR contig from matching itself in the database (since IMG/VR 2.0 and GOV 2.0 were used as data sources to form the CheckV database). A Circos plot (ref. 61) was used to link IMG/VR contigs to their top matches in the CheckV database. Protein-coding genes were predicted from proviruses using Prodigal and compared to HMMs from KEGG Orthology (release 2 October 2019; ref. 45) using hmmsearch from the HMMER package v.3.1b2 (E-value ≤1 × 10⁻⁵ and score ≥30). Pfam domains with the keywords ‘integrase’ and ‘recombinase’ were also identified across all proviruses.
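The exclusion of perfect matches described above can be illustrated with a minimal Python sketch. It is not CheckV's implementation: the `Hit` record, its field names and the function `drop_perfect_matches` are hypothetical, and AAI and AF are assumed to be expressed as percentages.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one query-to-database comparison."""
    query: str
    target: str
    aai: float  # average amino acid identity, in percent (assumed scale)
    af: float   # alignment fraction, in percent (assumed scale)


def drop_perfect_matches(hits: Iterable[Hit]) -> List[Hit]:
    """Keep only hits that are not perfect on both AAI and AF.

    A contig that is itself part of the database would otherwise be
    matched to its own entry at 100% AAI and 100% AF.
    """
    return [h for h in hits if not (h.aai >= 100.0 and h.af >= 100.0)]


hits = [
    Hit("contig_1", "contig_1_db_copy", 100.0, 100.0),  # self-match, dropped
    Hit("contig_1", "ref_A", 98.5, 100.0),              # kept: AAI below 100%
    Hit("contig_1", "ref_B", 100.0, 92.0),              # kept: AF below 100%
]
kept = drop_perfect_matches(hits)
```

A hit is removed only when both values are perfect, so a reference that is identical over only part of the query remains available as a match.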
The largest DTR contig we identified from IMG/VR was further annotated to illustrate the type of virus and genome organization represented (IMG ID: 3300025697_____Ga0208769_1000001). Coding sequence predictions and functional annotations were obtained from IMG (ref. 35). Annotations for virus hallmark genes, including a terminase large subunit (TerL) and major capsid protein, were confirmed via HHPred v.3.2.0 (ref. 62) (databases included PDB 70_8, SCOPe70 2.07, Pfam-A 32.0 and CDD 3.18, score >98). A circular genome map was drawn with CGView (ref. 63). To place this contig in an evolutionary context, we built a TerL phylogeny including the most closely related sequences from a global search for large phages (ref. 42). The TerL amino acid sequence from the DTR contig was compared to all TerL sequences from the ‘huge phage’ dataset via blastp (E-value ≤1 × 10⁻⁵, score ≥50) to identify the 30 most similar sequences (sorted based on blastp bit-score). These reference sequences and DTR contigs were aligned with MAFFT v.7.407 (ref. 64) using default parameters, the alignment was automatically cleaned with trimAL v.1.4.rev15 (ref. 65) with the option ‘--gappyout’, and a phylogeny was built with IQ-Tree v.1.5.5 (ref. 66) with default model selection (optimal model suggested: LG+R4). The resulting tree was visualized with iToL (ref. 67).
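The selection of the 30 most similar TerL sequences can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' script: it assumes blastp results in the default 12-column tabular format (`-outfmt 6`, with E-value in column 11 and bit-score in column 12), it treats the "score ≥50" threshold as a bit-score threshold, and the function name `top_hits` is hypothetical.

```python
from typing import Iterable, List, Tuple


def top_hits(
    lines: Iterable[str],
    max_evalue: float = 1e-5,
    min_bitscore: float = 50.0,
    n: int = 30,
) -> List[Tuple[str, float]]:
    """Return up to n subject IDs ranked by their best bit-score.

    Assumes tab-separated BLAST output with the default outfmt 6 columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore.
    """
    best = {}  # subject ID -> best passing bit-score seen so far
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject = fields[1]
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if evalue > max_evalue or bitscore < min_bitscore:
            continue  # fails the E-value or score threshold
        if bitscore > best.get(subject, float("-inf")):
            best[subject] = bitscore
    # Rank subjects from highest to lowest bit-score and keep the top n
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


rows = [
    "terL\tA\t90.0\t500\t10\t0\t1\t500\t1\t500\t1e-100\t900",
    "terL\tB\t40.0\t200\t100\t5\t1\t200\t1\t200\t1e-3\t45",
    "terL\tC\t60.0\t400\t100\t5\t1\t400\t1\t400\t1e-50\t300",
    "terL\tA\t50.0\t100\t50\t0\t1\t100\t1\t100\t1e-10\t80",
]
selected = top_hits(rows)
```

Keeping only the best-scoring alignment per subject prevents a reference with several local alignments from occupying more than one of the 30 slots.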