The 25,000 viral protein families (VPFs) used to identify UViGs were queried against the ViralZone database (12), where viral hosts were predicted at different taxonomic levels. Of these, 11,400 VPFs had at least one hit to the virus genomes, with an average of 6.8 hits per model. For each VPF, a score between 0 and 1 was obtained by dividing the number of hits with a uniform distribution (i.e. present in a single host domain only) by the total number of hits for that VPF [score = (#uniform hits/#total hits)]. When the total number of hits was below the average number of hits, we corrected the score as follows: [(#uniform hits/#total hits) × (#total hits/average #hits)].
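The scoring rule above can be sketched in a few lines of Python. This is only an illustration of the two bracketed formulas: the function name `vpf_host_score` and its parameters are hypothetical, and the counts of uniform and total hits are assumed to have been tallied beforehand from the search results.

```python
AVERAGE_HITS = 6.8  # average number of hits per model, as reported in the text


def vpf_host_score(uniform_hits, total_hits, average_hits=AVERAGE_HITS):
    """Return the host-specificity score (0 to 1) for one VPF.

    uniform_hits: hits falling in a single host domain (assumed precomputed).
    total_hits:   all hits of the VPF to the reference virus genomes.
    """
    if total_hits <= 0:
        raise ValueError("a score is only defined for VPFs with at least one hit")
    if not 0 <= uniform_hits <= total_hits:
        raise ValueError("uniform_hits must lie between 0 and total_hits")

    score = uniform_hits / total_hits
    # Correction for VPFs with fewer hits than the average model.
    if total_hits < average_hits:
        score *= total_hits / average_hits
    return score


if __name__ == "__main__":
    for uniform, total in [(7, 7), (3, 3), (6, 8)]:
        print(uniform, total, round(vpf_host_score(uniform, total), 3))
```

Under this reading, a VPF can reach the maximum score of 1.0 only if all of its hits fall in one host domain and it has at least as many hits as the average model.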
A total of 3,788 VPFs were assigned the maximum score of 1.0; these are models found in at least seven known viral genomes with a uniform domain distribution. The presence of these VPFs across the UViGs allowed us to separate 65% of the viral genomes into prokaryotic viruses (bacteriophages and archaeal viruses) and eukaryotic viruses.
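One way such a separation could work is sketched below. The text does not describe how individual genomes were assigned, so everything here is an assumption made for illustration: the function `classify_genome`, the domain labels, and the rule that a genome carrying 1.0-score VPFs from both groups is left unassigned.

```python
# Hypothetical mapping from host domain to the two groups named in the text.
DOMAIN_TO_GROUP = {
    "bacteria": "prokaryotic",
    "archaea": "prokaryotic",
    "eukaryota": "eukaryotic",
}


def classify_genome(genome_vpfs, top_vpf_domain):
    """Assign a genome to 'prokaryotic' or 'eukaryotic', or return None.

    genome_vpfs:    iterable of VPF identifiers detected in the genome.
    top_vpf_domain: dict mapping each 1.0-score VPF to its single host domain.
    """
    groups = set()
    for vpf in genome_vpfs:
        domain = top_vpf_domain.get(vpf)
        if domain is None:
            continue  # not a 1.0-score VPF, so it carries no host signal here
        groups.add(DOMAIN_TO_GROUP[domain])

    # Assumption: only an unambiguous signal yields an assignment.
    if len(groups) == 1:
        return next(iter(groups))
    return None


if __name__ == "__main__":
    top = {"VPF_a": "bacteria", "VPF_b": "archaea", "VPF_c": "eukaryota"}
    print(classify_genome(["VPF_a", "VPF_b", "VPF_x"], top))
    print(classify_genome(["VPF_a", "VPF_c"], top))
```

Genomes with no 1.0-score VPF, or with conflicting ones, stay unclassified in this sketch, which would be one possible explanation for why only part of the genomes could be separated.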
This approach was benchmarked against the host assignments of viral genomes containing pVOGs (13) with ≥95% homology, based on hhsearch (14), to our 1.0-score VPFs (2,037 pVOGs). Our classification was consistent with the classification in the pVOG database in 98.6% of cases. The remaining 1.4% were viruses annotated as ‘archaea-bacteria’ viruses in the pVOG database that our approach identified as either bacterial or archaeal. Because these are all prokaryotic viruses, we estimate that the method was 100% consistent in separating prokaryotic from eukaryotic viruses.