These complete and near-complete genomes were used as input for phylogenetic trees and genome-wide clustering to establish groups and potential taxa within the virophages. For phylogenetic trees, the sequences of the four morphogenesis genes detected in the 257 complete and near-complete genomes using the new HMM profiles (see above) were used after excluding all sequences that covered <60% of the HMM profile to remove partial gene predictions. Multiple alignments were then built for each gene using an iterative clustering-alignment-phylogeny procedure specifically adapted for aligning highly diverging sequences [54 (link)]. The alignments were then automatically trimmed using clipkit v1.3.0 [55 (link)] using the kpi-smart-gap mode to remove uninformative positions, and the trimmed alignments were used as input for tree building with IQ-Tree v2.2.0.3 [56 (link)] with automatic detection of the most appropriate substitution matrix, and 1000 replicates of ultra-fast bootstraps. The best-fit model was Q.pfam+F+R7 for PRO, Q.yeast+F+R8 for ATPase, and Q.pfam+F+R8 for both MCP and penton. For the larger MCP phylogeny, including both complete and partial virophage genomes (
Genome-wide amino acid identity (AAI) clustering was performed as in [57 (link)]. Briefly, predicted protein sequences from the 257 complete and near-complete virophages were compared all-vs-all using diamond v0.9.24.125 [58 (link)] and the following options: “--evalue 1e-5 --max-target-seqs 10,000–query-cover 50–subject-cover 50”. The resulting file was used as input for the script “amino_acid_identity.py” to calculate the average AAI for all pairs of genomes. The script “filter_aai.py” was then used to select only pairs of genomes with a minimum normalized cumulative bit score of 0.05. Finally, these selected pairwise AAI values were used as input for an MCL clustering using MCL 14-137 (inflation parameter = 1.1) [50 ].