Initial studies were designed to assess whether the reference strains could accurately classify a set of well-classified (‘gold standard’) genomic sequences. The reference strains were initially selected from strains curated at the RNA Virus Database (17) (http://virus.zoo.ox.ac.uk/rnavirusdb/) and at the Los Alamos HIV (http://www.hiv.lanl.gov) and HCV (http://www.hcv.lanl.gov) Sequence Databases. Individual NJ trees were constructed for each test genome together with its appropriate reference set. Phylogenetic analyses were performed separately on each complete HIV-1, HCV and HBV genome, as well as on sub-genomic regions of HTLV-1, HPV and HHV8. Test sequences in the ‘gold standard’ dataset were considered to be accurately classified if they clustered within a known genotype, or sub-genotype, with a bootstrap value >70%. Fragments as large as 1000 nt in length were successfully genotyped using our genotyping tools. Reference alignments of complete and sub-genomic gold standard sequences that gave a bootstrap value of >95% were deemed suitable for routine use (16).
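The classification criterion described above can be illustrated with a minimal Python sketch. It is not the tool's actual implementation. It assumes that, for each bootstrap replicate, the genotype cluster containing the test sequence has already been determined (with `None` where the sequence fell outside every reference cluster). The function and constant names are hypothetical; only the 70% and 95% thresholds come from the text.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

# Thresholds taken from the text; both comparisons are strict (">").
CLASSIFY_THRESHOLD = 70.0  # percent support needed to accept a genotype call
ROUTINE_THRESHOLD = 95.0   # percent support needed for routine-use alignments


def bootstrap_support(
    replicate_clusters: Sequence[Optional[str]],
) -> Tuple[Optional[str], float]:
    """Return the most frequent cluster and its support as a percentage.

    Each element is the genotype the test sequence clustered with in one
    bootstrap replicate, or None if it clustered with no known genotype.
    """
    counts = Counter(c for c in replicate_clusters if c is not None)
    if not counts:
        return None, 0.0
    genotype, n = counts.most_common(1)[0]
    return genotype, 100.0 * n / len(replicate_clusters)


def is_accurately_classified(
    replicate_clusters: Sequence[Optional[str]], known_genotype: str
) -> bool:
    """True if the gold standard sequence is assigned to its known genotype
    with bootstrap support above the classification threshold."""
    genotype, support = bootstrap_support(replicate_clusters)
    return genotype == known_genotype and support > CLASSIFY_THRESHOLD


def suitable_for_routine_use(support: float) -> bool:
    """True if a reference alignment's support exceeds the routine-use cut-off."""
    return support > ROUTINE_THRESHOLD
```

For example, a test sequence that clusters with subtype B in 85 of 100 replicates would be accepted as subtype B under this sketch, whereas one with exactly 70% support would not, because the stated criterion is a strict inequality.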
As with all genotyping tools, the accuracy and consistency of the data depend on the selection of appropriate reference sequences. To overcome the limitations of other commonly used methods that employ a single reference sequence or a consensus reference sequence (SIMPLOT, RIP and NCBI-genotyping tools), we used sets of carefully selected, full-length viral genomes to represent each individual subtype and recombinant virus. The first step in the selection of reference strains was to screen published data for highly divergent, but equidistant, genomes that were representative of the diversity within a given subtype or CRF. The selected sequences were then aligned, edited and subjected to phylogenetic analysis using NJ, Bayesian and ML methods (18–20). Sequences that gave similar topologies with all three tree construction methods were retained for further analysis of their sub-genomic regions. In this phase of the evaluation, the sub-genomic regions were assessed using consecutive windows whose size was fixed within each pass but increased between passes, ranging from 200 to 2000 nt. The process began with a window size of 200 nt and was repeated with progressively larger windows until all segments of the genome were classified with a bootstrap value of ≥70%.
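The window-size search described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: the windows are taken to be non-overlapping, the size is assumed to grow in 200 nt increments (the text gives only the 200 to 2000 nt range), and `support_for` is a hypothetical callback standing in for the phylogenetic analysis that returns the bootstrap value for one segment.

```python
from typing import Callable, List, Optional, Tuple


def consecutive_windows(genome_length: int, size: int) -> List[Tuple[int, int]]:
    """Split a genome into non-overlapping half-open (start, end) windows.

    The final window is shorter when the length is not a multiple of size.
    """
    return [
        (start, min(start + size, genome_length))
        for start in range(0, genome_length, size)
    ]


def smallest_reliable_window(
    genome_length: int,
    support_for: Callable[[int, int], float],  # hypothetical scoring callback
    min_size: int = 200,      # from the text
    max_size: int = 2000,     # from the text
    increment: int = 200,     # assumption: the step size is not stated
    threshold: float = 70.0,  # from the text (bootstrap value >= 70%)
) -> Optional[int]:
    """Return the smallest window size at which every segment of the genome
    reaches the bootstrap threshold, or None if no tested size does."""
    for size in range(min_size, max_size + 1, increment):
        windows = consecutive_windows(genome_length, size)
        if all(support_for(start, end) >= threshold for start, end in windows):
            return size
    return None
```

In practice `support_for` would build a tree for the reference alignment restricted to the given coordinates and report the bootstrap value of the relevant clade; here it is left abstract so that any tree-building method can be substituted.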