A set of public phage whole-genome sequences (WGS) was collected in August 2014. First, lists of phage WGS IDs were obtained from the Phages.ids–VBI mirrors page [40], the NCBI viral Genome Resource [41], the EMBL-EBI phage genomes list [42], and the phagesdb databases for Mycobacteriophages [43], Arthrobacter [44], Bacillus [45], and Streptomyces [46]. The resulting unique list of IDs was uploaded to the Batch Entrez service of NCBI to retrieve the corresponding WGS. Furthermore, genome sequences were downloaded from the PhAnToMe genomes database and from NCBI by searching for “(phage [Title]) AND complete genome”.
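The merging of the ID lists into one unique list can be sketched as follows. This is a minimal illustration, not the script used for the study: the function name, the example accessions, and the choice of one accession per line as the upload format are assumptions made here for clarity.

```python
from typing import Iterable, List


def merge_unique_ids(*id_lists: Iterable[str]) -> List[str]:
    """Merge several accession lists into one list without duplicates.

    Whitespace is trimmed, empty entries are skipped, and the order of
    first appearance is preserved.
    """
    seen = set()
    merged = []
    for ids in id_lists:
        for raw in ids:
            acc = raw.strip()
            if not acc or acc in seen:
                continue
            seen.add(acc)
            merged.append(acc)
    return merged


# Hypothetical ID lists standing in for two of the sources named above.
vbi_ids = ["NC_000866", "NC_001416 ", ""]
ncbi_ids = ["NC_001416", "NC_001604"]

unique_ids = merge_unique_ids(vbi_ids, ncbi_ids)

# One accession per line, as plain text that could be saved to a file
# and uploaded (assumed format).
batch_entrez_text = "\n".join(unique_ids)
```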
Only entries that indicated "complete genome" in the DEFINITION field of the GenBank file and whose host taxonomy was specified at least at the genus level were included. Entries annotated as "prophage" in the DEFINITION field were removed. Hosts annotated as Salmonella Typhimurium were re-annotated as Salmonella enterica according to current nomenclature [47]. Finally, only the genus was taken into account for hosts whose species was specified as "sp." followed by an alphanumeric code; for example, Synechococcus sp. WH7803 was re-annotated as Synechococcus. In total, 2196 phages had an annotated host genus (the phages_genus dataset), and of these, 1871 also had an annotated host species (the phages_species dataset). A total of 209 different host species and 129 different genera were represented among the phages (these data are available in HostPhinder’s repository [48]). Figure 1 shows the distribution of hosts in the dataset.
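The inclusion criteria and host re-annotation rules above can be expressed as a short sketch. It assumes that the DEFINITION line and the host name have already been extracted from each GenBank record as plain strings; how the host name was obtained is not specified here. The function names, the case-insensitive matching, and the handling of a bare "sp." without a strain code are illustrative assumptions, not the original implementation.

```python
from typing import Optional, Tuple


def keep_entry(definition: str) -> bool:
    """Keep complete genomes and drop prophages, based on the DEFINITION text."""
    text = definition.lower()  # case-insensitive matching is an assumption
    return "complete genome" in text and "prophage" not in text


def normalize_host(host: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (genus, species) for a host name; species is None if unusable."""
    host = " ".join(host.split())  # collapse repeated whitespace
    if not host:
        return None, None
    # Re-annotate Salmonella Typhimurium as Salmonella enterica.
    if host.lower().startswith("salmonella typhimurium"):
        host = "Salmonella enterica"
    parts = host.split()
    genus = parts[0]
    # "Genus sp. CODE" keeps only the genus. A bare "sp." with no code is
    # treated the same way here, which is an assumption.
    if len(parts) < 2 or parts[1] == "sp.":
        return genus, None
    return genus, f"{parts[0]} {parts[1]}"


# Hypothetical records: (DEFINITION text, host name).
records = [
    ("Enterobacteria phage T4, complete genome.", "Escherichia coli"),
    ("Synechococcus phage S-PM2, complete genome.", "Synechococcus sp. WH7803"),
    ("Escherichia coli prophage, complete genome.", "Escherichia coli"),
]

kept = [normalize_host(host) for definition, host in records if keep_entry(definition)]
phages_genus = [entry for entry in kept if entry[0] is not None]
phages_species = [entry for entry in phages_genus if entry[1] is not None]
```

In this sketch, every entry that passes the filter and has a genus falls into the genus-level set, and the species-level set is the subset with a usable species name, mirroring the relation between the two datasets described above.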