PHAST's prophage sequence database is a custom collection of phage and prophage protein sequences drawn from two sources. The first is the National Center for Biotechnology Information (NCBI) phage database, which includes 46 407 proteins from 598 phage genomes. The second is the prophage database (12), which consists of 159 prophage regions and 9061 proteins not found in the NCBI phage database. Because many of the proteins in the prophage database are actually bacterial proteins, and some have only been identified computationally, we selected only those prophage proteins that have been associated with a clear phage function. This set comprises a total of 379 phage protease, integrase and structural proteins. The resulting PHAST phage library is used to identify putative phage proteins in the query genome via BLASTP (13) searches.
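The selection step described above can be illustrated with a minimal sketch. It is not PHAST's actual implementation: the function names, the id-to-annotation dictionaries and the keyword list are all hypothetical, and the text only states that protease, integrase and structural proteins were retained.

```python
import re

# Hypothetical keyword list (an assumption for illustration). The text names
# only protease, integrase and structural proteins; "capsid", "tail" and
# "portal" stand in here for typical structural annotations.
PHAGE_FUNCTION_PATTERN = re.compile(
    r"\b(protease|integrase|capsid|tail|portal|structural)\b",
    re.IGNORECASE,
)


def has_clear_phage_function(annotation):
    """Return True if the annotation text mentions one of the assumed keywords."""
    return bool(PHAGE_FUNCTION_PATTERN.search(annotation))


def build_phage_library(ncbi_proteins, prophage_proteins):
    """Merge two protein sets into one library.

    Both arguments map a protein id to its annotation string. Every NCBI
    phage protein is kept. A prophage protein is added only if its id is
    not already in the library and its annotation passes the keyword filter.
    """
    library = dict(ncbi_proteins)
    for protein_id, annotation in prophage_proteins.items():
        if protein_id not in library and has_clear_phage_function(annotation):
            library[protein_id] = annotation
    return library
```

For example, given a prophage entry annotated "phage integrase" and another annotated "hypothetical protein", only the first would be added to the library. A keyword filter of this kind is a simplification; the curated set of 379 proteins may have been chosen by other criteria.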
In addition to a custom, self-updating phage sequence library, PHAST also maintains a bacterial sequence library consisting of 1300 non-redundant bacterial genomes/proteomes from all major eubacterial and archaebacterial phyla. This library contains more than four million annotated or partially annotated protein sequences. Relative to the full GenBank protein sequence library (100+ million sequences), the bacterial-specific library is about 25× smaller, so PHAST's genome annotation step (see below) can be completed roughly 25× faster.
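The 25× figure follows directly from the two library sizes quoted above. The short calculation below reproduces it, using the rounded values of 100 million and 4 million sequences. Equating the size ratio with the speed-up assumes that search time scales approximately linearly with the number of database sequences; the function name is hypothetical.

```python
def search_space_ratio(full_library_size, reduced_library_size):
    """Return how many times smaller the reduced library is than the full one."""
    return full_library_size / reduced_library_size


# Rounded sizes taken from the text: 100+ million GenBank protein sequences
# versus more than four million sequences in the bacterial library.
ratio = search_space_ratio(100_000_000, 4_000_000)
print(f"{ratio:.0f}x fewer sequences to search")  # prints "25x fewer sequences to search"
```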