The genome sequencing revolution has radically altered the field of microbiology. Whole-genome sequencing of prokaryotes has been a standard method of study since the first complete genome of a free-living organism, Haemophilus influenzae, was sequenced in 1995 (14). Owing to the widespread use of next-generation sequencing (NGS) techniques, thousands of genomes of prokaryotic species are now available, including genomes of multiple isolates of the same species, typically human pathogens. Thus, the sheer density of comparative genomic information for high-interest organisms provides an opportunity to introduce a pan-genome-based approach to predicting the protein complement of a species.
The collection of prokaryotic genomes available at NCBI is growing exponentially, and this growth shows no signs of abating: as of January 2016, NCBI's assembly resource contains 57 890 genome assemblies representing 8047 species (see the genome browser, https://www.ncbi.nlm.nih.gov/genome/browse/, for up-to-date information). Notably, genomes of different strains of the same species can vary considerably in size, gene content and nucleotide composition. In 2005, Tettelin et al. (15) introduced the concept of the pan-genome, aiming to provide a compact description of the full complement of genes of all the strains of a species. Genes common to all pan-genome members (or to the vast majority of them) are called core genes; those present in just a few clade members are termed accessory or dispensable genes; genes specific to a particular genome (strain) are termed unique genes (16).
In PGAP we define the pan-genome of a clade at a species or higher level (17). To be included as a core gene for a species-level pan-genome, a gene must be present in the vast majority (at least 80%) of all genomes in the clade. A set of core genes gives rise to a set of core proteins. Figure 1 shows how the number of protein clusters, for each of four well-studied large clades, depends on the fraction of clade members that contribute proteins to the cluster. There are three critical regions in this analysis: (i) unique genes, present in less than 1% of all clade members; (ii) dispensable genes, present in 1–20% of genomes; and (iii) core genes, found in at least 80% of the represented genomes. Based on our analysis, very few clusters appear in at least 20% but no more than 80% of the members of a clade. The 80% cutoff was chosen to capture a wide set of genes conserved within the whole clade while eliminating genes with less abundant representation. We further cluster the core proteins using USearch to reduce the total number of proteins required to represent the full protein complement of the pan-genome (18). We use the representative core proteins to infer genes for homologous core proteins in a newly sequenced genome (19).
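As an illustration of the frequency thresholds described above, the following minimal Python sketch assigns protein clusters to classes according to the fraction of clade genomes that contribute a protein to each cluster. It is not the PGAP implementation: the function name `classify_clusters`, the input layout (a mapping from cluster identifier to the genomes that contribute a protein), the label `intermediate` for the sparsely populated band between 20% and 80%, and the treatment of clusters lying exactly on a boundary are all assumptions made for this example.

```python
from collections import Counter


def classify_clusters(cluster_to_genomes, n_genomes,
                      unique_max=0.01, dispensable_max=0.20, core_min=0.80):
    """Label each protein cluster by the fraction of clade genomes it covers.

    Hypothetical helper for illustration only. Thresholds follow the text:
    unique < 1%, dispensable 1-20%, core >= 80%. The 20-80% band is labeled
    'intermediate' (an assumed name); boundary handling is also an assumption.
    """
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    labels = {}
    for cluster_id, genomes in cluster_to_genomes.items():
        # Count each genome once, even if it contributes several proteins.
        fraction = len(set(genomes)) / n_genomes
        if fraction >= core_min:
            labels[cluster_id] = "core"
        elif fraction < unique_max:
            labels[cluster_id] = "unique"
        elif fraction < dispensable_max:
            labels[cluster_id] = "dispensable"
        else:
            labels[cluster_id] = "intermediate"
    return labels


# Toy clade of 200 genomes, identified by integers (fabricated example data).
n_genomes = 200
clusters = {
    "c1": range(200),  # present in 100% of genomes
    "c2": range(160),  # present in exactly 80% of genomes
    "c3": [7],         # present in 0.5% of genomes
    "c4": range(10),   # present in 5% of genomes
    "c5": range(100),  # present in 50% of genomes
}

labels = classify_clusters(clusters, n_genomes)
summary = Counter(labels.values())  # number of clusters per class
print(labels)
print(summary)
```

In this toy clade, clusters `c1` and `c2` meet the 80% requirement and would be treated as core, whereas `c5` falls into the band that the analysis above finds to be nearly empty in real clades.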
The notion of the pan-genome can be generalized beyond the species level and applies, in fact, to any taxonomic level (from genus to phylum to kingdom). Notably, in the pan-genomes of Archaea and Bacteria, the universally conserved ribosomal genes form a group of core genes. The main practical value of the pan-genome approach is in formulating an efficient framework for comparative analysis of large groups of closely related organisms separated by small evolutionary distances as defined by ribosomal protein markers (20,21).
Tatusova T., DiCuccio M., Badretdin A., Chetvernin V., Nawrocki E.P., Zaslavsky L., Lomsadze A., Pruitt K.D., Borodovsky M., & Ostell J. (2016). NCBI prokaryotic genome annotation pipeline. Nucleic Acids Research, 44(14), 6614-6624.