We used PRISM 4 and antiSMASH 5 to predict the chemical structures of secondary metabolites encoded within 3759 complete bacterial genomes and 6362 metagenome-assembled genomes (MAGs). All bacterial genomes with an assembly level of ‘Complete’ were downloaded from NCBI Genome, and the set of dereplicated genomes, as determined by the Genome Taxonomy Database [15], was retained to mitigate the impact of highly similar genomes on our analysis. Similarly, a set of 7902 MAGs [23] was obtained from NCBI BioProject (accession PRJNA348753), and the subset of dereplicated genomes was retained. Detected biosynthetic gene clusters (BGCs) were matched between PRISM and antiSMASH if their nucleotide sequences overlapped over any range. A small number of PRISM BGC types were mapped to more than one antiSMASH BGC type, including aminoglycosides (‘amglyccycl’ and ‘oligosaccharide’), type I polyketides (‘t1pks’ and ‘transatpks’), and RiPPs (‘bottromycin’, ‘cyanobactin’, ‘glycocin’, ‘head_to_tail’, ‘LAP’, ‘lantipeptide’, ‘lassopeptide’, ‘linaridin’, ‘microviridin’, ‘proteusin’, ‘sactipeptide’, and ‘thiopeptide’). The “hybrid” category encompassed all BGCs assigned any combination of two or more cluster types; that is, it was not limited to hybrid NRPS-PKS BGCs. The “other” category encompassed aryl polyenes, bacteriocins, butyrolactones, ectoines, furans, homoserine lactones, ladderanes, melanins, N-acyl amino acids, NRPS-independent siderophores, phenazines, phosphoglycolipids, resorcinols, stilbenes, terpenes, and type III polyketides. Producing organism taxonomy was based on genome phylogeny and retrieved from the Genome Taxonomy Database [15].
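The overlap criterion used to match BGCs between the two platforms can be illustrated with a minimal sketch. It assumes each BGC is described by a contig identifier and inclusive start and end coordinates; the `BGC` record, its field names, and the functions `overlaps` and `match_bgcs` are hypothetical and are not taken from PRISM or antiSMASH output formats.

```python
from typing import List, NamedTuple, Tuple


class BGC(NamedTuple):
    """Hypothetical record for a detected cluster (not a PRISM/antiSMASH type)."""
    contig: str
    start: int  # inclusive coordinate on the contig
    end: int    # inclusive coordinate on the contig
    kind: str   # cluster type label, e.g. "nrps"


def overlaps(a: BGC, b: BGC) -> bool:
    """True if the two clusters share at least one nucleotide on the same contig."""
    return a.contig == b.contig and a.start <= b.end and b.start <= a.end


def match_bgcs(prism: List[BGC], antismash: List[BGC]) -> List[Tuple[BGC, BGC]]:
    """Return every (PRISM, antiSMASH) pair that overlaps over any range."""
    return [(p, a) for p in prism for a in antismash if overlaps(p, a)]


# Toy coordinates, invented purely for illustration.
prism_bgcs = [BGC("contig_1", 1000, 5000, "nrps"),
              BGC("contig_1", 9000, 12000, "t1pks")]
antismash_bgcs = [BGC("contig_1", 4500, 8000, "nrps"),
                  BGC("contig_2", 9500, 11000, "t1pks")]

matches = match_bgcs(prism_bgcs, antismash_bgcs)
```

The all-against-all comparison is quadratic in the number of clusters, which is acceptable for a single genome; sorting clusters by contig and start coordinate would be one way to scale it, should that be needed.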
Cheminformatic metrics, including molecular weight, number of hydrogen bond donors and acceptors, octanol-water partition coefficients, and Bertz topological complexity, were calculated using RDKit. Both platforms occasionally generated very small, non-specific structure predictions (for example, a single unspecified amino acid or a single malonyl unit) that did not provide actionable information about the chemical structure of the encoded product; to exclude these from consideration, we applied a molecular weight filter that removed structures under 100 Da from the output of both platforms. To evaluate the internal structural diversity of each set of predicted structures, we computed the distribution of pairwise Tanimoto coefficients (Tcs) for each set [45], taking the median pairwise Tc rather than the mean as the summary statistic to ensure robustness against outliers. Structural similarity to known natural products was assessed using the RDKit implementation of the ‘natural product-likeness’ score [22], and by the median Tc between predicted structures and the known secondary metabolite structures deposited in the NP Atlas database [46].
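As an illustration of the diversity summary described above, the following sketch computes the median pairwise Tanimoto coefficient (Tc) for a set of fingerprints. It assumes each fingerprint is represented as a Python set of "on" bit indices; in practice the fingerprints would be derived from the predicted structures with a cheminformatics toolkit such as RDKit, and the fingerprint type is not specified here. The function names and the toy fingerprints are hypothetical.

```python
from itertools import combinations
from statistics import median
from typing import Iterable, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by the union of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def median_pairwise_tanimoto(fingerprints: Iterable[Set[int]]) -> float:
    """Median Tc over all unordered pairs of fingerprints in the set."""
    scores = [tanimoto(a, b) for a, b in combinations(list(fingerprints), 2)]
    if not scores:
        raise ValueError("at least two fingerprints are required")
    return median(scores)


# Toy fingerprints (sets of on-bit indices), invented purely for illustration.
toy_fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 6, 7}, {8, 9}]
diversity = median_pairwise_tanimoto(toy_fingerprints)
```

A lower median indicates a more structurally diverse set. The median is used here, as in the text, because a handful of near-identical or entirely dissimilar pairs would shift a mean more strongly.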