To build the training set, we used a method based on the Markovian Jensen–Shannon divergence (MJSD) to extract the core (native) components of all available prokaryotic genomes, so that the regression was trained on the most balanced representation. We substantially reduced the runtime of the genome segmentation and clustering algorithm implemented in IslandCafe [27] by introducing a reverse-calculation step during recursive segmentation. MJSD, entropy, and statistical significance were calculated as described in [27]. Specifically, the information content of a genome sequence, quantified by the entropy function for probability distribution $p_i$, is obtained as

$$H_m(p_i) = -\sum_{w} P(w) \sum_{x \in A} P(x \mid w) \log_2 P(x \mid w),$$

where $P(x \mid w)$ is the probability of nucleotide $x$ given the preceding oligonucleotide $w$ of length $m$ ($m$ defines the model order and is set to 2 in IslandCafe), $P(w)$ is the probability of oligonucleotide $w$, and $A$ is the nucleotide alphabet. A genome is first segmented by computing the entropy, and from it the MJSD, at each position along the genome and splitting at the position of highest MJSD, provided that it meets the user-defined significance threshold. The same procedure is then applied recursively to the resulting genomic segments.
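The entropy and the recursive splitting described above can be illustrated with a minimal Python sketch. It is not the IslandCafe implementation. Several parts are assumptions made for illustration: the divergence is taken to be the usual length-weighted Jensen–Shannon form (entropy of the whole segment minus the weighted entropies of its two parts), the significance test of [27] is replaced by a plain threshold `min_divergence`, and each transition is assigned to the side that contains its last nucleotide. The function names (`markov_counts`, `markov_entropy`, `best_split`, `segment`) are hypothetical, and the incremental count update is a generic device, not the reverse-calculation step mentioned in the text.

```python
import math
from collections import Counter


def markov_counts(seq, m=2):
    """Count every (context w of length m, next nucleotide x) pair in seq."""
    counts = Counter()
    for i in range(m, len(seq)):
        counts[(seq[i - m:i], seq[i])] += 1
    return counts


def markov_entropy(counts):
    """Plug-in estimate of H_m = -sum_w P(w) sum_x P(x|w) log2 P(x|w)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    context_totals = Counter()
    for (w, _x), c in counts.items():
        context_totals[w] += c
    h = 0.0
    for (w, _x), c in counts.items():
        # P(w) * P(x|w) equals c / total; P(x|w) equals c / context_totals[w].
        h -= (c / total) * math.log2(c / context_totals[w])
    return h


def best_split(seq, m=2, min_len=10):
    """Return (position, divergence) of the best split, or (None, 0.0).

    Assumed divergence: H(whole) - w_left * H(left) - w_right * H(right),
    with weights proportional to the number of transitions on each side.
    """
    whole = markov_counts(seq, m)
    n = sum(whole.values())
    h_whole = markov_entropy(whole)
    left, right = Counter(), Counter(whole)
    best_pos, best_d = None, 0.0
    for k in range(m + 1, len(seq)):
        # Move the transition ending at index k - 1 from the right to the left.
        key = (seq[k - 1 - m:k - 1], seq[k - 1])
        left[key] += 1
        right[key] -= 1
        if right[key] == 0:
            del right[key]
        if k < min_len or len(seq) - k < min_len:
            continue
        n_left = k - m
        d = (h_whole
             - (n_left / n) * markov_entropy(left)
             - ((n - n_left) / n) * markov_entropy(right))
        if d > best_d:
            best_pos, best_d = k, d
    return best_pos, best_d


def segment(seq, m=2, min_len=10, min_divergence=0.01, offset=0):
    """Recursively split seq; returns sorted boundary positions.

    min_divergence is a stand-in for the significance test in [27].
    """
    pos, d = best_split(seq, m, min_len)
    if pos is None or d < min_divergence:
        return []
    return (segment(seq[:pos], m, min_len, min_divergence, offset)
            + [offset + pos]
            + segment(seq[pos:], m, min_len, min_divergence, offset + pos))


if __name__ == "__main__":
    toy = "AC" * 100 + "AG" * 100  # two regions with different transition rules
    print(best_split(toy))
    print(segment(toy))
```

Because $H_m$ is a conditional entropy, two regions only diverge under this measure when they share contexts $w$ but differ in what follows them, which is why the toy sequence reuses the nucleotide A in both halves. On real genomes the threshold would be replaced by the significance calculation of [27].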