MetaPhlAn 3 maps the raw reads of a metagenomic sample against a database of 1.1M markers using bowtie2 (Langmead and Salzberg, 2012). The default bowtie2 mapping parameters are those of the ‘very-sensitive’ preset, but they can be customized via the MetaPhlAn 3 settings. The input can be provided as a single FASTQ file (optionally compressed), as multiple FASTQ files in a single archive, or as a pre-computed mapping. Internally, MetaPhlAn 3 estimates the coverage of each marker and computes each clade’s coverage as the robust average of the coverage across the markers of that clade. The clade coverages are then normalized across all detected clades to obtain the relative abundance of each taxon, as previously described (Segata et al., 2012; Truong et al., 2015).
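The normalization step can be illustrated with a minimal Python sketch. It assumes that per-clade coverages are already available in a dictionary and that relative abundances are expressed as percentages; the function name `relative_abundances` and the example clade names are hypothetical and are not part of the MetaPhlAn 3 code base.

```python
def relative_abundances(clade_coverage):
    """Normalize per-clade coverage values so that they sum to 100.

    clade_coverage: dict mapping a clade name to its (robust average)
    coverage. This is an illustrative helper, not MetaPhlAn 3 code.
    """
    total = sum(clade_coverage.values())
    if total == 0:
        # No clade detected: every relative abundance is zero.
        return {clade: 0.0 for clade in clade_coverage}
    return {clade: 100.0 * cov / total for clade, cov in clade_coverage.items()}


# Hypothetical coverages for two detected clades.
profile = relative_abundances({"s__Clade_A": 1.0, "s__Clade_B": 3.0})
```

Here the clade with three times the coverage receives three times the relative abundance, and the values over all detected clades sum to 100.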
In version 3, we further optimized the parameter of the robust average that sets the fraction of markers excluded at the top and bottom of the marker abundance distribution (‘stat_q’ parameter). This is now set by default to 0.2 (i.e. it excludes the 20% of markers with the highest abundance as well as the 20% of markers with the lowest abundance). To further improve the quality of the read mapping, we adopted quality controls before and after mapping, discarding low-quality sequences and alignments (reads shorter than 70 bp and alignments with a MAPQ value less than 5).
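As a rough illustration of these two ideas, the sketch below implements a plain truncated mean controlled by a `stat_q` argument and a simple length and MAPQ filter. It is a simplification under stated assumptions: the exact estimator used inside MetaPhlAn 3 may differ (for example in how marker lengths are weighted), and the function names `robust_average` and `passes_qc` are hypothetical.

```python
def robust_average(marker_coverages, stat_q=0.2):
    """Truncated mean: drop the lowest and highest `stat_q` fraction of
    markers, then average the rest. Assumes 0 <= stat_q < 0.5."""
    values = sorted(marker_coverages)
    n_trim = int(stat_q * len(values))  # markers dropped at each end
    kept = values[n_trim:len(values) - n_trim]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


def passes_qc(read_length, mapq, min_read_len=70, min_mapq=5):
    """Keep reads of at least 70 bp whose alignment has MAPQ >= 5
    (thresholds taken from the text; argument names are illustrative)."""
    return read_length >= min_read_len and mapq >= min_mapq


# With five markers and stat_q = 0.2, one marker is dropped at each end,
# so the outlier coverage of 100 does not influence the clade estimate.
clade_cov = robust_average([0, 1, 2, 3, 100])
```

Trimming both tails makes the clade estimate less sensitive to markers that are absent from the strain in the sample or that attract spurious reads.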
We also introduced a new feature for estimating the ‘unknown’ portion of the taxonomic profile, which corresponds to taxa not present in current databases; this is computed by subtracting from the total number of reads the average read depth of each taxon normalized by its taxon-specific average genome length. Additionally, the new MetaPhlAn 3 output format includes by default the NCBI taxonomy ID of each profiled clade, which allows better comparisons between tools and tracking of species names in case of taxonomic reassignment.
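One possible reading of the ‘unknown’ estimate is sketched below. It assumes that the read depth of a taxon is expressed as reads per base of marker sequence, so that multiplying it by the taxon's average genome length gives an estimate of the reads originating from that taxon; the remainder of the sample is then attributed to unknown taxa. This interpretation, the function name `unknown_fraction`, and the example values are assumptions made for illustration and are not taken from the MetaPhlAn 3 implementation.

```python
def unknown_fraction(total_reads, clade_depth, clade_genome_length):
    """Estimate the fraction of reads not explained by profiled taxa.

    total_reads: number of reads in the metagenome.
    clade_depth: dict of taxon -> average depth (assumed reads per base).
    clade_genome_length: dict of taxon -> average genome length in bp.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    # Assumed model: reads from a taxon ~ depth * genome length.
    estimated_known = sum(
        clade_depth[taxon] * clade_genome_length[taxon] for taxon in clade_depth
    )
    # Cap at 1 so that over-estimation never yields a negative unknown share.
    known = min(estimated_known / total_reads, 1.0)
    return 1.0 - known


# Hypothetical sample: 1000 reads and two profiled taxa.
unknown = unknown_fraction(
    1000,
    {"taxon_a": 0.0001, "taxon_b": 0.00005},
    {"taxon_a": 2_000_000, "taxon_b": 4_000_000},
)
```

In this toy example each taxon accounts for roughly 200 reads, so about 60% of the sample would be reported as unknown.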
Finally, alongside the default MetaPhlAn output format, profiles can now be reported in the CAMI output format (Belmann et al., 2015; BioBoxes, 2020), which can be used for benchmarking with the OPAL framework (Meyer et al., 2019). To support post-profiling analyses, a convenience R script for computing weighted and unweighted UniFrac distances (Lozupone and Knight, 2005) from MetaPhlAn profiles is now available in the software repository (metaphlan/utils/calculate_unifrac.R), alongside the phylogeny (in Newick format) comprising all MetaPhlAn 3 taxa. The improvements and additions in MetaPhlAn 3 compared to the previous MetaPhlAn 2 version are summarized in Supplementary file 2.