An initial list of 75 NifD/E- and NifK/N-like sequences belonging to the PFAM family PF00148 was selected manually from the IMG database [33] (http://img.jgi.doe.gov) and used as queries in a BLAST [32] search against the NCBI NR protein database with an e-value cut-off of 10⁻²⁰. This search returned 1117 unique gene IDs, which were filtered against known NifD/E and NifK/N sequences (Additional file 2: Table S3) to remove hits to conventional nitrogenase. The remaining 900 unique gene IDs were further filtered with a BLAST search against ChlB (accession GenBank:AAT28195.1), BchB (SwissProt:Q3APL0.1), ChlN (GenBank:AAP99591.1) and BchN (SwissProt:Q3APK9.1) to remove homologs of protochlorophyllide reductase. Fused protein sequences (NifHD/E) were also filtered out and were not subjected to further phylogenetic analysis. A final filtering step used a preliminary tree built with FastTree 2.1 [34] to identify sets of very similar sequences; only one member of each set was kept. The final compilation contained 472 unique gene IDs.
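The first two filtering steps (e-value cut-off, then removal of known sequences) can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes the BLAST results were saved in the standard 12-column tabular format (subject ID in column 2, e-value in column 11), and the function names and the gene IDs in the example are hypothetical.

```python
from typing import Iterable, Set


def unique_hits(blast_lines: Iterable[str], max_evalue: float = 1e-20) -> Set[str]:
    """Collect unique subject IDs whose e-value passes the cut-off.

    Assumes tab-separated BLAST tabular output (an assumption, not stated
    in the text): column 2 is the subject ID, column 11 is the e-value.
    """
    hits: Set[str] = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        subject_id, evalue = fields[1], float(fields[10])
        if evalue <= max_evalue:
            hits.add(subject_id)
    return hits


def remove_known(hits: Iterable[str], known_ids: Iterable[str]) -> Set[str]:
    """Drop hits that match an exclusion list (e.g. known NifD/E, NifK/N)."""
    return set(hits) - set(known_ids)


# Hypothetical example rows: query, subject, ..., e-value, bit score.
example_rows = [
    "q1\tgeneA\t45.0\t400\t200\t5\t1\t400\t1\t400\t1e-50\t180",
    "q2\tgeneA\t40.0\t380\t210\t6\t1\t380\t1\t380\t1e-30\t120",
    "q1\tgeneB\t30.0\t150\t100\t4\t1\t150\t1\t150\t3e-05\t45",
    "q3\tgeneC\t50.0\t410\t190\t3\t1\t410\t1\t410\t1e-25\t110",
]
candidates = remove_known(unique_hits(example_rows), known_ids={"geneC"})
```

In this toy input, `geneA` is counted once although two queries hit it, `geneB` fails the e-value cut-off, and `geneC` is removed as a known sequence. The same `remove_known` step could be reused for the protochlorophyllide reductase homologs once their hit IDs are available.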
Manual inspection of the 472-sequence tree yielded a “core” list of 73 representative sequences. These 73 sequences were aligned with ClustalW version 2.1 [35] using the Gonnet 250 protein matrix and default pairwise alignment options. A phylogenetic tree was then built with FastTree 2.1 [34] using the WAG + gamma20 likelihood model; the result is shown in Figure 4.
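For orientation, the alignment and tree-building steps might be invoked roughly as below. The exact command lines are not given in the text, so this is only a sketch: the executable names (`clustalw2`, `FastTree`), the file names, and the helper functions are assumptions. `-MATRIX=GONNET` selects ClustalW's Gonnet matrix series, and `-wag -gamma` asks FastTree for the WAG model with its Gamma20 likelihood rescaling.

```python
from typing import List


def clustalw_command(in_fasta: str, out_fasta: str) -> List[str]:
    """Build a ClustalW 2 protein alignment command (not executed here)."""
    return [
        "clustalw2",            # executable name is an assumption
        f"-INFILE={in_fasta}",
        "-TYPE=PROTEIN",
        "-MATRIX=GONNET",       # Gonnet series; pairwise options left at defaults
        "-OUTPUT=FASTA",
        f"-OUTFILE={out_fasta}",
    ]


def fasttree_command(alignment: str) -> List[str]:
    """Build a FastTree command for a protein alignment with WAG + gamma."""
    # FastTree writes the Newick tree to standard output.
    return ["FastTree", "-wag", "-gamma", alignment]


# Hypothetical file names for the 73 core sequences.
align_cmd = clustalw_command("core73.fasta", "core73.aln.fasta")
tree_cmd = fasttree_command("core73.aln.fasta")
```

The lists can be passed to `subprocess.run`, redirecting FastTree's standard output to a file to capture the tree.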