Structural homology models of ancestral sequences were generated with MODELLER v10.2 (Webb and Sali, 2016), using PDB 1M34 as a template for all nitrogenase protein subunits, and visualized with ChimeraX v1.3 (Pettersen et al., 2021).
Extant and ancestral protein sequence space was visualized using machine-learning embeddings, where each protein embedding represents protein features in a fixed-size, multidimensional vector space. The analysis was conducted on concatenated (HDK) nitrogenase protein sequences in our phylogenetic dataset. The embeddings were obtained using the pre-trained language model ESM2 (Lin et al., 2022; Rives et al., 2021), a transformer architecture trained to reproduce correlations at the sequence level in a dataset containing hundreds of millions of protein sequences. Layer 33 of this transformer was used, as recommended by the authors. The resulting 1024 dimensions were reduced by UMAP (McInnes et al., 2020) for visualization in a two-dimensional space.
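As a minimal illustration of how sequences of different lengths can map to a fixed-size vector, the sketch below concatenates H, D and K subunit sequences and averages per-residue vectors across the sequence. The text does not state how per-residue representations were pooled, so mean pooling is an assumption here, and the function names `concatenate_hdk` and `mean_pool` and the toy three-dimensional vectors are hypothetical stand-ins for real ESM2 output.

```python
from statistics import fmean


def concatenate_hdk(h_seq, d_seq, k_seq):
    """Join the H, D and K subunit sequences into one string, in that order."""
    return h_seq + d_seq + k_seq


def mean_pool(per_residue):
    """Average per-residue vectors (one list per residue) into a single vector.

    Assumption: mean pooling is used only for illustration; the source does not
    specify the pooling strategy.
    """
    if not per_residue:
        raise ValueError("need at least one residue vector")
    dim = len(per_residue[0])
    if any(len(vec) != dim for vec in per_residue):
        raise ValueError("all residue vectors must have the same length")
    # zip(*rows) iterates over columns, i.e. one embedding dimension at a time
    return [fmean(column) for column in zip(*per_residue)]


# Toy per-residue vectors standing in for language-model output (hypothetical values)
short_protein = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
long_protein = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]

short_vec = mean_pool(short_protein)
long_vec = mean_pool(long_protein)
```

Both proteins yield a vector of the same length regardless of how many residues they contain, which is what allows all sequences to be projected into a common two-dimensional space afterwards.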
Protein site-wise conservation analysis was performed using the ConSurf server (Ashkenazy et al., 2016). An input alignment containing only extant Group I Mo-nitrogenases was submitted for analysis under default parameters. Conserved sites were defined as those with a ConSurf conservation score >7.
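The conserved-site criterion can be sketched as a simple filter over per-site scores. The function name `conserved_sites`, the dictionary layout (alignment position mapped to score) and the example scores are hypothetical; parsing the actual ConSurf output file is not shown.

```python
def conserved_sites(scores, threshold=7):
    """Return sorted positions whose conservation score is strictly above threshold."""
    return sorted(pos for pos, score in scores.items() if score > threshold)


# Hypothetical per-site scores keyed by alignment position
example_scores = {1: 9, 2: 7, 3: 8, 4: 3, 5: 9}
selected = conserved_sites(example_scores)
print(selected)
```

Because the criterion is a strict inequality, a site scoring exactly 7 (position 2 in the toy data) is not counted as conserved.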