Extant and ancestral protein sequence space was visualized using machine-learning embeddings, in which each protein is represented as a vector of features in a fixed-size, multidimensional space. The analysis was conducted on the concatenated nitrogenase (HDK) protein sequences in our phylogenetic dataset. Embeddings were obtained using the pre-trained protein language model ESM2 (Lin et al., 2022; Rives et al., 2021), a transformer architecture trained to reproduce correlations at the sequence level in a dataset containing hundreds of millions of protein sequences. Layer 33 of this transformer was used, as recommended by the authors. The resulting 1024 dimensions were reduced by UMAP (McInnes et al., 2020) for visualization in a two-dimensional space.
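The way sequences of different lengths end up as vectors of the same size can be illustrated with a minimal sketch. It assumes that the H, D and K subunits are concatenated in that order and that per-residue vectors are averaged (mean pooling) into one vector per protein; the pooling step is an assumption, as it is not stated above. The function names, the toy sequences and the four-dimensional stand-in vectors are hypothetical; in the actual analysis the per-residue vectors would come from the language model.

```python
from statistics import fmean


def concatenate_subunits(subunits, order=("H", "D", "K")):
    """Join subunit sequences in a fixed order into one string."""
    return "".join(subunits[name] for name in order)


def mean_pool(per_residue):
    """Average per-residue vectors (L x D) into a single length-D vector.

    Mean pooling is an assumption here, not a documented step.
    """
    if not per_residue:
        raise ValueError("need at least one residue vector")
    return [fmean(column) for column in zip(*per_residue)]


# Toy placeholder sequences, not real nitrogenase sequences.
toy = {"H": "MAK", "D": "MSRE", "K": "MT"}
concatenated = concatenate_subunits(toy)

# Stand-in for language-model output: one 4-dimensional vector per residue.
fake_per_residue = [
    [float(i + j) for j in range(4)] for i in range(len(concatenated))
]

# The pooled vector has 4 entries regardless of sequence length.
embedding = mean_pool(fake_per_residue)
```

Because the average is taken over the residue axis, the output length depends only on the per-residue vector size, which is what allows proteins of different lengths to be compared in one space.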
Protein site-wise conservation analysis was performed using the ConSurf server (Ashkenazy et al., 2016). An input alignment containing only extant Group I Mo-nitrogenases was submitted for analysis under default parameters. Conserved sites were defined as those with a ConSurf conservation score >7.
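The conserved-site criterion can be sketched as a simple filter. This assumes the scores are available as one integer grade per alignment position on ConSurf's 1 (variable) to 9 (conserved) scale, so that a score >7 selects grades 8 and 9. The function name and the toy grades are hypothetical, and parsing of the server's output file is not shown.

```python
def conserved_sites(grades, threshold=7):
    """Return 1-based positions whose grade is strictly above the threshold."""
    return [
        position
        for position, grade in enumerate(grades, start=1)
        if grade > threshold
    ]


# Toy grades for a six-column alignment, not real ConSurf output.
toy_grades = [9, 3, 8, 7, 1, 9]
sites = conserved_sites(toy_grades)
```

The comparison is strict, so a position with a grade of exactly 7 is not counted as conserved.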