All protein domain information was extracted from the Pfam database (Finn et al., 2016). We used Pfam version 33.1 (May 2020), which contains 18,259 entries. Of these, we used the Pfam-A subset of 18,101 curated domains for further analysis (Sonnhammer et al., 1997). Each alignment was filtered to remove sequences from archaea, bacteria, viruses, and any other non-eukaryotic sequences, so that only data from eukaryotes were retained. Overall, the filtered alignments contain information from 27,077,043 domains across 1,161 species. The human protein domains were extracted from the canonical UniProt transcripts used in Pfam and represent 5,168,776 of the 12,871,017 amino acids in human proteins (40.2%).
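The eukaryote-only filtering step can be illustrated with a minimal sketch. It assumes each aligned sequence has already been assigned a superkingdom label. The function name `keep_eukaryotes`, the sequence identifiers, and the lookup table are hypothetical and are not taken from Pfam or from the pipeline described here.

```python
def keep_eukaryotes(alignment, superkingdom_of):
    """Return only the aligned sequences labelled as eukaryotic.

    alignment:       dict mapping a sequence identifier to its aligned sequence
    superkingdom_of: dict mapping a sequence identifier to a superkingdom label
                     (assumed to be available from a taxonomy lookup)
    Sequences with a missing or non-eukaryotic label are dropped.
    """
    return {
        seq_id: seq
        for seq_id, seq in alignment.items()
        if superkingdom_of.get(seq_id) == "Eukaryota"
    }


# Toy alignment with hypothetical identifiers.
toy_alignment = {
    "seqA": "ACD-",
    "seqB": "ACDE",
    "seqC": "GCDE",
    "seqD": "A-DE",
}
toy_taxonomy = {
    "seqA": "Eukaryota",
    "seqB": "Bacteria",
    "seqC": "Viruses",
    # seqD has no label and is therefore removed as well
}

filtered = keep_eukaryotes(toy_alignment, toy_taxonomy)
```

Dropping unlabelled sequences is a conservative choice for this sketch; the original filtering may have handled unclassified entries differently.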
For each residue, we then calculated an amino acid value in four steps: construction of a count matrix, a corrected frequency matrix, a corrected relative frequency matrix, and finally the position score matrix.
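The four steps can be sketched as below. The text names only the four matrices, so every formula here is an assumption chosen for illustration: the correction is modelled as a pseudocount, the relative frequency as a ratio to a uniform background, and the score as a log2 ratio. All function names and parameters are likewise hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_matrix(alignment):
    """Step 1: count each amino acid per column; gaps and other symbols are skipped."""
    counts = []
    for col in range(len(alignment[0])):
        column = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        counts.append({aa: column.get(aa, 0) for aa in AMINO_ACIDS})
    return counts


def corrected_frequency_matrix(counts, pseudocount=1.0):
    """Step 2 (assumed): frequencies with a pseudocount so that no entry is zero."""
    freqs = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(AMINO_ACIDS)
        freqs.append({aa: (column[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return freqs


def corrected_relative_frequency_matrix(freqs, background=None):
    """Step 3 (assumed): divide by a background frequency, uniform by default."""
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    return [{aa: column[aa] / background[aa] for aa in AMINO_ACIDS} for column in freqs]


def position_score_matrix(rel_freqs):
    """Step 4 (assumed): log2 of the relative frequency for each position."""
    return [{aa: math.log2(column[aa]) for aa in AMINO_ACIDS} for column in rel_freqs]


# Toy alignment of four sequences and three columns.
alignment = ["ACD", "ACE", "A-D", "GCD"]
counts = count_matrix(alignment)
freqs = corrected_frequency_matrix(counts)
rel_freqs = corrected_relative_frequency_matrix(freqs)
scores = position_score_matrix(rel_freqs)
```

Under these assumptions, an amino acid observed more often than the background at a position receives a positive score, and one never observed receives a negative score. The value for a given residue is then read from the score matrix at its column and amino acid.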