Sulfatase sequences were extracted from the UniProt database in August 2009 using the BlastP program [70]. Alkylsulfohydrolases (370 proteins) and arylsulfohydrolases (15 proteins), which belong to the metallo-β-lactamase superfamily, were identified by at least 30% sequence identity over ~600 residues with the characterized enzymes alkylsulfatase SdsA1 (UniProt code: Q9I5I9) and arylsulfatase AtsA (P28607), respectively, and by the presence of the pattern HxHxDH, which is involved in the coordination of two catalytic zinc ions. Fe αKG-dependent alkylsulfodioxygenases (111 proteins) were identified by at least 30% sequence identity over ~300 residues with the characterized alkylsulfodioxygenase AtsK (Q9WWU5) and by the presence of the pattern HxD/ExnH (n = 39 to 154), which is involved in the coordination of the Fe ion [23]. The extracted sulfatase sequences were subjected to multiple sequence alignment using the MAFFT program [71], with the iterative refinement method L-INS-i and the BLOSUM62 scoring matrix. Complete sets of orthologous alkylsulfohydrolases and arylsulfohydrolases on the one hand, and alkylsulfodioxygenases on the other, were classified based on phylogenetic analyses using the metallo-β-lactamase and Fe αKG-dependent dioxygenase superfamilies, respectively.
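The motif criterion described above can be illustrated with a short Python sketch. It is not the procedure actually used in this study: the function names, the toy sequences, and the reading of HxD/ExnH as "H, any residue, D or E, 39 to 154 residues, H" are assumptions made for illustration, and the sequence identity filter against SdsA1, AtsA and AtsK is not reproduced here.

```python
import re

# Assumed regular-expression readings of the two patterns quoted in the text.
# "x" is taken to mean any single residue (one uppercase letter).
ZINC_MOTIF = re.compile(r"H[A-Z]H[A-Z]DH")               # HxHxDH
IRON_MOTIF = re.compile(r"H[A-Z][DE][A-Z]{39,154}H")     # HxD/Ex(n)H, n = 39..154


def has_motif(sequence, motif):
    """Return True if the compiled motif occurs anywhere in the sequence."""
    return motif.search(sequence.upper()) is not None


def filter_by_motif(records, motif):
    """Keep only the {name: sequence} entries whose sequence carries the motif."""
    return {name: seq for name, seq in records.items() if has_motif(seq, motif)}


# Hypothetical toy sequences, not real sulfatases.
toy_records = {
    "with_zinc_site": "MKT" + "HAHGDH" + "LLV",
    "without_site": "MKTAAAGGLLV",
}
zinc_hits = filter_by_motif(toy_records, ZINC_MOTIF)
```

In practice such a filter would be applied to the BlastP hits that already satisfy the identity and length thresholds, so that both criteria hold for every retained sequence.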
The identification of FGly-sulfatases (4058 proteins) was based on a sequence identity of at least 25% with characterized enzymes (Table 1) over a minimal length compatible with the size of the known FGly-sulfatases (at least 400 residues), and on the conservation of the two PROSITE signatures PS00523 and PS00149, which correspond to the simplified patterns [SAPG]-[LIVMST]-[CS]-[STACG]-P-[STA]-R-x(2)-[LIVMFW](2)-[TAR]-G and G-[YV]-x-[ST]-x(2)-[IVAS]-G-K-x(0,1)-[FYWMK]-[HL], respectively [30, 31]. Proteins encompassing several FGly-sulfatase modules were divided into distinct sequences corresponding to each catalytic module. Because of the very large number of sequences, a reliable multiple alignment of the whole set could not be obtained directly. Therefore, the FGly-sulfatase sequences were first divided into 81 groups and 32 orphan sequences on the basis of sequence identities determined with the BlastP program. A multiple sequence alignment was obtained for each of these groups using MAFFT [71] with the iterative refinement method L-INS-i and the BLOSUM62 scoring matrix. These 81 multiple sequence alignments were then manually stacked on each other by matching similar zones using Jalview [72]. The alignments were manually improved using Jalview on the basis of the sequence alignment derived from the superposition of available crystal structures of sulfatases (Table 1). After this refinement step, the poorly conserved regions were removed from the multiple sequence alignment. The different phylogenetic trees were derived from these refined alignments using the maximum likelihood method, either with the program RAxML with MTMAMF or WAG as the substitution matrix [73] or with the program MEGA 5.2.2 [74]. The reliability of the trees was always tested by bootstrap analysis using 100 resamplings of the dataset. The trees were displayed with MEGA 5.2.2 [74].
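As an illustration of the signature check, the two simplified patterns quoted above can be translated into regular expressions. The following Python sketch makes several assumptions: the converter `prosite_to_regex` is a hypothetical helper that handles only the syntax elements appearing in these two patterns (residue classes, `x`, and repeat counts), and the test peptides are invented rather than taken from real sulfatases.

```python
import re

# Simplified patterns as quoted in the text.
PS00523 = "[SAPG]-[LIVMST]-[CS]-[STACG]-P-[STA]-R-x(2)-[LIVMFW](2)-[TAR]-G"
PS00149 = "G-[YV]-x-[ST]-x(2)-[IVAS]-G-K-x(0,1)-[FYWMK]-[HL]"

# One pattern element: a residue, a residue class or "x", with an optional repeat.
ELEMENT = re.compile(r"(\[[A-Z]+\]|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-style pattern into a compiled regex."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            residue = "[A-Z]"            # "x" stands for any residue
        if low is None:
            repeat = ""                  # no repeat count given
        elif high is None:
            repeat = "{" + low + "}"     # fixed count, e.g. x(2)
        else:
            repeat = "{" + low + "," + high + "}"  # range, e.g. x(0,1)
        parts.append(residue + repeat)
    return re.compile("".join(parts))


def has_both_signatures(sequence):
    """Return True if the sequence matches both simplified signatures."""
    sequence = sequence.upper()
    return all(
        prosite_to_regex(p).search(sequence) is not None
        for p in (PS00523, PS00149)
    )
```

A candidate would be retained only when `has_both_signatures` returns `True` in addition to meeting the identity and length thresholds stated above.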
For the FGly-sulfatase sequences, the program MatGAT [75] was used to generate two identity matrices, one for the full-length proteins and one for the edited multiple sequence alignment. The sequence logos were built using WebLogo via the PROSITE databank [76].
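To make the notion of an identity matrix concrete, the sketch below computes percent identity between every pair of sequences in an existing alignment. It is not MatGAT and does not reproduce its algorithm. The normalization used here (identical residues divided by the number of columns in which neither sequence has a gap) is an assumption, as are the function names and the toy alignment.

```python
def pairwise_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == "-" or res_b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if res_a == res_b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def identity_matrix(alignment):
    """Return (names, matrix) of pairwise percent identities for an alignment."""
    names = list(alignment)
    matrix = [
        [pairwise_identity(alignment[a], alignment[b]) for b in names]
        for a in names
    ]
    return names, matrix


# Hypothetical toy alignment; "-" marks a gap.
toy_alignment = {"s1": "ACD-F", "s2": "ACE-F", "s3": "A-DGF"}
names, matrix = identity_matrix(toy_alignment)
```

Other normalizations, for example dividing by the length of the shorter sequence, give different values, so the chosen denominator should be reported alongside any such matrix.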