We took the previously published data [14] and, for each protein family, first repeated the analysis of that paper: we ran PSI-BLAST with each key protein against chromosomes and plasmids and clustered the resulting proteins with MCL. This approach failed to produce good results because PSI-BLAST often did not converge in the searches against chromosomes. For example, the searches for ATPases tended to group together many different prokaryotic ATPases, making their accurate separation difficult. We therefore used a different approach. For each protein family uncovered in our previous analysis of plasmids we did the following: (i) We made a multiple alignment with MUSCLE [72] and built a phylogenetic tree with PHYML [73]. Using these two pieces of evidence, we removed the very few cases of extreme divergence, the proteins that were too short and the proteins that were too long (typically false positives, or fusions or fissions of proteins caused by sequencing errors or pseudogenization). (ii) We built multiple alignments of the selected proteins with MUSCLE, checked the alignments manually and, where relevant, trimmed them to remove poorly aligned regions at the edges. The C-terminal regions of the MOB alignments were systematically trimmed, as suggested previously [67]. The alignment of the T4CP family showed two conserved regions separated by a region that aligned poorly. We therefore split this alignment in two and built a separate profile for each conserved region. In general the two profiles were found together, but only the second was present in all conjugative elements, apart from some of those of the Tn916 family. The T4CP of these latter elements matched the general T4CP profiles poorly, so we built a specific profile for this family. (iii) We used HMMER 3.0 to build protein profiles from the manually curated multiple alignments.
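The length filtering of step (i) and the edge trimming of step (ii) were done by inspection of the alignments and trees, and no numerical cutoffs are given. As a rough illustration only, the sketch below shows how such curation could be approximated automatically. The function names, the length window relative to the family median, and the gap-fraction threshold are all assumptions made for this example, not values used in the analysis.

```python
from statistics import median


def filter_by_length(seqs, low=0.5, high=1.5):
    """Keep sequences whose length is within [low, high] x the family median.

    `low` and `high` are illustrative cutoffs (assumptions), standing in for
    the manual removal of proteins that are too short or too long.
    """
    med = median(len(s) for s in seqs.values())
    return {name: s for name, s in seqs.items()
            if low * med <= len(s) <= high * med}


def trim_alignment_edges(aligned, max_gap_fraction=0.5):
    """Remove leading and trailing columns that are mostly gaps.

    `aligned` maps sequence names to equal-length aligned strings using '-'
    for gaps. Internal columns are never removed; only the two edges are
    trimmed. `max_gap_fraction` is an assumed threshold.
    """
    rows = list(aligned.values())
    n_cols = len(rows[0])
    if any(len(r) != n_cols for r in rows):
        raise ValueError("aligned sequences must all have the same length")

    def well_aligned(col):
        gaps = sum(1 for r in rows if r[col] == "-")
        return gaps / len(rows) <= max_gap_fraction

    keep = [c for c in range(n_cols) if well_aligned(c)]
    if not keep:
        return {name: "" for name in aligned}
    start, end = keep[0], keep[-1] + 1
    return {name: row[start:end] for name, row in aligned.items()}


if __name__ == "__main__":
    # Toy family: one fragment (20 aa) and one likely fusion (400 aa).
    family = {"a": "M" * 100, "b": "M" * 110, "c": "M" * 20,
              "d": "M" * 400, "e": "M" * 95}
    print(sorted(filter_by_length(family)))

    # Toy alignment with ragged edges.
    aln = {"x": "--MKLV-", "y": "-AMKLVT", "z": "--MKIV-"}
    print(trim_alignment_edges(aln))
```

In a workflow of this kind, the trimmed alignment would be written to a file and passed to HMMER's `hmmbuild` to produce the profile of step (iii). Automatic thresholds of this sort would not replace the manual checks described above, in particular the inspection of the phylogenetic tree for extremely divergent sequences.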