Four classes of 2D fingerprint types can be distinguished: (i) dictionary-based, (ii) topological or path-based, (iii) circular fingerprints and (iv) pharmacophores. In addition, fingerprints can differ in the atom types or feature classes used or the length of the bit string. In this study, 14 fingerprints belonging to three of the four classes were compared.
The public Molecular ACCess System (MACCS) structural keys [43] are 166 predefined substructures defined as SMARTS patterns and belong to the dictionary-based fingerprint class. They were originally designed for substructure search and typically perform poorly in virtual screening, so they are often used as a baseline fingerprint in benchmarking studies.
Topological or path-based fingerprints describe combinations of atom types and paths between atom types. In atom pair (AP) fingerprints [44], pairs of atoms together with the number of bonds separating them are encoded. In topological torsions (TT) [37], on the other hand, four atoms forming a torsion are described. In both AP and TT fingerprints the atom type consists of the element, the number of heavy-atom neighbours and the number of π-electrons.
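The atom pair encoding described above can be illustrated with a minimal sketch. It is not the RDKit implementation: the toy molecule (ethanol), the hand-assigned atom types and the helper names `bond_distances` and `atom_pairs` are assumptions made for illustration only, and no hashing into a bit string is performed.

```python
from collections import Counter, deque

# Toy heavy-atom graph for ethanol (C-C-O); hydrogens are implicit.
# Each atom type is (element, heavy-atom neighbours, pi electrons),
# assigned by hand because this sketch does no chemical perception.
ATOMS = [("C", 1, 0), ("C", 2, 0), ("O", 1, 0)]
BONDS = [(0, 1), (1, 2)]


def bond_distances(n_atoms, bonds, start):
    """Breadth-first search: number of bonds from `start` to each reachable atom."""
    neighbours = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


def atom_pairs(atoms, bonds):
    """Count (type_i, bond distance, type_j) features over all atom pairs."""
    features = Counter()
    for i in range(len(atoms)):
        dist = bond_distances(len(atoms), bonds, i)
        for j in range(i + 1, len(atoms)):
            if j not in dist:
                continue  # atoms in disconnected fragments form no pair
            # Sort the two types so the feature is independent of atom order.
            first, second = sorted((atoms[i], atoms[j]))
            features[(first, dist[j], second)] += 1
    return features


pairs = atom_pairs(ATOMS, BONDS)
for feature, count in sorted(pairs.items()):
    print(feature, count)
```

A topological torsion would be obtained analogously by enumerating paths of four consecutively bonded atoms instead of atom pairs.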
The RDKit fingerprint, a relative of the well-known Daylight fingerprint [45], is another topological descriptor. Atom types, defined by the atomic number and aromaticity state, are combined with bond types to hash all branched and linear molecular subgraphs up to a particular size [42]. In this study, a maximum path length of five (RDK5) was used.
Similar to the Daylight fingerprints, certain paths and feature classes of the molecular graph are enumerated and hashed in the Avalon fingerprint [46]. There are 16 feature classes, which were optimized for substructure search. A detailed description of the feature classes is given in Table 1 and the supplementary material of [46].
Circular fingerprints were developed more recently [47] and encode circular atom environments up to a certain bond radius from the central atom. If the atom types consist of the element, the number of heavy-atom neighbours, the number of hydrogens, the isotope and ring information, these fingerprints are called extended-connectivity (EC) fingerprints. Alternatively, pharmacophoric features can be used, yielding feature-connectivity (FC) fingerprints. We consider two representations of the fingerprints, bit strings (suffix FP) and count vectors (suffix FC). This gives four types of circular fingerprints: extended-connectivity bit string (ECFP), extended-connectivity count vector (ECFC), feature-connectivity bit string (FCFP) and feature-connectivity count vector (FCFC). The maximum diameter, in bonds, is appended to the name. In this study, the four types of circular fingerprints with a diameter of 4, i.e. ECFP4, ECFC4, FCFP4 and FCFC4, as well as ECFP6, were compared. In addition, ECFC0, which is a kind of atom count, was used as a second baseline fingerprint.
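The difference between the bit-string and count-vector representations, and the relation between diameter and radius, can be sketched as follows. This is a deliberately simplified Morgan-style scheme, not the RDKit algorithm: it ignores bond orders and does not remove duplicate environments, and the toy molecule (propane), the reduced atom invariants and all function names are assumptions made for illustration.

```python
import zlib
from collections import Counter

# Toy heavy-atom graph for propane (C-C-C); the two terminal carbons are equivalent.
# Simplified invariant per atom: "element:number of heavy-atom neighbours".
ATOM_INVARIANTS = ["C:1", "C:2", "C:1"]
NEIGHBOURS = {0: [1], 1: [0, 2], 2: [1]}


def stable_hash(value):
    """Deterministic 32-bit hash (Python's built-in hash() is salted per process)."""
    return zlib.crc32(repr(value).encode("utf-8"))


def circular_environments(invariants, neighbours, radius):
    """Return one identifier per atom and per radius, from 0 up to `radius`."""
    ids = [stable_hash(inv) for inv in invariants]
    collected = list(ids)  # radius-0 environments: the atoms themselves
    for _ in range(radius):
        # Combine each atom's identifier with the sorted identifiers of its neighbours.
        ids = [
            stable_hash((ids[i], tuple(sorted(ids[j] for j in neighbours[i]))))
            for i in range(len(ids))
        ]
        collected.extend(ids)
    return collected


def as_count_vector(environment_ids, n_bits):
    """Count-vector form: how often each folded position occurs."""
    return Counter(e % n_bits for e in environment_ids)


def as_bit_string(environment_ids, n_bits):
    """Bit-string form: only which folded positions are set."""
    return {e % n_bits for e in environment_ids}


# Diameter 4 corresponds to radius 2.
envs = circular_environments(ATOM_INVARIANTS, NEIGHBOURS, radius=2)
counts = as_count_vector(envs, 1024)
bits = as_bit_string(envs, 1024)
print(len(bits), "bits set;", sum(counts.values()), "environments counted")

# Radius 0 reduces to counting atom types, analogous to ECFC0.
print(as_count_vector(circular_environments(ATOM_INVARIANTS, NEIGHBOURS, 0), 1024))
```

Because the two terminal carbons produce identical identifiers, the count vector records them twice, whereas the bit string sets the corresponding position only once.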
For all bit-string fingerprints, a size of 1024 bits was used. However, Sastry et al. found that such a small bit space may result in many collisions, which can affect virtual screening (VS) performance [21]. To investigate this effect, a larger bit space of 16384 bits was used for three fingerprints: long ECFP4 (lECFP4), long ECFP6 (lECFP6) and long Avalon (lAvalon).
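The collision effect can be made concrete with a small sketch. It assumes that hashed feature identifiers are folded into the bit space by a modulo operation; the synthetic feature identifiers and the helper names `fold` and `collisions` are hypothetical and serve only to compare the two bit-space sizes used here.

```python
import zlib


def fold(feature_ids, n_bits):
    """Map feature identifiers onto bit positions by modulo folding."""
    return {fid % n_bits for fid in feature_ids}


def collisions(feature_ids, n_bits):
    """Number of distinct features lost because they share a bit position."""
    distinct = set(feature_ids)
    return len(distinct) - len(fold(distinct, n_bits))


# 200 synthetic 32-bit feature identifiers standing in for hashed substructures.
features = [zlib.crc32(f"feature-{i}".encode("utf-8")) for i in range(200)]

for n_bits in (1024, 16384):
    print(n_bits, "bits:", collisions(features, n_bits), "collisions")
```

Since 16384 is a multiple of 1024, any two features that collide in the larger space also collide in the smaller one under modulo folding, so the 1024-bit fingerprint can never have fewer collisions than the 16384-bit one.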
All fingerprints were calculated using the RDKit.