The algorithm is outlined below:
1. mark all heteroatoms in a molecule, including halogens
2. mark also the following carbon atoms:
   - atoms connected by non-aromatic double or triple bond to any heteroatom
   - atoms in nonaromatic carbon–carbon double or triple bonds
   - acetal carbons, i.e. sp3 carbons connected to two or more oxygens, nitrogens or sulfurs; these O, N or S atoms must have only single bonds
   - all atoms in oxirane, aziridine and thiirane rings (such rings are traditionally considered to be functional groups due to their high reactivity)
3. merge all connected marked atoms to a single FG
4. extract FGs also with connected unmarked carbon atoms; these carbon atoms are not part of the FG itself, but form its environment
After all atoms that are part of FGs have been marked as described above, the identified FGs are extracted together with their environment, i.e. the connected carbon atoms, with the type of each carbon (aliphatic or aromatic) preserved.
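The marking, merging and extraction steps can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this work: the hydrogen-suppressed graph representation (element symbols, lowercase for aromatic atoms, and bond orders with 1.5 for aromatic bonds) and the function name `find_functional_groups` are assumptions made for the example, and aromatic heteroatoms receive no special treatment.

```python
from collections import defaultdict
from itertools import combinations


def find_functional_groups(atoms, bonds):
    """Return [(fg_atom_indices, environment)] for a hydrogen-suppressed graph.

    atoms: element symbols, lowercase for aromatic atoms (e.g. "c").
    bonds: (i, j, order) tuples; order is 1, 2, 3, or 1.5 for aromatic bonds.
    This input format is an assumption of the sketch.
    """
    nbrs = defaultdict(dict)  # atom index -> {neighbour index: bond order}
    for i, j, order in bonds:
        nbrs[i][j] = order
        nbrs[j][i] = order

    def only_single_bonds(i):
        return all(order == 1 for order in nbrs[i].values())

    # Step 1: mark all heteroatoms, including halogens.
    marked = {i for i, s in enumerate(atoms) if s not in ("C", "c")}

    # Step 2a/2b: carbons in non-aromatic double or triple bonds,
    # either to a heteroatom or to another carbon.
    for i, j, order in bonds:
        if order in (2, 3):
            marked.update((i, j))

    for i, symbol in enumerate(atoms):
        # Step 2c: acetal carbons, i.e. sp3 carbons with two or more
        # O, N or S neighbours that themselves carry only single bonds.
        if symbol == "C" and only_single_bonds(i):
            ons = [j for j in nbrs[i]
                   if atoms[j] in ("O", "N", "S") and only_single_bonds(j)]
            if len(ons) >= 2:
                marked.add(i)
        # Step 2d: carbons of oxirane, aziridine and thiirane rings
        # (the ring heteroatom is already marked by step 1).
        if symbol in ("O", "N", "S"):
            for a, b in combinations(nbrs[i], 2):
                if atoms[a] == "C" and atoms[b] == "C" and nbrs[a].get(b) == 1:
                    marked.update((a, b))

    # Step 3: merge connected marked atoms into single FGs.
    groups, seen = [], set()
    for start in sorted(marked):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            i = stack.pop()
            if i in group:
                continue
            group.add(i)
            stack.extend(j for j in nbrs[i] if j in marked and j not in group)
        seen |= group
        # Step 4: environment = unmarked (carbon) neighbours, type preserved.
        env = sorted({j for i in group for j in nbrs[i] if j not in marked})
        groups.append((sorted(group),
                       [(j, "aromatic" if atoms[j] == "c" else "aliphatic")
                        for j in env]))
    return groups


# Acetophenone, CC(=O)c1ccccc1, written out by hand.
atoms = ["C", "C", "O", "c", "c", "c", "c", "c", "c"]
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1), (3, 4, 1.5), (4, 5, 1.5),
         (5, 6, 1.5), (6, 7, 1.5), (7, 8, 1.5), (8, 3, 1.5)]
print(find_functional_groups(atoms, bonds))
```

For acetophenone the sketch reports a single FG consisting of the carbonyl carbon and oxygen, with one aliphatic and one aromatic carbon as its environment. In practice a cheminformatics toolkit would supply the molecular graph and aromaticity perception; the hand-built lists here only keep the example self-contained.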
We do not claim that this algorithm provides an ultimate definition of FGs. Every medicinal chemist probably has a slightly different understanding of what a FG is. In particular, the definition of activated sp3 carbons may give rise to some discussion. In the present algorithm we restricted our definition to classical acetal, thioacetal or aminal centers (i.e. sp3 carbons having at least 2 oxygens, sulfurs or nitrogens as neighbors) and did not consider other similar systems, such as alpha-substituted carbonyls or carbons connected to S=O or similar bonds. During the program development phase various such options were tested, and this “strict” definition provided the most satisfactory results. Extending FGs to alpha-substituted carbonyls (i.e. a heteroatom or halogen in the alpha position to a carbonyl) and similar systems more than tripled the number of FGs identified, generating many large and rare FGs. Since our major interest was in comparing various molecular datasets and not in estimating reactivity, we implemented this strict definition of acetal carbons. To assess the possible reactivity of molecules, various substructure filters are available, for example the already mentioned PAINS [9] or Eli Lilly rules [10].
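The effect of the strict definition can be illustrated with a small predicate. This is a sketch only: the function name `is_strict_acetal_carbon` and the hand-written graph format (element symbols plus `(i, j, order)` bond tuples, hydrogens omitted) are assumptions made for this example and are not taken from the program described here.

```python
def is_strict_acetal_carbon(atoms, bonds, idx):
    """True if atom idx is an sp3 carbon with at least two O, N or S
    neighbours that themselves have only single bonds."""
    def bond_orders(i):
        return [order for a, b, order in bonds if i in (a, b)]

    def neighbours(i):
        return [b if a == i else a for a, b, _ in bonds if i in (a, b)]

    # The carbon itself must be non-aromatic with single bonds only.
    if atoms[idx] != "C" or any(o != 1 for o in bond_orders(idx)):
        return False
    qualifying = [j for j in neighbours(idx)
                  if atoms[j] in ("O", "N", "S")
                  and all(o == 1 for o in bond_orders(j))]
    return len(qualifying) >= 2


# Dimethoxymethane, COCOC: the central carbon is a classical acetal.
dmm = (["C", "O", "C", "O", "C"],
       [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1)])
# Chloroacetone, ClCC(=O)C: alpha-halo carbonyl, excluded by the strict rule.
chloroacetone = (["Cl", "C", "C", "O", "C"],
                 [(0, 1, 1), (1, 2, 1), (2, 3, 2), (2, 4, 1)])
# COCS(=O)C: carbon between an ether O and a sulfoxide S; the S=O bond
# disqualifies the sulfur, so only one neighbour counts.
sulfoxide = (["C", "O", "C", "S", "O", "C"],
             [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 2), (3, 5, 1)])

print(is_strict_acetal_carbon(*dmm, 2))            # True
print(is_strict_acetal_carbon(*chloroacetone, 1))  # False
print(is_strict_acetal_carbon(*sulfoxide, 2))      # False
```

Under this rule the alpha carbon of chloroacetone stays unmarked, so the halogen and the carbonyl remain two separate FGs rather than being merged into one larger group.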
To better illustrate the algorithm, some examples of FGs identified for a few simple molecules are shown in Fig.
Examples of functional groups identified. Groups are color-coded according to their type.