The majority of FGs contain heteroatoms. Our approach is therefore based on processing heteroatoms and their environment, with the addition of some other functionalities such as multiple carbon–carbon bonds.
The algorithm is outlined below:

1. Mark all heteroatoms in a molecule, including halogens.

2. Also mark the following carbon atoms:

   - atoms connected by a non-aromatic double or triple bond to any heteroatom

3. Merge all connected marked atoms into a single FG.

4. Extract FGs together with their connected unmarked carbon atoms; these carbon atoms are not part of the FG itself, but form its environment.
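The marking and merging steps above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the molecule representation (a list of `(element, is_aromatic)` tuples plus a list of `(i, j, bond_order)` tuples with implicit hydrogens) and the function name `find_functional_groups` are assumptions made for this example. Only the carbon rule listed above is implemented, and aromatic atoms are skipped.

```python
from collections import defaultdict


def find_functional_groups(atoms, bonds):
    """Mark atoms and merge connected marked atoms into groups.

    atoms: list of (element, is_aromatic) tuples (hypothetical format)
    bonds: list of (i, j, order) tuples, order 1, 2 or 3 for non-aromatic bonds
    Returns a list of sorted atom-index lists, one per functional group.
    """
    neighbors = defaultdict(list)
    for i, j, order in bonds:
        neighbors[i].append((j, order))
        neighbors[j].append((i, order))

    def is_hetero(idx):
        return atoms[idx][0] not in ("C", "H")

    marked = set()
    for idx, (element, aromatic) in enumerate(atoms):
        if aromatic:
            continue  # this sketch iterates only through non-aromatic atoms
        if is_hetero(idx):
            marked.add(idx)  # step 1: heteroatoms, including halogens
        elif element == "C" and any(
            order in (2, 3) and is_hetero(nbr) for nbr, order in neighbors[idx]
        ):
            marked.add(idx)  # step 2: C with a double/triple bond to a heteroatom

    # Step 3: connected components over the marked atoms.
    groups, seen = [], set()
    for start in sorted(marked):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            current = stack.pop()
            component.append(current)
            for nbr, _order in neighbors[current]:
                if nbr in marked and nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        groups.append(sorted(component))
    return groups


# Acetic acid, CH3-C(=O)-OH: the carbonyl carbon and both oxygens form one group.
acetic_acid_atoms = [("C", False), ("C", False), ("O", False), ("O", False)]
acetic_acid_bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]
print(find_functional_groups(acetic_acid_atoms, acetic_acid_bonds))
```

Further carbon-marking rules would be added as extra conditions in the marking loop; the merge step stays unchanged because it depends only on which atoms are marked.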

The algorithm described above iterates only through non-aromatic atoms. Aromatic heteroatoms are collected as single atoms, not as part of a larger system. They are extended to a larger FG only when an aliphatic functionality is connected to them (for example an acyl group connected to a pyrrole nitrogen). Heteroatoms in heterocycles are traditionally not considered to be “classical” FGs by themselves but simply part of the whole heterocyclic ring. The rationale for this treatment is the enormous diversity of heterocyclic systems. For example, in our previous study [12] nearly 600,000 different heterocycles consisting of 1–3 fused 5- and 6-membered rings were enumerated.
After marking all atoms that are part of FGs as described above, the identified FGs are extracted together with their environment, i.e. the connected carbon atoms, with the type of each carbon (aliphatic or aromatic) preserved.
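The environment extraction can be illustrated with a short sketch. It assumes a hypothetical molecule representation (a list of `(element, is_aromatic)` tuples and a list of `(i, j, bond_order)` tuples) and a hypothetical helper name, `group_environment`; any carbon bonded to a group member but not itself in the group is treated as environment.

```python
def group_environment(atoms, bonds, group):
    """Return {carbon_index: "aromatic" | "aliphatic"} for the carbons
    bonded to a functional group but not part of it.

    atoms: list of (element, is_aromatic) tuples (hypothetical format)
    bonds: list of (i, j, order) tuples
    group: iterable of atom indices forming one functional group
    """
    members = set(group)
    environment = {}
    for i, j, _order in bonds:
        # Look at the bond from both ends so direction does not matter.
        for inside, outside in ((i, j), (j, i)):
            if inside in members and outside not in members:
                element, aromatic = atoms[outside]
                if element == "C":
                    environment[outside] = "aromatic" if aromatic else "aliphatic"
    return environment


# Ether oxygen between an aromatic carbon (index 0) and a methyl carbon (index 2).
ether_atoms = [("C", True), ("O", False), ("C", False)]
ether_bonds = [(0, 1, 1), (1, 2, 1)]
print(group_environment(ether_atoms, ether_bonds, [1]))
```

Keeping the carbon type in the result lets the same ether oxygen be distinguished as, for instance, an aryl alkyl ether rather than a dialkyl ether when groups are compared across datasets.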
We do not claim that this algorithm provides an ultimate definition of FGs. Every medicinal chemist probably has a slightly different understanding of what an FG is. In particular, the definition of activated sp3 carbons may create some discussion. In the present algorithm we restricted our definition to classical acetal, thioacetal or aminal centers (i.e. sp3 carbons having at least 2 oxygens, sulfurs or nitrogens as neighbors) and did not consider other similar systems, e.g. alpha-substituted carbonyls or carbons connected to S=O or similar bonds. During the program development phase various such options were tested, and this “strict” definition provided the most satisfactory results. Extending FGs to alpha-substituted carbonyls (i.e. a heteroatom or halogen in the alpha position to a carbonyl) and similar systems more than tripled the number of FGs identified, generating many large and rare FGs. Since our major interest was in comparing various molecular datasets and not in reactivity estimation, we implemented this strict definition of acetal carbons. To assess the possible reactivity of molecules, various substructure filters are available, for example the already mentioned PAINS [9] or Eli Lilly rules [10].
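The strict acetal criterion stated above (an sp3 carbon with at least two oxygen, sulfur or nitrogen neighbors) can be sketched as a simple predicate. The molecule representation and the name `is_acetal_carbon` are assumptions for this example, and sp3 hybridization is approximated as "non-aromatic carbon with only single bonds"; hydrogens are taken to be implicit.

```python
def is_acetal_carbon(atoms, bonds, idx):
    """Check the strict acetal/thioacetal/aminal criterion for atom idx.

    atoms: list of (element, is_aromatic) tuples (hypothetical format)
    bonds: list of (i, j, order) tuples
    """
    element, aromatic = atoms[idx]
    if element != "C" or aromatic:
        return False
    hetero_neighbors = 0
    for i, j, order in bonds:
        if idx not in (i, j):
            continue
        if order != 1:
            return False  # a multiple bond means the carbon is not sp3
        other = j if i == idx else i
        if atoms[other][0] in ("O", "N", "S"):
            hetero_neighbors += 1
    return hetero_neighbors >= 2


# Dimethoxymethane, CH3-O-CH2-O-CH3: the central carbon (index 2) qualifies.
dmm_atoms = [("C", False), ("O", False), ("C", False), ("O", False), ("C", False)]
dmm_bonds = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1)]
print(is_acetal_carbon(dmm_atoms, dmm_bonds, 2))

# Chloroacetone, Cl-CH2-C(=O)-CH3: the alpha carbon (index 1) does not qualify.
ca_atoms = [("Cl", False), ("C", False), ("C", False), ("O", False), ("C", False)]
ca_bonds = [(0, 1, 1), (1, 2, 1), (2, 3, 2), (2, 4, 1)]
print(is_acetal_carbon(ca_atoms, ca_bonds, 1))
```

The second example reflects the design choice discussed above: an alpha-halogenated carbonyl carbon is deliberately left out of the FG under the strict definition.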
To better illustrate the algorithm, some examples of FGs identified for a few simple molecules are shown in Fig. 1.

Fig. 1 Example of functional groups identified. Groups are color-coded according to their type.
