A SMILES [25] represents a molecule as a sequence of characters corresponding to atoms, together with special characters denoting the opening and closure of rings and branches. In most cases the SMILES is tokenized character by character. The exceptions are atom types that comprise two characters, such as "Cl" and "Br", and special environments denoted by square brackets (e.g. [nH]), each of which is treated as a single token. This tokenization resulted in 86 distinct tokens in the training data. Figure 3 exemplifies how a chemical structure is translated into both the SMILES and one-hot encoded representations.

Figure 3. Three representations of 4-(chloromethyl)-1H-imidazole. Depiction of a one-hot representation derived from the SMILES of a molecule. Here a reduced vocabulary is shown; in practice, a much larger vocabulary that covers all tokens present in the training data is used.
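
The tokenization and one-hot encoding described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the regular expression, the function names (`tokenize`, `build_vocabulary`, `one_hot`) and the SMILES string `ClCc1c[nH]cn1` written here for 4-(chloromethyl)-1H-imidazole are assumptions made for illustration, and only "Cl" and "Br" are handled as two-character atoms. As in the figure, the vocabulary is reduced to the tokens of a single molecule.

```python
import re

# Assumed token pattern (illustrative): a bracketed environment such as
# "[nH]" first, then the two-character atoms "Br" and "Cl", then any
# remaining single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def build_vocabulary(smiles_list):
    """Map every distinct token found in smiles_list to an integer index."""
    tokens = sorted({tok for smi in smiles_list for tok in tokenize(smi)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def one_hot(smiles, vocab):
    """Return one row per token, with a single 1 at the token's index."""
    tokens = tokenize(smiles)
    matrix = [[0] * len(vocab) for _ in tokens]
    for row, tok in zip(matrix, tokens):
        row[vocab[tok]] = 1
    return matrix


# Assumed SMILES for 4-(chloromethyl)-1H-imidazole.
smiles = "ClCc1c[nH]cn1"
tokens = tokenize(smiles)
vocab = build_vocabulary([smiles])   # reduced vocabulary, one molecule only
encoded = one_hot(smiles, vocab)

print(tokens)
for tok, row in zip(tokens, encoded):
    print(f"{tok:>4} {row}")
```

With a full training set, `build_vocabulary` would be called on all training SMILES, so each one-hot row would have one column per token in the complete vocabulary rather than the handful shown here.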

A single molecule can be represented by many different SMILES. Algorithms that always represent a given molecule with the same SMILES are referred to as canonicalization algorithms [26]. However, different implementations of these algorithms can still produce different SMILES.