The generative model has two modes of processing sequences: training and generating. In the training mode, at each time step, the generative network takes a current prefix of the training object and predicts the probability distribution of the next character. The next character is then sampled from this predicted distribution and compared with the ground truth. On the basis of this comparison, the cross-entropy loss function is calculated, and the parameters of the model are updated. In the generating mode, at each time step, the generative network takes a prefix of the already generated sequence and, as in the training mode, predicts the probability distribution of the next character and samples it from this predicted distribution. In the generating mode, the model parameters are not updated.
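The two modes can be sketched as follows. This is a minimal illustration using a plain character-level GRU in PyTorch rather than the Stack-RNN described in the text; the class name, vocabulary indices, and hyperparameters are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class CharRNNGenerator(nn.Module):
    """Hypothetical character-level generator standing in for the Stack-RNN."""
    def __init__(self, vocab_size, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, prefix, hidden=None):
        # prefix: (batch, seq_len) tensor of character indices
        emb = self.embed(prefix)
        output, hidden = self.rnn(emb, hidden)
        return self.out(output), hidden   # logits: (batch, seq_len, vocab_size)

def training_step(model, optimizer, sequence):
    """Training mode: predict each next character of a SMILES string and
    update parameters with the cross-entropy loss."""
    inputs, targets = sequence[:, :-1], sequence[:, 1:]
    logits, _ = model(inputs)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def generate(model, start_idx, end_idx, max_len=100):
    """Generating mode: repeatedly sample the next character from the predicted
    distribution; no parameter updates are performed."""
    tokens, hidden = [start_idx], None
    for _ in range(max_len):
        logits, hidden = model(torch.tensor([[tokens[-1]]]), hidden)
        probs = torch.softmax(logits[0, -1], dim=-1)
        next_idx = torch.multinomial(probs, 1).item()
        if next_idx == end_idx:
            break
        tokens.append(next_idx)
    return tokens[1:]
```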
At the second stage, we combined the generative and predictive models into one RL system. In this system, the generative model plays the role of an agent whose action space is represented by the SMILES notation alphabet and whose state space is represented by all possible strings in this alphabet. The predictive model plays the role of a critic that estimates the agent's behavior by assigning a numerical reward to every generated molecule (that is, SMILES string). The reward is a function of the numerical property calculated by the predictive model. At this stage, the generative model is trained to maximize the expected reward. The entire pipeline is illustrated in
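A schematic REINFORCE-style policy-gradient update for this stage is sketched below. It reuses the generator interface from the previous snippet; `predict_property`, `reward_fn`, the batch size, and the start/end token indices are illustrative placeholders rather than the published implementation.

```python
import torch

def sample_with_log_probs(model, start_idx, end_idx, max_len=100):
    """Sample one sequence and record the log-probability of each chosen character."""
    tokens, log_probs, hidden = [start_idx], [], None
    for _ in range(max_len):
        logits, hidden = model(torch.tensor([[tokens[-1]]]), hidden)
        dist = torch.distributions.Categorical(logits=logits[0, -1])
        next_idx = dist.sample()
        log_probs.append(dist.log_prob(next_idx))
        if next_idx.item() == end_idx:
            break
        tokens.append(next_idx.item())
    return tokens[1:], log_probs

def rl_step(generator, optimizer, predict_property, reward_fn,
            start_idx, end_idx, batch_size=16):
    """One policy-gradient update: the agent generates molecules, the critic
    scores them, and the generator is moved toward higher expected reward."""
    optimizer.zero_grad()
    loss = 0.0
    for _ in range(batch_size):
        tokens, log_probs = sample_with_log_probs(generator, start_idx, end_idx)
        # Critic: the predictive model estimates the property of the molecule,
        # and the reward is a function of that predicted value.
        reward = reward_fn(predict_property(tokens))
        # Maximizing expected reward = minimizing -reward * log-likelihood.
        loss = loss - reward * torch.stack(log_probs).sum()
    (loss / batch_size).backward()
    optimizer.step()
```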
We trained a Stack-RNN as the generative model. As mentioned above, we used the ChEMBL database of drug-like compounds for training. ChEMBL includes approximately 1.5 million SMILES strings; however, we selected only molecules whose SMILES strings were fewer than 100 characters long. The cutoff of 100 was chosen because more than 97% of the SMILES strings in the training data set had 100 characters or fewer (see fig. S6).
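The length filter amounts to a simple preprocessing pass over the SMILES file; the snippet below is an illustrative sketch, and the file name is a placeholder rather than the actual ChEMBL export used.

```python
# Keep only SMILES strings shorter than 100 characters before training.
MAX_LEN = 100

with open("chembl_smiles.txt") as f:
    smiles_list = [line.strip() for line in f if line.strip()]

filtered = [s for s in smiles_list if len(s) < MAX_LEN]
print(f"Kept {len(filtered)} of {len(smiles_list)} SMILES "
      f"({100 * len(filtered) / len(smiles_list):.1f}%)")
```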