and optimizations to improve performance. Although an MPNN should
ideally be able to extract any information about
a molecule that might be relevant to predicting a given property,
two limitations may prevent this in practice. First, many property
prediction data sets are very small, i.e., on the order of only hundreds
or thousands of molecules. With so little data, MPNNs are unable to
learn to identify and extract all features of a molecule that might
be relevant to property prediction, and they are susceptible to overfitting
to artifacts in the data. Second, most MPNNs use fewer message passing
steps than the diameter of the molecular graph, i.e., T < diam(G), meaning
that atoms more than T bonds apart never receive messages about
each other. This results in a molecular
representation that is fundamentally local rather than global in nature,
meaning the MPNN may struggle to predict properties that depend heavily
on global features.
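To make the second limitation concrete, the receptive field of T message passing steps can be compared directly with a molecule's graph diameter. Below is a minimal sketch, not taken from the paper, that uses RDKit's topological distance matrix; the function name and the example molecule are our own illustrative choices.

```python
# Minimal sketch of the locality argument: with T message passing steps,
# atoms more than T bonds apart never exchange information, which happens
# whenever T < diam(G).
from rdkit import Chem
from rdkit.Chem import rdmolops

def exceeds_receptive_field(smiles: str, T: int) -> bool:
    """Return True if the molecular graph's diameter exceeds T."""
    mol = Chem.MolFromSmiles(smiles)
    # Topological (bond-count) distances between all atom pairs.
    dist = rdmolops.GetDistanceMatrix(mol)
    return int(dist.max()) > T

# Decanoic acid has a long chain, so 3 steps cannot span the molecule.
print(exceeds_receptive_field("CCCCCCCCCC(=O)O", T=3))  # True
```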
To counter these limitations, we
introduce a variant of the D-MPNN that incorporates 200 global molecular
features that can be computed rapidly in silico using
RDKit. The neural network architecture requires that these features
be appropriately scaled, both to prevent features with large ranges
from dominating those with small ranges and to avoid situations where
features in the training set are not drawn from the same distribution
as features in the test set. To prevent these issues, a large sample
of molecules was used to fit cumulative distribution functions (CDFs) to
all features. CDFs were used as opposed to simpler scaling algorithms
mainly because CDFs have the useful property that each scaled value has the
same meaning: the percentage of the population observed below the
raw feature value. Min-max scaling is easily biased by outliers,
and Z-score scaling assumes a normal distribution, which is rarely
the case for chemical features, especially those based on
counts.
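As a small illustration of this rationale (our own sketch, not code from the paper), the following compares min-max scaling against an empirical CDF on a skewed, count-like feature containing one outlier; scikit-learn's QuantileTransformer with a uniform output distribution serves as the empirical CDF here.

```python
# An empirical CDF maps each raw value to the fraction of the reference
# population lying below it, so a single outlier cannot compress the rest
# of the range the way min-max scaling does.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer

rng = np.random.default_rng(0)
# A count-like, skewed feature with a single extreme outlier appended.
feature = np.append(rng.poisson(2, size=999), 500).reshape(-1, 1)

minmax = MinMaxScaler().fit_transform(feature)
# QuantileTransformer with uniform output acts as an empirical CDF.
cdf = QuantileTransformer(n_quantiles=1000,
                          output_distribution="uniform").fit_transform(feature)

print(minmax[:999].max())  # ~0.02: the outlier squashes all typical values
print(cdf[:999].max())     # ~1.0: percentiles still span the full range
```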
The CDFs were fit to a sample of 100k compounds from
the Novartis
internal catalog using the distributions available in the scikit-learn
package,45 a sample of which can be seen in the accompanying figure. A similar
normalization could be performed using publicly available databases such
as ZINC46 and PubChem.47 scikit-learn was used primarily due to the simplicity of
the fitting and the final application. However, more sophisticated techniques
could be used in the future to fit the empirical CDFs, such as finding
the best-fit generalized logistic function, which has been shown to be
successful for other biological data sets.48 No manual review was performed to remove odd distributions. For example, azides
are hazardous and rarely used outside of a few specific reactions,
as reflected in the fr_azide distribution. Because the Novartis internal catalog is
used for chemical screening against biological targets, the distribution
used here may not accurately reflect the distribution of reagents
used for chemical synthesis. For the full list of calculated features,
please refer to the Supporting Information.
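To sketch what this fitting step might look like end to end (a hedged illustration, not the paper's implementation), the snippet below computes one RDKit descriptor over a tiny reference set, fits a parametric CDF with scipy.stats, and normalizes a new value to its percentile. The four reference SMILES, the use of MolWt as a stand-in for the ~200 features, and the log-normal distribution are all placeholder assumptions; the paper fit its distributions to 100k Novartis compounds.

```python
# Hedged sketch: fit a CDF to a reference sample of one feature, then map
# raw values to the fraction of the reference population below them.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from scipy import stats

reference_smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC"]
mols = [Chem.MolFromSmiles(s) for s in reference_smiles]

# Descriptors.MolWt stands in for each of the ~200 RDKit features.
values = np.array([Descriptors.MolWt(m) for m in mols])

# Fit a candidate distribution (log-normal here) to the reference sample.
params = stats.lognorm.fit(values)

def normalize(raw: float) -> float:
    """CDF value: fraction of the reference population below `raw`."""
    return float(stats.lognorm.cdf(raw, *params))

print(normalize(Descriptors.MolWt(Chem.MolFromSmiles("CCCCO"))))
```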
To incorporate these features, we modify the readout phase of the
D-MPNN to apply the feed-forward neural network f to the concatenation of the learned molecule feature vector h and the computed global features h_f. This
is a very general method of incorporating
external information and can be used with any MPNN and any computed
features or descriptors.
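As a concrete sketch of this modified readout (our simplified rendering, with illustrative layer sizes rather than the paper's exact hyperparameters), a PyTorch module might look like the following.

```python
# The learned molecule vector h is concatenated with the precomputed
# global features h_f before the feed-forward network f is applied.
import torch
import torch.nn as nn

class ReadoutWithFeatures(nn.Module):
    def __init__(self, hidden_dim: int = 300, n_features: int = 200,
                 n_tasks: int = 1):
        super().__init__()
        # f operates on the concatenation of h and h_f.
        self.f = nn.Sequential(
            nn.Linear(hidden_dim + n_features, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_tasks),
        )

    def forward(self, h: torch.Tensor, h_f: torch.Tensor) -> torch.Tensor:
        return self.f(torch.cat([h, h_f], dim=-1))

readout = ReadoutWithFeatures()
h = torch.randn(8, 300)    # learned molecule vectors from the D-MPNN
h_f = torch.randn(8, 200)  # scaled global RDKit features
print(readout(h, h_f).shape)  # torch.Size([8, 1])
```

Because the concatenation happens only at readout, the message passing phase is untouched, which is why this scheme works with any MPNN and any set of computed descriptors.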