and optimizations to improve performance. Although an MPNN should
ideally be able to extract any information about
a molecule that might be relevant to predicting a given property,
two limitations may prevent this in practice. First, many property
prediction data sets are very small, i.e., on the order of only hundreds
or thousands of molecules. With so little data, MPNNs are unable to
learn to identify and extract all features of a molecule that might
be relevant to property prediction, and they are susceptible to overfitting
to artifacts in the data. Second, most MPNNs use fewer message passing
steps than the diameter of the molecular graph, i.e., T < diam(G), meaning
that atoms more than T bonds apart never receive messages about
each other. This results in a molecular
representation that is fundamentally local rather than global in nature,
meaning the MPNN may struggle to predict properties that depend heavily
on global features.
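To make the second limitation concrete, the receptive field of T message passing steps can be compared directly with a molecule's graph diameter. Below is a minimal sketch, not taken from the paper, that uses RDKit's topological distance matrix; the function name and the example molecule are our own illustrative choices.

```python
# Minimal sketch of the locality argument: with T message passing steps,
# atoms more than T bonds apart never exchange information, which happens
# whenever T < diam(G).
from rdkit import Chem
from rdkit.Chem import rdmolops

def exceeds_receptive_field(smiles: str, T: int) -> bool:
    """Return True if the molecular graph's diameter exceeds T."""
    mol = Chem.MolFromSmiles(smiles)
    # Topological (bond-count) distances between all atom pairs.
    dist = rdmolops.GetDistanceMatrix(mol)
    return int(dist.max()) > T

# Decanoic acid has a long chain, so 3 steps cannot span the molecule.
print(exceeds_receptive_field("CCCCCCCCCC(=O)O", T=3))  # True
```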
To counter these limitations, we
introduce a variant of the D-MPNN that incorporates 200 global molecular
features that can be computed rapidly in silico using
RDKit. The neural network architecture requires that these features
be appropriately scaled, both to prevent features with large ranges
from dominating those with small ranges and to avoid situations where
features in the training set are not drawn from the same distribution
as features in the test set. To prevent these issues, a large sample
of molecules was used to fit cumulative distribution functions (CDFs) to
all features. CDFs were used as opposed to simpler scaling algorithms
mainly because CDFs have the useful property that each scaled value has the
same meaning: the percentage of the population observed below the
raw feature value. Min-max scaling is easily biased by outliers,
and Z-score scaling assumes a normal distribution, which is rarely
the case for chemical features, especially those based on
counts.
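As a small illustration of this rationale (our own sketch, not code from the paper), the following compares min-max scaling against an empirical CDF on a skewed, count-like feature containing one outlier; scikit-learn's QuantileTransformer with a uniform output distribution serves as the empirical CDF here.

```python
# An empirical CDF maps each raw value to the fraction of the reference
# population lying below it, so a single outlier cannot compress the rest
# of the range the way min-max scaling does.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer

rng = np.random.default_rng(0)
# A count-like, skewed feature with a single extreme outlier appended.
feature = np.append(rng.poisson(2, size=999), 500).reshape(-1, 1)

minmax = MinMaxScaler().fit_transform(feature)
# QuantileTransformer with uniform output acts as an empirical CDF.
cdf = QuantileTransformer(n_quantiles=1000,
                          output_distribution="uniform").fit_transform(feature)

print(minmax[:999].max())  # ~0.02: the outlier squashes all typical values
print(cdf[:999].max())     # ~1.0: percentiles still span the full range
```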
The CDFs were fit to a sample of 100k compounds from
the Novartis
internal catalog using the distributions available in the scikit-learn
package,45 a sample of which can be seen in the accompanying figure. A similar
normalization could be performed using publicly available databases such
as ZINC46 and PubChem.47 scikit-learn was used primarily due to the simplicity of
the fitting and the final application. However, more sophisticated techniques
could be used in the future to fit the empirical CDFs, such as finding
the best-fit generalized logistic function, which has been shown to be
successful for other biological data sets.48 No manual review was performed to remove odd distributions. For example, azides
are hazardous and rarely used outside of a few specific reactions,
as reflected in the fr_azide distribution. Because the Novartis internal catalog is
used for chemical screening against biological targets, the distribution
used here may not accurately reflect the distribution of reagents
used for chemical synthesis. For the full list of calculated features,
please refer to the Supporting Information.
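To sketch what this fitting step might look like end to end (a hedged illustration, not the paper's implementation), the snippet below computes one RDKit descriptor over a tiny reference set, fits a parametric CDF with scipy.stats, and normalizes a new value to its percentile. The four reference SMILES, the use of MolWt as a stand-in for the ~200 features, and the log-normal distribution are all placeholder assumptions; the paper fit its distributions to 100k Novartis compounds.

```python
# Hedged sketch: fit a CDF to a reference sample of one feature, then map
# raw values to the fraction of the reference population below them.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from scipy import stats

reference_smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC"]
mols = [Chem.MolFromSmiles(s) for s in reference_smiles]

# Descriptors.MolWt stands in for each of the ~200 RDKit features.
values = np.array([Descriptors.MolWt(m) for m in mols])

# Fit a candidate distribution (log-normal here) to the reference sample.
params = stats.lognorm.fit(values)

def normalize(raw: float) -> float:
    """CDF value: fraction of the reference population below `raw`."""
    return float(stats.lognorm.cdf(raw, *params))

print(normalize(Descriptors.MolWt(Chem.MolFromSmiles("CCCCO"))))
```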
To incorporate these features, we modify the readout phase of the
D-MPNN to apply the feed-forward neural network f to the concatenation of the learned molecule feature vector h and the computed global features h_f. This
is a very general method of incorporating
external information and can be used with any MPNN and any computed
features or descriptors.
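As a concrete sketch of this modified readout (our simplified rendering, with illustrative layer sizes rather than the paper's exact hyperparameters), a PyTorch module might look like the following.

```python
# The learned molecule vector h is concatenated with the precomputed
# global features h_f before the feed-forward network f is applied.
import torch
import torch.nn as nn

class ReadoutWithFeatures(nn.Module):
    def __init__(self, hidden_dim: int = 300, n_features: int = 200,
                 n_tasks: int = 1):
        super().__init__()
        # f operates on the concatenation of h and h_f.
        self.f = nn.Sequential(
            nn.Linear(hidden_dim + n_features, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_tasks),
        )

    def forward(self, h: torch.Tensor, h_f: torch.Tensor) -> torch.Tensor:
        return self.f(torch.cat([h, h_f], dim=-1))

readout = ReadoutWithFeatures()
h = torch.randn(8, 300)    # learned molecule vectors from the D-MPNN
h_f = torch.randn(8, 200)  # scaled global RDKit features
print(readout(h, h_f).shape)  # torch.Size([8, 1])
```

Because the concatenation happens only at readout, the message passing phase is untouched, which is why this scheme works with any MPNN and any set of computed descriptors.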