loaded into the database (2D only) as follows. We harvest tagged values
from selected source SDF files. Names and CAS numbers are loaded into
a synonyms table, while selected bioactivity and other data
are stored in a provided_values table. We convert SDF to SMILES (ref 98)
using RDKit and take the largest organic part
of the compound (desalting), enumerating up to four stereoisomers
from stereochemically ambiguous SMILES using OEChem TK version 1.7
(OpenEye Scientific Software, Santa Fe, NM). Because of the combinatorial
problem of ambiguous stereocenters in sterols, we use SMARTS filters
to prioritize the most probable implied stereoisomers based on biosynthetic
pathways (Prof. Leslie Kuhn, private communication; ref 99). The SMILES are neutralized with mitools (
are loaded using Python/RDKit scripts by attempting to map them to
existing ZINC IDs or creating new ZINC substances as necessary, as
well as any additional required data structures. InChIs and InChIKeys
are calculated on loading, and the InChIKey is used as a unique constraint
in the database. 512-bit Morgan fingerprints with radius 2 (effectively
ECFP4) are calculated for each molecule using RDKit (ref 99).
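
The map-or-create step with an InChIKey unique constraint can be illustrated with a minimal sketch. It is not the actual loading code: it assumes an SQLite database, a hypothetical table named substance with hypothetical columns zinc_id, smiles, and inchikey, and InChIKeys that have already been computed upstream (for example with RDKit). The helper names open_db and map_or_create are likewise illustrative.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open a database and create the (hypothetical) substance table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS substance (
            zinc_id  INTEGER PRIMARY KEY AUTOINCREMENT,
            smiles   TEXT NOT NULL,
            inchikey TEXT NOT NULL UNIQUE  -- the unique constraint
        )
        """
    )
    return conn


def map_or_create(conn, smiles, inchikey):
    """Return (zinc_id, created) for a structure.

    If the InChIKey is already present, the existing ID is reused;
    otherwise a new substance row is created.
    """
    row = conn.execute(
        "SELECT zinc_id FROM substance WHERE inchikey = ?", (inchikey,)
    ).fetchone()
    if row is not None:
        return row[0], False
    cur = conn.execute(
        "INSERT INTO substance (smiles, inchikey) VALUES (?, ?)",
        (smiles, inchikey),
    )
    conn.commit()
    return cur.lastrowid, True


# Two SMILES spellings of ethanol share one InChIKey, so the second
# call maps to the substance created by the first.
conn = open_db()
first_id, first_created = map_or_create(conn, "CCO", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N")
second_id, second_created = map_or_create(conn, "OCC", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N")
```

In this sketch the lookup handles the common case, while the UNIQUE constraint acts as a safeguard: any insert of a duplicate InChIKey that bypasses the lookup is rejected by the database itself.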