To speed up annotation and improve its accuracy, we manually pre-collected a list of 353 ligands suspected to be non-biological (the 'artifact list'), which are frequently used in protein structure determination (including crystallization additives, non-biological ions and heavy metals). To generate this list, we first collected all ligands that are observed more than 20 times in known protein structures. The list was then refined by analyzing the possible biological role of each ligand; for example, a ligand was removed from the list if it was found to have biological relevance in the literature associated with the structure file or if it is present in the KEGG database (26). This list is used to help assess the biological relevance of each ligand in the PDB automatically (
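The automated part of this list construction can be illustrated with a minimal sketch. It assumes that a ligand is counted once per structure (the text does not say whether occurrences are counted per structure or per copy) and that KEGG membership is available as a plain set of ligand IDs. The function name, the input layout and the toy data are hypothetical, and the literature-based refinement is a manual step that is not modelled here.

```python
from collections import Counter


def build_candidate_artifact_list(structure_ligands, kegg_ids, min_count=20):
    """Return ligand IDs seen in more than `min_count` structures and absent from KEGG.

    structure_ligands: dict mapping a PDB ID to the ligand IDs found in that entry
                       (hypothetical input layout).
    kegg_ids:          set of ligand IDs known to KEGG (assumed to be pre-mapped).
    """
    counts = Counter()
    for ligands in structure_ligands.values():
        # Assumption: each ligand is counted once per structure.
        counts.update(set(ligands))
    return sorted(
        ligand
        for ligand, n in counts.items()
        if n > min_count and ligand not in kegg_ids
    )


# Toy data with a lowered threshold, for illustration only.
toy = {
    "1AAA": ["GOL", "ATP"],
    "1BBB": ["GOL", "GOL"],
    "1CCC": ["GOL", "ATP"],
    "1DDD": ["ATP", "SO4"],
}
candidates = build_candidate_artifact_list(toy, kegg_ids={"ATP"}, min_count=2)
```

With the toy data, only GOL is frequent enough and absent from the stand-in KEGG set, so it is the only candidate left for manual curation.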
The automated filtering procedure consists of four steps:
First, if the candidate ligand is in the artifact list and appears more than 15 times in the same structure file, it is likely to be a crystallization additive and is considered biologically irrelevant.
Second, the contacts between the receptor and ligand atoms are computed. The record ‘REMARK 350’ in the asymmetric unit files is used to exclude crystallization neighbors. This record specifies which chains of the structure should be put together, and the mathematical transformations (i.e. rotation and translation matrices) applied to each chain, to generate biomolecules (i.e. biological unit files). The contacts between two chains are evaluated only when both chains are used to generate the same biomolecule. A receptor residue is defined as a ligand-binding site residue if the closest atomic distance between the residue and the ligand is within a distance cutoff. The cutoff is set to 0.5 Å plus the sum of the van der Waals radii of the two atoms under investigation (7). If the number of binding site residues (i.e. number of contacts) is less than two, or if all the binding site residues are consecutive in sequence, the ligand is deemed biologically irrelevant, because most biologically relevant ligands are tethered by multiple residues that are far apart in sequence.
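The contact criterion can be sketched as follows. This is an illustration under stated assumptions, not the implementation used for BioLiP: the radii are approximate Bondi values (the source does not list the radii it uses), a residue counts as being in contact if any residue-ligand atom pair lies within that pair's cutoff, residues are identified by integer sequence numbers without insertion codes, and the input is assumed to contain only chains that belong to the same biomolecule according to REMARK 350. All names are hypothetical.

```python
import math

# Approximate van der Waals radii in Å (Bondi values); an assumption of this sketch.
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "P": 1.80, "S": 1.80}
DEFAULT_RADIUS = 1.70  # hypothetical fallback for elements not listed above
MARGIN = 0.5           # Å, added to the sum of the two radii


def in_contact(residue_atoms, ligand_atoms):
    """True if any residue-ligand atom pair is within its pair-specific cutoff.

    Atoms are (element, (x, y, z)) tuples with coordinates in Å.
    """
    for elem_r, xyz_r in residue_atoms:
        for elem_l, xyz_l in ligand_atoms:
            cutoff = (MARGIN
                      + VDW_RADII.get(elem_r, DEFAULT_RADIUS)
                      + VDW_RADII.get(elem_l, DEFAULT_RADIUS))
            if math.dist(xyz_r, xyz_l) <= cutoff:
                return True
    return False


def binding_site_residues(residues, ligand_atoms):
    """residues: dict mapping residue sequence number -> list of atoms."""
    return sorted(n for n, atoms in residues.items() if in_contact(atoms, ligand_atoms))


def passes_contact_filter(site):
    """Reject sites with fewer than two residues or only consecutive residues."""
    if len(site) < 2:
        return False
    all_consecutive = all(b - a == 1 for a, b in zip(site, site[1:]))
    return not all_consecutive


# Toy example: three residues near the ligand, one far away.
residues = {
    10: [("C", (0.0, 0.0, 0.0))],
    11: [("N", (3.0, 0.0, 0.0))],
    50: [("O", (0.0, 3.5, 0.0))],
    90: [("C", (20.0, 0.0, 0.0))],
}
ligand = [("O", (1.5, 1.5, 0.0))]
site = binding_site_residues(residues, ligand)
```

In the toy example, residues 10, 11 and 50 are in contact. Because residue 50 is distant in sequence from 10 and 11, the ligand passes the filter, whereas a site made only of residues 10 and 11 would be rejected.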
Third, if the ligand is not present in the artifact list, it is considered biologically relevant and kept in the pipeline for further manual verification.
Fourth, the PubMed abstract is used to filter out biologically irrelevant ligands. If the ligand is in the artifact list, the simplest approach is to treat it as biologically irrelevant and discard it. However, this would miss ligands that are in fact biologically relevant in some cases (false negatives). For instance, glycerol (ligand ID ‘GOL’) is one of the most frequently used crystallization additives and is thus regarded as biologically irrelevant by many existing databases. Nevertheless, this ligand can have a biological role in some proteins. For example, glycerol binds to the enzyme diol dehydratase (PDB ID: 3AUJ) with binding affinity Km = 1.2 ± 0.02 mM, and its biological role is described as ‘glycerol is bound to the substrate binding site in the (β/α)8 or TIM barrel of the diol dehydratase α subunit’ (27). Thus, this ligand is considered biologically relevant for this protein and is added to BioLiP. We found that if a ligand has a biological role in a protein, this role is often mentioned in the PubMed abstract. Based on this observation, we propose to use the PubMed abstract as an additional filter. To this end, the chemical names and synonyms of the ligand (curated from the ChEBI, PubChem and PDB databases) are compared with the PubMed abstract. If there is no hit in this comparison, the ligand is deemed biologically irrelevant. Otherwise, the ligand is considered possibly biologically relevant, which remains to be verified by hand in the next step.
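The four steps above can be summarised in a short sketch. It is a minimal illustration, assuming that the contact filter of the second step has already been evaluated and is passed in as a boolean, and that names are matched against the abstract as whole, case-insensitive phrases (the text does not specify how the comparison is performed). The function names and return labels are hypothetical.

```python
import re


def abstract_mentions_ligand(abstract, names):
    """True if any ligand name or synonym occurs in the abstract as a whole phrase."""
    text = abstract.lower()
    for name in names:
        # Lookarounds keep a short name from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(name.lower()) + r"(?!\w)"
        if re.search(pattern, text):
            return True
    return False


def classify_ligand(ligand_id, copies_in_file, contact_filter_ok,
                    artifact_list, abstract, names):
    """Return 'irrelevant' or 'needs_manual_check' for one ligand in one structure."""
    in_list = ligand_id in artifact_list
    # Step 1: many copies of a listed ligand suggest a crystallization additive.
    if in_list and copies_in_file > 15:
        return "irrelevant"
    # Step 2: too few contacts, or only residues that are consecutive in sequence.
    if not contact_filter_ok:
        return "irrelevant"
    # Step 3: ligands outside the artifact list are kept for manual verification.
    if not in_list:
        return "needs_manual_check"
    # Step 4: listed ligands are kept only if the abstract mentions them.
    if abstract_mentions_ligand(abstract, names):
        return "needs_manual_check"
    return "irrelevant"


artifacts = {"GOL", "EDO"}
names = ["glycerol", "glycerin"]
abstract = "Glycerol is bound to the substrate binding site of the enzyme."
kept = classify_ligand("GOL", 2, True, artifacts, abstract, names)
dropped = classify_ligand("GOL", 2, True, artifacts, "Structure of a kinase.", names)
```

Here `kept` is flagged for manual checking because the abstract names the ligand, while `dropped` is discarded. A looser matching scheme, for example one that matches individual words of a multi-word synonym, would produce more hits and therefore more entries to verify by hand.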
Finally, manual verification is performed to check suspicious or ambiguous entries, that is, entries involving commonly used crystallization additives such as glycerol, ethanol, methanol, 2-propanol, ethylene glycol, hexylene glycol and polyethylene glycol. Ligands that pass the four steps above can sometimes still be false positives, usually because of an unexpected match between the ligand names/synonyms and the PubMed abstract. Taking glycerol again as an example, it has the synonym ‘glycyl alcohol’, which leads to an unexpected match of the term ‘alcohol’ for the protein ‘arylesterase’ (PDB ID: 3HI4). Therefore, manual verification of ligands that are commonly used as crystallization additives is necessary to ensure the quality of BioLiP. Currently, we perform this manual verification mainly by reading the original literature and consulting other secondary databases. In the current version of BioLiP, manual verification helped us to remove ∼12 500 entries that were false positives and to add ∼3000 entries that would have been missed by the automated procedure alone.