We compared DNorm against several strong baseline methods. An exact string-matching method checks whether disease names in text match terms from a controlled terminology and is therefore expected to have difficulty with term variability, especially variations not foreseen during the creation of the lexicon. In addition, precision may be affected by ambiguous or nested terms. Norm, from the SPECIALIST lexical tools (http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/2013/docs/userDoc/tools/norm.html), is a publicly available resource of the National Library of Medicine designed to address these issues by normalizing case, plurals, inflections and word order. We used Norm to process all disease names and synonyms in MEDIC, as well as the set of all strings and substrings of each PMID document in the NCBI disease corpus. When Norm mapped a text string from a PubMed abstract in the NCBI testing set to a disease name in the MEDIC lexicon, that disease mention was grounded with the corresponding MEDIC concept. For nested disease mentions, we kept the longest string that mapped to a MEDIC entry term or synonym. The results of this string-matching method are reported as NLM Lexical Normalization in the ‘Results’ section.
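The normalize-then-look-up strategy described above can be sketched as follows. This is only an approximation, assuming a toy two-entry lexicon: the real NLM Norm tool also handles inflections, genitives and spelling variants.

```python
# Sketch of Norm-style string normalization for lexicon lookup.
# Approximation only: the actual Norm tool covers many more variant types.

def normalize(term):
    """Lowercase, strip punctuation, naively de-pluralize, and sort tokens
    so that word order and plural forms no longer block a match."""
    tokens = "".join(c if c.isalnum() else " " for c in term.lower()).split()
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return " ".join(sorted(tokens))

# Hypothetical MEDIC-like lexicon: normalized name -> concept ID
lexicon = {normalize(name): cid for name, cid in [
    ("Breast Neoplasms", "MESH:D001943"),
    ("Colorectal Neoplasms", "MESH:D015179"),
]}

def lookup(mention):
    """Ground a mention if its normalized form appears in the lexicon."""
    return lexicon.get(normalize(mention))

print(lookup("Breast Neoplasm"))  # -> MESH:D001943 (number normalized)
print(lookup("neoplasm of the breast"))  # -> None (extra tokens block match)
```

As the second call shows, normalization alone still misses syntactic variants that introduce extra tokens, which is precisely the limitation discussed above.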
Our second baseline method applied MetaMap (Aronson, 2001). MetaMap is another public resource of the National Library of Medicine and the state-of-the-art natural language processing tool for identifying UMLS Metathesaurus concepts in biomedical text. MetaMap first splits the input text into sentences and then splits each sentence into phrases. For each phrase, MetaMap identifies several possible mappings to UMLS, with multiple candidates for each, based on lexical lookup and variant generation, and associates a score with each candidate. In this work, we used MetaMap to identify all UMLS concept identifiers (CUIs) in the PubMed abstracts composing the NCBI disease corpus. Then, for each abstract, we used UMLS to map the CUIs to their respective MeSH descriptors and OMIM identifiers. We retained the CUIs we were able to map to either MeSH or OMIM IDs in MEDIC and dropped all others. These results are reported as MetaMap.
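The post-processing step described above amounts to filtering MetaMap's CUIs through a CUI-to-MEDIC mapping. A minimal sketch, assuming hypothetical identifier tables in place of the actual UMLS and MEDIC data:

```python
# Sketch of the CUI filtering step: keep only CUIs that map to a MeSH or
# OMIM identifier present in MEDIC; drop all others.
# All identifiers below are illustrative stand-ins for real UMLS/MEDIC data.

cui_to_medic = {            # UMLS CUI -> MeSH/OMIM ID (from UMLS mappings)
    "C0000001": "MESH:D001943",
    "C0000002": "MESH:D015179",
    "C0000003": "OMIM:999999",   # maps, but the ID is not in MEDIC
}
medic_ids = {"MESH:D001943", "MESH:D015179"}  # IDs present in MEDIC

def ground(cuis):
    """Map each CUI to a MEDIC ID, dropping CUIs with no retained mapping."""
    return [cui_to_medic[c] for c in cuis
            if cui_to_medic.get(c) in medic_ids]

print(ground(["C0000001", "C0000003", "C9999999"]))  # -> ['MESH:D001943']
```

Both the unmappable CUI and the CUI whose target ID is absent from MEDIC are dropped, matching the retention rule stated above.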
We also compare with the benchmark results on the NCBI disease corpus, obtained using the Inference method (Islamaj Doğan and Lu, 2012b). This method was developed on a manually annotated set of PubMed abstract sentences that reflected the consensus annotations of the EBI disease corpus and the AZDC disease corpus (the only available data at the time). The Inference method achieved an F-measure of 79% and was able to link disease mentions to their corresponding medical vocabulary entries with high precision. Its basis was a Lucene search that first matched a disease mention against the MEDIC vocabulary; the method then applied a combination of string-matching rules to re-rank the results and report the top-ranked one. A strong advantage of the Inference method was its incorporation of abbreviation definition detection, exploiting the fact that the long form of a disease abbreviation is usually defined elsewhere in the same document. Once the abbreviation was resolved, the mapping of the long form was used to infer the mapping of the abbreviated mention. To evaluate the Inference method’s performance, BANNER was first applied to each PubMed abstract to identify disease name strings; the Inference method was then applied to normalize each mention to a MEDIC concept.
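The abbreviation-inference idea can be sketched in a few lines. This is a simplified stand-in for the actual Inference method: the definition pattern is a crude version of standard abbreviation detection, and the lexicon entry is hypothetical.

```python
import re

# Sketch of abbreviation inference: when "long form (ABBR)" is defined in
# the abstract, ground the long form and reuse its concept for later
# occurrences of the short form. Pattern and lexicon are simplified.

lexicon = {"ataxia telangiectasia": "MESH:D001260"}  # hypothetical entry

def build_abbrev_map(text):
    """Collect 'long form (ABBR)' definitions whose preceding words'
    initials spell out the abbreviation."""
    abbrevs = {}
    for m in re.finditer(r"\(([A-Z]{2,6})\)", text):
        short = m.group(1)
        words = text[:m.start()].split()
        cand = words[-len(short):]  # one word per abbreviation letter
        if [w[0].upper() for w in cand] == list(short):
            abbrevs[short] = " ".join(cand).lower()
    return abbrevs

def normalize_mention(mention, text):
    """Ground a mention, expanding it first if it is a known abbreviation."""
    abbrevs = build_abbrev_map(text)
    key = abbrevs.get(mention, mention.lower())
    return lexicon.get(key)

abstract = "Patients with ataxia telangiectasia (AT) were studied. AT is rare."
print(normalize_mention("AT", abstract))  # inferred via the defined long form
```

The short form "AT" is ambiguous on its own; resolving it through the in-document definition is what gives the Inference method its advantage on abbreviated mentions.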
Our next baseline method uses the same processing pipeline as DNorm but replaces our candidate generation method with Lucene, an important component in several previous systems for normalizing biomedical entities (Huang et al., 2011a; Wermter et al., 2009). We loaded MEDIC into a Lucene index, creating one Lucene document for each concept–name pair. Mentions and names are both processed with the same tokenization and string normalization used in DNorm. A Boolean query is created from the resulting tokens, and the concept of the highest-scoring name is returned. We refer to this method as BANNER + Lucene.
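The query-and-rank step can be illustrated with a toy stand-in for Lucene: each concept–name pair becomes a "document", a Boolean OR query scores names by IDF-weighted token overlap, and the concept of the best-scoring name is returned. Real Lucene scoring (TF-IDF/BM25 with length norms) is more involved; the data below is hypothetical.

```python
import math

# Toy stand-in for the BANNER + Lucene baseline: one "document" per
# concept-name pair, a Boolean OR query, IDF-weighted overlap scoring.

names = [  # hypothetical MEDIC concept-name pairs
    ("MESH:D001943", "breast neoplasms"),
    ("MESH:D001943", "breast cancer"),
    ("MESH:D015179", "colorectal cancer"),
]

def idf(token):
    """Inverse document frequency of a token over the name collection."""
    df = sum(token in name.split() for _, name in names)
    return math.log((1 + len(names)) / (1 + df))

def best_concept(mention):
    """Return the concept of the highest-scoring name, or None on no hit."""
    query = mention.lower().split()
    scored = [(sum(idf(t) for t in query if t in name.split()), concept)
              for concept, name in names]
    score, concept = max(scored)
    return concept if score > 0 else None

print(best_concept("cancer of the breast"))  # -> MESH:D001943
```

Because each name is its own document, a mention needs to overlap only one synonym of a concept, not the canonical name, to retrieve that concept.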
Our final baseline method, which we refer to as BANNER + cosine similarity, also uses the same processing pipeline as DNorm. However, this method also uses the same TF-IDF vectors as DNorm for the mentions and names, so that the only difference is the scoring function. The cosine similarity scoring function is as follows:

score(m, n) = cos(m, n) = (m · n) / (‖m‖ ‖n‖)

where m and n are the TF-IDF vectors of the mention and the candidate name, respectively.
Because this method is equivalent to DNorm with W set to the identity matrix, which is also the value of W before training, it isolates the improvement provided by training the matrix W with pLTR.
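The contrast between the two scoring functions can be sketched numerically. The vectors and the learned weight matrix below are toy values, not trained parameters; the point is that with W equal to the identity and unit-length TF-IDF vectors, the bilinear score mᵀWn reduces exactly to the cosine similarity.

```python
# Sketch: cosine similarity (DNorm with W = identity) versus the bilinear
# score m^T W n with a "learned" W. All numbers are illustrative.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(m, n):
    return dot(m, n) / ((dot(m, m) ** 0.5) * (dot(n, n) ** 0.5))

def bilinear(m, W, n):
    """Score m^T W n; with W = I and unit vectors this equals cosine(m, n)."""
    return dot(m, [dot(row, n) for row in W])

m = [0.6, 0.8, 0.0]   # TF-IDF vector of a mention (unit length)
n = [0.0, 0.6, 0.8]   # TF-IDF vector of a candidate name (unit length)

I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W = [[1.0, 0.2, 0.0],  # hypothetical trained weights: off-diagonal
     [0.2, 1.0, 0.5],  # entries reward pairs of related (non-identical)
     [0.0, 0.5, 1.0]]  # tokens, e.g. synonyms

print(cosine(m, n), bilinear(m, I, n))  # identical for unit vectors
print(bilinear(m, W, n))                # learned W can raise the score
```

Cosine similarity can only reward exact token matches, whereas the off-diagonal entries of a trained W let DNorm credit pairs of different but related tokens, which is where the pLTR training gains come from.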