As a general purpose language representation model, BERT was pre-trained on English Wikipedia and BooksCorpus. However, biomedical domain texts contain a considerable number of domain-specific proper nouns (e.g. BRCA1, c.248T>C) and terms (e.g. transcriptional, antimicrobial), which are understood mostly by biomedical researchers. As a result, NLP models designed for general purpose language understanding often obtain poor performance in biomedical text mining tasks. In this work, we pre-train BioBERT on PubMed abstracts (PubMed) and PubMed Central full-text articles (PMC). The text corpora used for pre-training of BioBERT are listed in Table 1, and the tested combinations of text corpora are listed in Table 2. For computational efficiency, whenever the Wiki + Books corpora were used for pre-training, we initialized BioBERT with the pre-trained BERT model provided by Devlin et al. (2019). We define BioBERT as a language representation model whose pre-training corpora include biomedical corpora (e.g. BioBERT (+ PubMed)).
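The following is a minimal sketch (not the authors' original TensorFlow pipeline) of this initialization strategy: load the general-domain BERT-BASE weights and continue masked language model pre-training on biomedical text instead of training from random initialization. The model name and library calls are assumptions for illustration.

```python
# Sketch: initialize from general-domain BERT weights before continued
# pre-training on biomedical corpora (PubMed abstracts, PMC full text).
from transformers import BertForMaskedLM, BertTokenizer

# Start from BERT-BASE pre-trained on English Wikipedia + BooksCorpus
# (cased checkpoint; model name is an illustrative assumption).
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

# Continued pre-training would then optimize the masked-LM objective over
# batches drawn from PubMed/PMC text; corpus loading and the training loop
# are omitted here for brevity.
```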
For tokenization, BioBERT uses WordPiece tokenization (Wu et al., 2016), which mitigates the out-of-vocabulary issue. With WordPiece tokenization, any new word can be represented by frequent subwords (e.g. Immunoglobulin => I ##mm ##uno ##g ##lo ##bul ##in). We found that using a cased vocabulary (not lower-casing) results in slightly better performance in downstream tasks. Although we could have constructed a new WordPiece vocabulary based on biomedical corpora, we used the original vocabulary of BERTBASE for the following reasons: (i) compatibility of BioBERT with BERT, which allows BERT pre-trained on general domain corpora to be re-used, and makes it easier to interchangeably use existing models based on BERT and BioBERT, and (ii) any new words may still be represented and fine-tuned for the biomedical domain using the original WordPiece vocabulary of BERT.
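A minimal sketch of this behavior is shown below: the original cased WordPiece vocabulary splits an unseen biomedical term into frequent subwords. The model name is an illustrative assumption, and the exact split depends on the vocabulary file used.

```python
# Sketch: WordPiece tokenization of a biomedical term with BERT's
# original cased vocabulary (no new biomedical vocabulary is built).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

print(tokenizer.tokenize("Immunoglobulin"))
# With the vocabulary used in the paper, the term is split as
# ['I', '##mm', '##uno', '##g', '##lo', '##bul', '##in'].
```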