For tokenization, BioBERT uses WordPiece tokenization (Wu et al., 2016), which mitigates the out-of-vocabulary issue. With WordPiece tokenization, any new word can be represented by frequent subwords (e.g., Immunoglobulin => I ##mm ##uno ##g ##lo ##bul ##in). We found that using a cased vocabulary (not lower-casing) results in slightly better performance on downstream tasks. Although we could have constructed a new WordPiece vocabulary based on biomedical corpora, we used the original vocabulary of BERT
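The subword split described above can be illustrated with a minimal sketch of greedy longest-match-first WordPiece tokenization. This is an illustration under stated assumptions, not BioBERT's actual tokenizer code: the function name `wordpiece_tokenize` and the seven-entry `TOY_VOCAB` are hypothetical, chosen only so that the example word from the text can be reproduced. The real BERT vocabulary contains roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup.

    Pieces that do not start the word carry the '##' continuation prefix.
    If no vocabulary entry matches at some position, the whole word maps
    to the unknown token.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary (an assumption for illustration only).
# It is cased: "I" is present, "i" is not.
TOY_VOCAB = {"I", "##mm", "##uno", "##g", "##lo", "##bul", "##in"}

print(wordpiece_tokenize("Immunoglobulin", TOY_VOCAB))
# ['I', '##mm', '##uno', '##g', '##lo', '##bul', '##in']
```

Because the toy vocabulary is cased, the lower-cased input `immunoglobulin` finds no match for its first character and falls back to `[UNK]`. This shows that cased and uncased vocabularies can split the same word differently.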
BioBERT: Biomedical Language Representation
Other organizations: Korea University
Protocol cited in 85 other protocols
Variable analysis
- Text corpora used for pre-training of BioBERT (e.g. PubMed abstracts (PubMed), PubMed Central full-text articles (PMC), Wikipedia + BooksCorpus)
- Performance of NLP models on biomedical text mining tasks
- Tokenization method (WordPiece tokenization)
- Vocabulary used (Original BERT vocabulary)