Using a pretrained tokenizer vs training one from scratch


Suppose I have domain-specific data, say medical drug data with complicated chemical compound names. Would it be beneficial to train a tokenizer on this text if the corpus had nearly 18 M entries? In the BioBERT paper, they kept BERT’s pre-trained WordPiece tokenizer for the following reasons:

  • compatibility of BioBERT with BERT, which allows BERT pre-trained on general domain corpora to be re-used, and makes it easier to interchangeably use existing models based on BERT and BioBERT
  • any new words may still be represented and fine-tuned for the biomedical domain using the original WordPiece vocabulary of BERT.
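The second point relies on WordPiece falling back to subword pieces when a whole word is not in the vocabulary. As a rough illustration (not BERT’s actual implementation), here is a minimal greedy longest-match tokenizer with a tiny hand-made vocabulary — the pieces are hypothetical; BERT’s real vocab has ~30k entries:

```python
def wordpiece_tokenize(word, vocab):
    """Split a word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # nothing matched: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary for illustration only
vocab = {"para", "##ceta", "##mol", "ibu", "##profen", "drug"}
print(wordpiece_tokenize("paracetamol", vocab))  # ['para', '##ceta', '##mol']
print(wordpiece_tokenize("ibuprofen", vocab))    # ['ibu', '##profen']
```

With a general-domain vocabulary, a chemical name typically shatters into many short pieces rather than hitting `[UNK]`, but the mechanism is the same: the word is still representable, just inefficiently.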

How many different chemical compound names are there in the 18 M entries?

Having lots of data is good, but I don’t think training a tokenizer would help unless the words you care about are frequent enough to be selected for the tokenizer’s vocabulary. I’m not sure exactly how the tokenizer chooses its vocabulary, but word frequency must matter. I’d guess that “medical drug data” still contains plenty of ordinary English words, many of them more frequent than the chemical names.
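To make the frequency point concrete: if the vocabulary size is capped and slots are filled roughly by count, common English words can crowd out rare chemical names. A toy sketch with a made-up corpus and an artificially tiny cap (real trainers use subword merge statistics, not whole-word counts, but the pressure is similar):

```python
from collections import Counter

# Made-up corpus: mostly ordinary English, with a few rare chemical names
corpus = (
    ["the", "patient", "was", "given", "a", "dose"] * 50
    + ["acetylsalicylic"] * 3
    + ["metformin"] * 2
)

counts = Counter(corpus)
vocab_size = 5  # hypothetical cap on vocabulary size
vocab = {word for word, _ in counts.most_common(vocab_size)}

print("acetylsalicylic" in vocab)  # False — too rare to make the cut
print("the" in vocab)             # True
```

So even 18 M entries may not help a particular compound name unless it appears often enough relative to the surrounding English.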

[I am not an expert].