Domain adaptation of Language Model and Tokenizer

Hey everyone,

I’m sure this will be a fairly straightforward question for the veterans out there. I want to fine-tune a pre-trained MLM (BERT/RoBERTa) on my dataset. It’s a medical dataset, though, and I have over 3M documents I can use for fine-tuning.

If I want to start with one of the pre-trained RoBERTa models, there’s nothing I can do to re-train the tokenizer, right? And if I create a new tokenizer that fits my training dataset, I’ll lose the value of using the pre-trained models, right?

Is it possible to take the existing tokenizer and expand it to incorporate the most common words in my dataset that aren’t already in the original tokenizer?


RoBERTa uses byte-level BPE, so I would expect the pre-trained tokenizer to do well on your task (and it likely isn’t that different from what you’d get by training your own). The tokens in byte-level BPE are more granular than words, so unless your dataset includes a ton of unusual Unicode characters, the words in your documents should be handled well by the tokenizer. And yes, newly learned encodings would not play nicely with the original RoBERTa model, so if you did choose to train an entirely new encoding scheme, you would need to train a new model too.
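If you want to see this concretely, here’s a quick sketch (the medical terms are arbitrary examples I picked; the actual splits will vary):

```python
from transformers import AutoTokenizer

# Sanity check: how does the stock RoBERTa tokenizer handle
# domain-specific medical terms it never saw as whole words?
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

for word in ["pneumothorax", "metformin", "echocardiogram"]:
    # Byte-level BPE falls back to subword pieces rather than <unk>,
    # so nothing is ever out-of-vocabulary -- just more fragmented.
    # The leading space matters: RoBERTa encodes word boundaries.
    print(word, "->", tokenizer.tokenize(" " + word))
```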
Just in case you weren’t already considering it: if your model is being trained on private medical information, you should be especially mindful of privacy. It’s an active field, but there has been some research on differential privacy in ML if you want to look into it.


Thanks for the confirmation. I’m currently having some issues with token classification where entity boundaries get split down the middle of words that probably aren’t in the original tokenizer’s vocabulary. I’m tinkering with different aggregation strategies, but I was wondering if there were techniques to add common words from my new dataset to the tokenizer, initialize their weights randomly, and train them in via fine-tuning. I know I can manually add words to my tokenizer, but I need to add a lot more than just one or two.

And the dataset does not contain any PHI. We’re a very PHI-conscious team, but thanks for the heads up!

You can definitely add the new words as new tokens to the tokenizer, and you can definitely train the embeddings of those new tokens via fine-tuning. I haven’t seen how to manually initialize them, but I imagine it’s also possible. Something like the sketch below might be helpful (untested; the new words are placeholders, and initializing from the mean of the existing embeddings is just my guess at a reasonable heuristic):
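```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Placeholder domain words that are missing from the original vocab
new_words = ["pneumothorax", "metformin"]
num_added = tokenizer.add_tokens(new_words)

# Grow the embedding matrix to make room for the new tokens
model.resize_token_embeddings(len(tokenizer))

# Manual initialization: set each new embedding to the mean of the
# existing ones instead of keeping the random init. (Heuristic, not
# something I've verified against a paper.)
if num_added > 0:
    with torch.no_grad():
        emb = model.get_input_embeddings().weight
        emb[-num_added:] = emb[:-num_added].mean(dim=0)
```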

Sorry I don’t have more to add; hopefully someone with more experience chimes in!

All good! I did some more thought experimentation and I think the best approach would be:

  1. Train a tokenizer from scratch on my dataset
  2. Print out the vocabulary from the original tokenizer and the new tokenizer
  3. Find the differences between the two vocabularies
  4. Load the original tokenizer and manually add all the new words from the new vocabulary

I just need to figure out how to do this with the byte-level BPE tokenizer that RoBERTa uses; I’ve put a rough sketch of the idea below. That said, I was able to solve one of the issues I was having with the RoBERTa tokenizer and NER preds cutting words apart, so this is less of a concern for me now!
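Roughly something like this (untested; corpus.txt is a placeholder path, and note that add_tokens() treats each new word as an atomic token rather than extending the BPE merges):

```python
from transformers import AutoTokenizer
from tokenizers import ByteLevelBPETokenizer

# 1. Train a new byte-level BPE tokenizer on the domain corpus
new_tok = ByteLevelBPETokenizer()
new_tok.train(files=["corpus.txt"], vocab_size=50_000)

# 2./3. Dump both vocabularies and diff them
orig_tok = AutoTokenizer.from_pretrained("roberta-base")
novel = set(new_tok.get_vocab()) - set(orig_tok.get_vocab())

# 4. Manually add the missing pieces to the original tokenizer
orig_tok.add_tokens(sorted(novel))
# (The model's embedding matrix then needs resizing to match.)
```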

How did you solve the problem of the RoBERTa tokenizer splitting apart NER preds? I have domain-adapted a Swedish BERT to historical text, and I have the same problem of NER preds being split apart when fine-tuning on a standard NER dataset.

Hey, sorry, I just logged back in and saw this. I’m using the pipeline() function for inference, so I’m able to leverage the built-in aggregation strategies. I had also previously implemented a similar function by hand that aggregates the NER preds for the cases where RoBERTa breaks them up.

I recommend taking a look at how pipeline() uses aggregation_strategy for combining NER predictions. That’s a great place to start. Don’t be like me and manually brute-force your own implementation haha.
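For example (the model name is a placeholder for your own fine-tuned checkpoint):

```python
from transformers import pipeline

# aggregation_strategy merges subword pieces back into word-level
# entities; "simple", "first", "average", and "max" resolve
# disagreements between a word's pieces differently, so it's worth
# trying a few.
ner = pipeline(
    "token-classification",
    model="my-roberta-ner",  # placeholder checkpoint name
    aggregation_strategy="simple",
)
print(ner("Patient was started on metformin for type 2 diabetes."))
```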

Hi, OK, thanks very much for the reply. I’ll look into the aggregation strategies of pipeline().