Hi there, I have a few simple questions about the tokenizer when performing further pretraining/fine-tuning of an MLM model. I would love to get some feedback from you. Thanks in advance!
TL;DR
Considering the task of further pretraining a model on a domain-specific dataset:
- How can I know if I need to perform any kind of customization to the original model’s tokenizer?
- What kind of update should I perform to the tokenizer?
- Is it possible to “further pretrain” the original tokenizer to specialize it on my dataset?
Context
I’m planning to further pretrain (a.k.a. fine-tune) a BERT language model on a domain-specific dataset in the same language. The general idea is to take the pretrained BERT model and specialize it on my dataset to increase performance on future downstream tasks, like text classification. For this, my plan is to use the example script provided by Hugging Face, which seems to be a popular and standard choice for this pretraining task.
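For concreteness, my understanding is that the script does roughly the following (a simplified sketch, assuming the script in question is run_mlm.py; the corpus file name is made up):

```python
# Simplified sketch of continued MLM pretraining, roughly what I believe
# run_mlm.py does under the hood. "domain_corpus.txt" is a placeholder
# for my domain-specific corpus (one document per line).
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load and tokenize the raw domain text.
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Standard masked-language-modeling objective (15% masking).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-domain-adapted", num_train_epochs=3),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```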
Questions
Because my domain-specific dataset is specialized, with many technical terms, it’s possible I’ll have to deal with the tokenizer. I don’t know whether I need to change the tokenizer and, if so, which changes I should make.
I noticed the example script assumes an already-defined tokenizer. The only thing it changes is one dimension of the model’s token embedding matrix, in case the original tokenizer has been modified (quoting from memory, so the exact lines may differ slightly):
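```python
# (Quoting the script from memory; the exact lines may differ.)
# The embedding matrix is only resized when the tokenizer's vocabulary
# has grown beyond the model's current embedding size.
embedding_size = model.get_input_embeddings().weight.shape[0]
if len(tokenizer) > embedding_size:
    model.resize_token_embeddings(len(tokenizer))
```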
Here comes my first question: how can I know if I need to perform any kind of customization to the original model’s tokenizer? Is there a method or metric to quantify whether the standard model’s tokenizer is appropriate for my dataset? Or is it sufficient to explore how the tokenizer behaves on my dataset (along the lines of the sketch below)?
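To show what I mean by “exploring how the tokenizer behaves”, my naive idea is something like the sketch below: count how many subword pieces an average word gets split into, and how often the unknown token appears. What I don’t know is whether these are the right quantities to look at, or what thresholds would indicate that the tokenizer needs changes.

```python
# Naive sketch of checking how well a pretrained tokenizer fits a corpus:
# average number of subword pieces per whitespace-separated word
# ("fertility") and the rate of unknown tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenizer_stats(texts):
    n_words = n_subwords = n_unk = 0
    for text in texts:
        words = text.split()
        n_words += len(words)
        for word in words:
            pieces = tokenizer.tokenize(word)
            n_subwords += len(pieces)
            n_unk += pieces.count(tokenizer.unk_token)
    return n_subwords / n_words, n_unk / n_words

# Placeholder sentences standing in for my domain-specific corpus.
fertility, unk_rate = tokenizer_stats(["example domain sentence", "another one"])
print(f"avg subwords per word: {fertility:.2f}, UNK rate: {unk_rate:.4f}")
```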
If I do need to update the tokenizer, what kind of update should I perform? In my understanding, if I train the tokenizer from scratch, all of the model’s current token embeddings become useless and part of my MLM training would be spent relearning those vectors, so I’m assuming retraining the tokenizer from scratch is not the right thing to do. Is it possible to “further pretrain” the original tokenizer to specialize it on my dataset? Something like the extension sketch below is what I have in mind.
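By “further pretrain the tokenizer” I mean something like extending it rather than retraining it: add domain-specific terms as new tokens and only grow the embedding matrix, so the original embeddings stay intact and only the new rows start from random initialization. A rough sketch (the domain terms here are made up):

```python
# Sketch of extending an existing tokenizer instead of retraining it.
# Existing vocabulary and embeddings are preserved; only new rows are added.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

new_tokens = ["myocarditis", "angioplasty"]  # hypothetical domain terms
added = tokenizer.add_tokens(
    [t for t in new_tokens if t not in tokenizer.get_vocab()]
)

if added > 0:
    # Grows the embedding matrix; new rows are randomly initialized,
    # existing rows keep their pretrained values.
    model.resize_token_embeddings(len(tokenizer))
```

Is this extension approach reasonable, or is there a more principled way to adapt the tokenizer?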
I think that covers all of my questions. Thanks in advance!