How to "further pretrain" a tokenizer (do I need to do so?)

Hi there, I have a few simple questions about the tokenizer when performing further pretraining/fine-tuning of an MLM model. I would love to get some feedback from you. Thanks in advance :hugs:

TL;DR

Considering the task of further pretraining a model on a domain-specific dataset:

  • How can I know if I need to perform any kind of customization to the original model’s tokenizer?
  • What kind of update should I perform to the tokenizer?
  • Is it possible to “further pretrain” the original tokenizer to specialize it on my dataset?

Context

I’m planning to further pretrain (a.k.a. fine-tune) a BERT language model on a domain-specific dataset in the same language. The general idea is to take the pretrained BERT model and specialize it on my dataset to increase performance on future downstream tasks, like text classification. For this, my plan is to use the example script provided by Hugging Face, which seems to be very popular and standard for this pretraining task.

Questions

Because my domain-specific dataset is specialized, with many technical words, it’s possible I’ll have to deal with the tokenizer. I don’t know whether I need to change the tokenizer and, if so, which changes I should make.

I noticed the example script assumes an already defined tokenizer. The only change the script makes to the model itself is resizing the token embedding matrix in case the original tokenizer has been modified (see the sketch below).
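If I read it correctly, the relevant logic is roughly the following (a sketch based on my reading of the run_mlm.py example; the checkpoint name here is just a placeholder):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# If tokens were added to the tokenizer, the embedding matrix has to grow to match;
# otherwise the model is left untouched.
embedding_size = model.get_input_embeddings().weight.shape[0]
if len(tokenizer) > embedding_size:
    model.resize_token_embeddings(len(tokenizer))
```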

Here comes my first question: how can I know if I need to perform any kind of customization to the original model’s tokenizer? Is there any way or metric to quantify whether the standard model’s tokenizer is appropriate for my dataset? Or is it sufficient to explore how the tokenizer behaves on my dataset?
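To make that last point concrete, this is the kind of exploration I have in mind (just a sketch I wrote for illustration; `corpus` stands in for my actual dataset): count how many subwords the tokenizer produces per word, and how often it falls back to [UNK].

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint

# Placeholder for an iterable of raw sentences from the domain-specific dataset.
corpus = [
    "Example sentence with domain-specific terminology such as immunohistochemistry.",
]

total_words, total_subwords, unk_count = 0, 0, 0
unk_id = tokenizer.unk_token_id
for sentence in corpus:
    for word in sentence.split():
        ids = tokenizer.encode(word, add_special_tokens=False)
        total_words += 1
        total_subwords += len(ids)
        unk_count += ids.count(unk_id)

print(f"average subwords per word: {total_subwords / total_words:.2f}")
print(f"fraction of [UNK] subwords: {unk_count / max(total_subwords, 1):.4f}")
```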

In case I do need to update the tokenizer, what kind of update should I perform? In my understanding, if I train the tokenizer from scratch, all of the current model’s token embeddings will be useless and part of my MLM training will be retraining these vectors. So I’m assuming retraining the tokenizer is not the right thing to do. Is it possible to “further pretrain” the original tokenizer to specialize it on my dataset?
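Just to be explicit about what I mean by training the tokenizer from scratch, I’m thinking of something like `train_new_from_iterator` (sketch only; the corpus and vocabulary size are placeholders):

```python
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint

# Placeholder for an iterator over the raw text of my domain-specific dataset.
corpus_iterator = iter(["some domain-specific text", "more domain-specific text"])

# This builds a brand-new vocabulary: the resulting token ids no longer line up
# with the rows of the original embedding matrix, which is exactly my concern above.
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus_iterator, vocab_size=30522)
```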

I think that covers all of my questions, thanks in advance :blush:


Hi! :wave:

From what I know, and from here, BERT’s vocabulary has 994 [unused#] tokens, # being 0-993. These are token ids 1-998, excluding 100, 101, 102, and 103, which are BERT’s special tokens. You could change those tokens and fine-tune BERT so it will pick up on the new tokens, without needing to pre-train BERT from scratch.
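Roughly, the idea could look like this (a sketch of a workaround, not an official API; I’m assuming the bert-base-uncased vocab layout, and the domain terms are just made-up examples):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# The [unused#] placeholders sit at fixed ids in the vocabulary.
print(tokenizer.convert_ids_to_tokens([1, 2, 99, 104, 998]))
# e.g. ['[unused0]', '[unused1]', '[unused98]', '[unused99]', '[unused993]']

# One low-tech way to repurpose them: save the vocab, rewrite the [unused#] lines
# with domain-specific terms, and load the tokenizer back from the edited files.
tokenizer.save_pretrained("bert-domain")
domain_terms = ["immunohistochemistry", "coagulopathy"]  # hypothetical terms
vocab_path = "bert-domain/vocab.txt"
with open(vocab_path, encoding="utf-8") as f:
    vocab = f.read().splitlines()
for i, term in enumerate(domain_terms):
    vocab[vocab.index(f"[unused{i}]")] = term
with open(vocab_path, "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")

# The new terms now occupy the old [unused#] ids, so all other ids are unchanged.
domain_tokenizer = BertTokenizer.from_pretrained("bert-domain")
```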


Hi there, thanks for the answer!

This is indeed a possible solution. One inconvenience would be choosing the top 994 words (and checking whether 994 slots are enough), but that’s solvable.

You said (also supported by your reference) that this could be done “without needing to pre-train BERT from scratch”, which raises a question:

If I train a new tokenizer from scratch, would I really need to pretrain BERT from scratch?

Yes, as far as I know. BERT relies on the fact that token id 12,476 is “awesome” and not something else. New tokenizer means new token ↔ id mapping, and suddenly token id 12,476 is no longer “awesome”, so BERT will need to go through all its pre-training data to learn the contexts of the new token 12,476.
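You can inspect that mapping directly; a quick sanity check (whatever string actually sits at that id in your checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The embedding row a token uses is fixed by this id mapping, so swapping in a
# new vocabulary silently re-routes every id to a different string.
token = tokenizer.convert_ids_to_tokens(12476)
print(token)                                            # the token stored at id 12,476
print(tokenizer.convert_tokens_to_ids(token) == 12476)  # True: the round trip holds
```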


Maybe there’s a way to train a new tokenizer while keeping the old tokenizer’s token-ID mappings? Have you seen something like this before?

Sorry, I haven’t…
