How to "further pretrain" a tokenizer (do I need to do so?)

Hi there, I have a few simple questions about the tokenizer when further pretraining/fine-tuning an MLM model. I would love to get some feedback from you. Thanks in advance :hugs:

TL;DR

Considering the task of further pretraining a model on a domain-specific dataset:

  • How can I know if I need to perform any kind of customization to the original model’s tokenizer?
  • What kind of update should I perform to the tokenizer?
  • Is it possible to “further pretrain” the original tokenizer to specialize it in my dataset?

Context

I’m planning to further pretrain (a.k.a. fine-tune) a BERT language model on a domain-specific dataset in the same language. The general idea is to take the pretrained BERT model and specialize it on my dataset to increase performance on future downstream tasks, like text classification. For this, my plan is to use the example script provided by Hugging Face, which seems to be very popular and standard for this pretraining task.

Questions

Because my domain-specific dataset is specialized, with many specific and technical words, it’s possible I’ll have to deal with the tokenizer. I don’t know whether I need to change the tokenizer and, if so, which changes I should make.

I noticed the example script assumes an already defined tokenizer. The only thing the script changes is one dimension of the model’s token embedding matrix, in case the original tokenizer has been modified.
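If I read it correctly, the relevant part looks roughly like this (paraphrased from memory, not a verbatim copy of the script):

```python
# Roughly what the example script does: enlarge the token embedding matrix
# only when the tokenizer has more tokens than the model currently supports.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

embedding_size = model.get_input_embeddings().weight.shape[0]
if len(tokenizer) > embedding_size:
    model.resize_token_embeddings(len(tokenizer))
```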

Here comes my first question: how can I know if I need to perform any kind of customization to the original model’s tokenizer? Is there any way or metric to quantify whether the standard model’s tokenizer is appropriate for my dataset? Or is it sufficient to explore how the tokenizer behaves on my dataset?
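For instance, I could check something like the following (my own rough heuristic, not an official metric; the corpus below is just a placeholder):

```python
# Rough heuristic: the more subword pieces per word, the worse the original
# vocabulary covers the domain-specific terms.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
corpus = ["myocarditis diagnosed via electrocardiogram", "placeholder domain sentence"]

n_words = n_pieces = n_split = 0
for doc in corpus:
    for word in doc.split():
        pieces = tokenizer.tokenize(word)
        n_words += 1
        n_pieces += len(pieces)
        n_split += len(pieces) > 1

print(f"average pieces per word: {n_pieces / n_words:.2f}")
print(f"share of words split into subwords: {n_split / n_words:.2%}")
```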

In case I do need to update the tokenizer, what kind of update should I perform? In my understanding, if I train the tokenizer from scratch, all of the current model’s token embeddings will be useless, and part of my MLM training will be spent retraining those vectors. So I’m assuming retraining the tokenizer from scratch is not the right thing to do. Is it possible to “further pretrain” the original tokenizer to specialize it in my dataset?
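Just to be explicit, by “training the tokenizer from scratch” I mean something along these lines (placeholder corpus; the vocab_size is just an example):

```python
# Sketch of training a new tokenizer from scratch on my corpus.
# This learns a brand-new WordPiece vocabulary, i.e. a new token <-> id mapping.
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
domain_corpus = ["placeholder domain sentence one", "placeholder domain sentence two"]

new_tokenizer = old_tokenizer.train_new_from_iterator(domain_corpus, vocab_size=30522)
new_tokenizer.save_pretrained("domain-tokenizer")
```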

I think that covers all of my questions, thanks in advance :blush:


2 Likes

Hi! :wave:

From what I know, and from here, BERT’s vocabulary has 994 [unused#] tokens, # being 0-993. These are token ids 1-998 excluding 100, 101, 102, 103 which are BERT’s special tokens. You could change those tokens and fine-tune BERT so it will pick up on the new tokens, without needing to pre-train BERT from scratch.
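A minimal sketch of that idea (my own illustration; the domain words are made up): overwrite some of the [unused#] entries in BERT’s vocab.txt, so the embedding matrix keeps its original size and all existing token ids stay valid.

```python
# Swap domain words into BERT's [unused#] vocab slots by editing vocab.txt.
# Because the slot ids are reused, model.resize_token_embeddings is not needed.
from pathlib import Path
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained("bert-domain")              # writes vocab.txt locally

domain_words = ["myocarditis", "electrocardiogram"]   # made-up examples
vocab_path = Path("bert-domain") / "vocab.txt"
lines = vocab_path.read_text(encoding="utf-8").splitlines()

words = iter(domain_words)
for i, line in enumerate(lines):
    if line.startswith("[unused"):
        try:
            lines[i] = next(words)                    # overwrite an unused slot
        except StopIteration:
            break
vocab_path.write_text("\n".join(lines) + "\n", encoding="utf-8")

# Reload: the new words now map to the old [unused#] ids.
tokenizer = BertTokenizer.from_pretrained("bert-domain")
```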

1 Like

Hi there, thanks for the answer!

This is a possible solution indeed. One inconvenience would be choosing the top 994 words (and checking whether that’s enough), but it’s solvable.
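For picking those words, I’d probably just count the most frequent whole words in my corpus that aren’t already in BERT’s vocabulary, roughly like this (placeholder corpus):

```python
# Rough sketch for choosing candidates for the ~994 [unused#] slots:
# the most frequent whole words in the domain corpus that BERT's vocab lacks.
from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
vocab = tokenizer.get_vocab()
corpus = ["placeholder domain sentence one", "placeholder domain sentence two"]

counts = Counter(
    word
    for doc in corpus
    for word in doc.lower().split()
    if word not in vocab
)
candidates = [word for word, _ in counts.most_common(994)]
print(candidates[:20])
```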

You said (also supported by your reference) that we could swap in new tokens and fine-tune BERT without needing to pre-train it from scratch. If I train a new tokenizer from scratch, would I really need to pretrain BERT from scratch?

Yes, as far as I know. BERT relies on the fact that token id 12,476 is “awesome” and not something else. New tokenizer means new token ↔ id mapping, and suddenly token id 12,476 is no longer “awesome”, so BERT will need to go through all its pre-training data to learn the contexts of the new token 12,476.

1 Like

Maybe there’s a way to train a new tokenizer while keeping the old tokenizer’s token-ID mappings? Have you seen something like this before?

Sorry, I haven’t…

1 Like