I’m using run_mlm.py to fine-tune a BERT model on the type of text I’ll be working with (legal text). In the process I want to add some words to the vocabulary, which I do with the tokenizer’s add_tokens function.
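For context, my understanding is that calling add_tokens on the tokenizer alone does not change the model; you also have to call model.resize_token_embeddings(len(tokenizer)), and the rows added for the new tokens start out randomly initialized. Here is a minimal PyTorch sketch of what that resize effectively does (toy sizes, not the real BERT matrix — the function name resize_embeddings is mine):

```python
import torch
import torch.nn as nn

def resize_embeddings(old_emb: nn.Embedding, new_num_tokens: int) -> nn.Embedding:
    """Toy version of resize_token_embeddings: keep the trained rows,
    leave the rows for newly added tokens randomly initialized."""
    old_num, dim = old_emb.weight.shape
    new_emb = nn.Embedding(new_num_tokens, dim)  # all rows start random
    with torch.no_grad():
        new_emb.weight[:old_num] = old_emb.weight  # copy existing embeddings
    return new_emb

# Pretend vocab of 10 tokens, hidden size 4; add 2 new tokens.
emb = nn.Embedding(10, 4)
bigger = resize_embeddings(emb, 12)
assert torch.equal(bigger.weight[:10], emb.weight)  # old rows unchanged
assert bigger.weight.shape == (12, 4)               # two fresh random rows
```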
My question is: if I use a BERT model with a tokenizer that contains a newly added token (listed in the added_tokens.json file), does it automatically generate a (random) new embedding for each added token?
Why am I asking? Because I’m running on Paperspace, which gives me a 6-hour runtime limit, so I ran a modified version of (transformers/run_mlm.py at main · huggingface/transformers · GitHub) several times; my version adds the new tokens when it loads the tokenizer, and each MLM training run resumes from the previous checkpoint. So I’m now wondering whether the newly added tokens’ embeddings were being trained from the very beginning of the full training process, or only from the last checkpoint.
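From what I understand, the embedding matrix is part of the model’s state_dict, so whatever the new tokens’ embeddings have learned should be saved with each checkpoint and restored on resume. A toy sketch of that save/restore round trip (an nn.Embedding standing in for the resized BERT embedding matrix):

```python
import io
import torch
import torch.nn as nn

emb = nn.Embedding(12, 4)      # stand-in for a resized embedding matrix
with torch.no_grad():
    emb.weight[10] = 1.0       # pretend row 10 (a new token) was trained

buf = io.BytesIO()
torch.save(emb.state_dict(), buf)  # "checkpoint"
buf.seek(0)

restored = nn.Embedding(12, 4)     # fresh, randomly initialized module
restored.load_state_dict(torch.load(buf))
assert torch.equal(restored.weight[10], emb.weight[10])  # training survives resume
```

If that holds for the Trainer checkpoints too, the new embeddings would accumulate training across all my 6-hour sessions rather than restarting from random each time.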