Hi,
I’m using run_mlm.py to fine-tune a BERT model on the type of text I’ll be working with (legal text). In the process I want to add some words to the vocabulary, which I do with the tokenizer’s add_tokens function.
My question is: if I use a BERT model with a tokenizer that contains newly added tokens (listed in the added_tokens.json file), does it automatically generate a new (randomly initialized) embedding for each added token?
Why am I asking? Because I’m on Paperspace, where I get at most 6 hours of runtime per session, I had to run a modified version of run_mlm.py (transformers/run_mlm.py at main · huggingface/transformers · GitHub) several times; the modification adds the new tokens when the tokenizer is loaded. Each MLM training run resumes from the previous checkpoint. So I’m now wondering whether the embeddings of the newly added tokens have been training since the very beginning of the full training process, or only since the last checkpoint.
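For context, here is roughly what my modification does around tokenizer loading (a simplified sketch, not the actual script; the model name and example legal terms are placeholders, and I’m assuming the usual add_tokens + resize_token_embeddings pattern):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholders: in my script these come from the run_mlm.py arguments / checkpoint path
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Add domain-specific (legal) words to the vocabulary; illustrative examples only
new_tokens = ["estoppel", "subrogation"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so each new token id gets its own row
# (as far as I understand, these new rows are randomly initialized)
model.resize_token_embeddings(len(tokenizer))
```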