Embeddings of added words


I’m using run_mlm.py to fine-tune a BERT model on the type of text I’ll be working with (legal text). In the process I want to add some words to the vocabulary, which I do with the tokenizer’s `add_tokens` function.

My question is: if I use a BERT model with a tokenizer that contains newly added tokens (listed in the added_tokens.json file), does it automatically generate a (randomly initialized) embedding for each newly added token?

Why am I asking? Because I ran a modified version of transformers/run_mlm.py (at main · huggingface/transformers · GitHub) several times (I’m using Paperspace, so I have a maximum of 6 hours of runtime per session); the modification adds the new tokens when the tokenizer is loaded. Each MLM training run started from the previous checkpoint. So I’m now wondering whether the embeddings of the newly added tokens were trained from the very beginning of the full training process, or only from the last checkpoint.
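For context, in the transformers library the usual pattern is `tokenizer.add_tokens([...])` followed by `model.resize_token_embeddings(len(tokenizer))`. Below is a minimal NumPy sketch (with made-up dimensions, not the actual library code) of what that resize effectively does: existing embedding rows are copied over unchanged, and the rows for the new tokens are randomly initialized. A checkpoint saved afterwards contains the resized matrix, so resuming from that checkpoint keeps whatever training those new rows have already received.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical original embedding matrix: BERT-base vocab of 30522 tokens, hidden size 768.
old_vocab, hidden = 30522, 768
old_embeddings = rng.normal(scale=0.02, size=(old_vocab, hidden))

# Suppose the tokenizer gained 3 new legal-domain tokens via add_tokens(...).
num_added = 3
new_vocab = old_vocab + num_added

# Resizing copies the old rows and randomly initializes the new ones
# (BERT's config uses a normal init with std 0.02 for fresh weights).
new_embeddings = np.empty((new_vocab, hidden))
new_embeddings[:old_vocab] = old_embeddings
new_embeddings[old_vocab:] = rng.normal(scale=0.02, size=(num_added, hidden))

# The pre-trained rows survive the resize untouched; only the new rows start random.
assert np.array_equal(new_embeddings[:old_vocab], old_embeddings)
print(new_embeddings.shape)
```

So as long as each run resumes from a checkpoint saved after the resize, the new tokens’ embeddings carry over their accumulated training rather than being re-initialized.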

A word embedding is a learned representation of text in which words with similar meanings have similar representations. This approach to representing words and documents may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models.
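As a toy illustration of “semantically similar inputs close together” (using hand-picked 3-dimensional vectors, not real learned embeddings, which for BERT would be 768-dimensional), cosine similarity is the usual closeness measure:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors: 1.0 means same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-crafted toy "embeddings" for illustration only.
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.85, 0.75, 0.2])
apple = np.array([0.1, 0.2, 0.95])

# Related words should score higher than unrelated ones.
assert cosine_similarity(king, queen) > cosine_similarity(king, apple)
```

A freshly added token starts with a random vector, so it lands at an arbitrary point in this space; only MLM training moves it near semantically related tokens.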