Embeddings of added words

miguelwon · September 5, 2022, 1:34pm

Hi,

I’m using run_mlm.py to fine-tune a BERT model on the type of text I’ll be working with (legal text). In the process I want to add some words to the vocabulary, which works fine with the tokenizer’s add_tokens function.

My question is: if I use a BERT model with a tokenizer that contains newly added tokens (listed in the added_tokens.json file), does it automatically generate a new (randomly initialized) embedding for each added token?

Why am I asking? I ran a modified version of run_mlm.py (from the huggingface/transformers repository) several times, because I’m using Paperspace and only get a maximum of 6 hours of runtime per session. The modified script adds the new tokens when it loads the tokenizer, and each time I ran the MLM training it started from the previous checkpoint. So I’m now wondering whether the embeddings of the newly added tokens were trained from the very beginning of the full training process, or only from the last checkpoint.

anon75331526 · September 9, 2022, 11:50am

A word embedding is a learned representation for text in which words with similar meanings have similar representations. This approach to representing words and documents may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems. Embeddings make it easier to do machine learning on large inputs such as sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models.
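
On the original question, a minimal sketch may help. It assumes a tiny, randomly initialised BertForMaskedLM built from a made-up BertConfig (so nothing is downloaded), and it uses new_vocab_size as a stand-in for len(tokenizer) after add_tokens has added three tokens. It only illustrates two things: resize_token_embeddings keeps the existing rows and appends new ones (how the new rows are initialised can depend on the transformers version), and a checkpoint saved after the resize already contains those rows. Whether a modified run_mlm.py actually calls resize_token_embeddings is something to check in the script itself.

```python
import tempfile

import torch
from transformers import BertConfig, BertForMaskedLM

torch.manual_seed(0)

# Tiny, randomly initialised BERT (hypothetical sizes) so no download is needed.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertForMaskedLM(config)

# Snapshot of the embedding matrix before any tokens are added.
old_weights = model.get_input_embeddings().weight.detach().clone()

# Assumption: stand-in for len(tokenizer) after tokenizer.add_tokens([...])
# has added 3 new tokens.
new_vocab_size = config.vocab_size + 3
model.resize_token_embeddings(new_vocab_size)

new_weights = model.get_input_embeddings().weight.detach().clone()

# Existing rows are preserved; only the 3 appended rows are newly initialised.
print("shape after resize:", tuple(new_weights.shape))
print("old rows unchanged:", torch.equal(new_weights[:100], old_weights))

# Save a checkpoint after the resize and load it again, as a resumed run would.
with tempfile.TemporaryDirectory() as ckpt_dir:
    model.save_pretrained(ckpt_dir)
    reloaded = BertForMaskedLM.from_pretrained(ckpt_dir)
    reloaded_weights = reloaded.get_input_embeddings().weight.detach().clone()

# The added-token rows come back from the checkpoint instead of being re-created.
print("rows restored from checkpoint:", torch.equal(reloaded_weights, new_weights))
```

If the same holds in your runs, the added-token embeddings would be created once (at the first resize), saved with each checkpoint, and then continue training from whatever values the checkpoint holds. To confirm this, compare the embedding matrix size in your first checkpoint with len(tokenizer).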
