I am working on a corpus of text in which I have injected instances of a special string: “[CHARACTER]”.
I now want to train a BERT classification model on this pre-processed corpus, so I instantiate a BertTokenizer and pass it my special token:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', additional_special_tokens=['[CHARACTER]'])
This outputs the message:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
So my question is: how do I then go ahead and fine-tune the word embedding for this added special token?
I assume this is a different process altogether than training the classifier itself? Could someone please enlighten me as to what needs to be done? Thanks.