How to train the embeddings of special tokens?

I would like to add some special tokens and train their embeddings. For instance, this is an example BERT input: "[CLS] this is a special token special_token [SEP] The special token is 'special_token'". I guess I should use something like tokenizer.additional_special_tokens('special_token') to add the special token. How do I train the embedding of that token?

Thank you

There are 2 things you need to do in order to train additional special tokens:

  1. Add new tokens to the tokenizer.

You can either add “regular” tokens, as follows:

tokenizer.add_tokens(['newWord', 'newWord2'])

Or you can add them as special tokens (similar to [CLS] and [SEP]) by passing the additional argument special_tokens=True. This is equivalent to calling tokenizer.add_special_tokens, which accepts a dictionary rather than a list:

special_tokens_dict = {'additional_special_tokens': ['[C1]','[C2]','[C3]','[C4]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

  2. Resize the token embedding matrix of the model so that it matches the tokenizer:

model.resize_token_embeddings(len(tokenizer))
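
Putting both steps together, here is a minimal self-contained sketch. It assumes the standard bert-base-uncased checkpoint, and the [C1]–[C4] token names are just placeholders:

from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Step 1: register the new special tokens with the tokenizer
special_tokens_dict = {'additional_special_tokens': ['[C1]', '[C2]', '[C3]', '[C4]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

# Step 2: resize the embedding matrix; the new rows are randomly
# initialized and get trained during fine-tuning
model.resize_token_embeddings(len(tokenizer))

# The new tokens are now kept whole instead of being split into subwords
print(tokenizer.tokenize('[C1] some text [C2]'))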

Next, you can fine-tune your model on your custom dataset; because the new tokens appear in your inputs, their embeddings are updated by backpropagation like any other parameter, as shown in the sketch below.
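
Here is a minimal sketch of a single training step, continuing from the snippet above. The toy sentence, learning rate, and use of the masked-LM head are placeholder choices; for real training you would iterate over your own dataset, typically with the Trainer API and DataCollatorForLanguageModeling for proper random masking:

import torch

# Toy input containing one of the new tokens; labels equal to the
# input IDs give a simple reconstruction-style MLM loss
inputs = tokenizer('The special token is [C1]', return_tensors='pt')
labels = inputs['input_ids'].clone()

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
loss = model(**inputs, labels=labels).loss
loss.backward()    # gradients flow into the new embedding rows
optimizer.step()
optimizer.zero_grad()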
