How to train the embeddings of special tokens?

I would like to add some special tokens and train their embeddings. For instance, this is an example BERT input: "[CLS] this is a special token special_token [SEP] The special token is 'special_token'". I guess I should use something like tokenizer.additional_special_tokens('special_token') to add the special token. How do I train the embedding of that token?

Thank you

There are 2 things you need to do in order to train additional special tokens:

  1. Add new tokens to the tokenizer.

You can either add “regular” tokens, as follows:

tokenizer.add_tokens(['newWord', 'newWord2'])

Or you can add them as special tokens (similar to [CLS] and [SEP]) by passing the additional argument special_tokens=True. This is equivalent to calling tokenizer.add_special_tokens, which accepts a dictionary rather than a list:

special_tokens_dict = {'additional_special_tokens': ['[C1]','[C2]','[C3]','[C4]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

  2. Resize the token embedding matrix of the model so that it matches the tokenizer:

model.resize_token_embeddings(len(tokenizer))
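
Putting both steps together, here is a minimal self-contained sketch. It assumes the standard bert-base-uncased checkpoint, and the [C1]–[C4] token names are just placeholders:

from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Step 1: register the new special tokens with the tokenizer
special_tokens_dict = {'additional_special_tokens': ['[C1]', '[C2]', '[C3]', '[C4]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

# Step 2: resize the embedding matrix; the new rows are randomly
# initialized and get trained during fine-tuning
model.resize_token_embeddings(len(tokenizer))

# The new tokens are now kept whole instead of being split into subwords
print(tokenizer.tokenize('[C1] some text [C2]'))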

Next, you can fine-tune your model on your custom dataset; because the new tokens appear in your inputs, their embeddings are updated by backpropagation like any other parameter, as shown in the sketch below.
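
Here is a minimal sketch of a single training step, continuing from the snippet above. The toy sentence, learning rate, and use of the masked-LM head are placeholder choices; for real training you would iterate over your own dataset, typically with the Trainer API and DataCollatorForLanguageModeling for proper random masking:

import torch

# Toy input containing one of the new tokens; labels equal to the
# input IDs give a simple reconstruction-style MLM loss
inputs = tokenizer('The special token is [C1]', return_tensors='pt')
labels = inputs['input_ids'].clone()

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
loss = model(**inputs, labels=labels).loss
loss.backward()    # gradients flow into the new embedding rows
optimizer.step()
optimizer.zero_grad()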
