I would like to add some special tokens and train those tokens. For instance, this is an input example for BERT: "[CLS] this is a special token special_token [SEP] The special token is 'special_token'". I guess I should use a function like tokenizer.additional_special_tokens('special_token') to add the special token. How do I train the embedding of that token?
There are 2 things you need to do in order to train additional special tokens:
Add new tokens to the tokenizer.
You can either add “regular” tokens, as follows:
tokenizer.add_tokens(['newWord', 'newWord2'])
Or you can add them as special tokens (similar to [CLS] and [SEP]) by passing the additional argument special_tokens=True to add_tokens. This is equivalent to calling tokenizer.add_special_tokens, which accepts a dictionary rather than a list.
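For example, here is a minimal sketch using the 'special_token' from your question (bert-base-uncased is just a placeholder checkpoint; any pretrained tokenizer works the same way):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # placeholder checkpoint

# Register 'special_token' as a special token so the tokenizer never splits it
num_added = tokenizer.add_tokens(['special_token'], special_tokens=True)

# Equivalent dictionary form:
# num_added = tokenizer.add_special_tokens({'additional_special_tokens': ['special_token']})

Both calls return the number of tokens that were actually added to the vocabulary, which you will need for the second step.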