I have an English dataset whose vocabulary contains some words that may be missing from the standard vocabulary used by RobertaTokenizer, so I’d like to add extra tokens to the tokenizer. I’d like to avoid training the tokenizer from scratch, since in that case I wouldn’t be able to fine-tune the pretrained RoBERTa model on top of it.
Since I do not know in advance the full list of tokens I’d like to add, I thought I could do the following:
Train a tokenizer from scratch on my new dataset, then look at the resulting vocab file and add all the new tokens (those that do not exist in the standard RobertaTokenizer vocab) via
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
tokenizer.add_tokens(list_of_new_tokens, special_tokens=True)
(as described here: Huggingface BERT Tokenizer add new token - Stack Overflow).
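To make the vocabulary comparison described above concrete, here is a minimal sketch in plain Python. It assumes both vocabularies are available as token-to-id dicts; the helper names `surface_form` and `find_new_tokens` and the toy vocabularies are hypothetical, for illustration only. It also assumes that byte-level BPE entries mark a leading space with "Ġ" and that this marker should be stripped before the tokens are added, since added tokens are matched against raw text.

```python
# Assumption: byte-level BPE vocab entries use this character for a leading space.
SPACE_MARKER = "Ġ"


def surface_form(token):
    """Strip the leading-space marker so tokens are compared as plain text."""
    return token.lstrip(SPACE_MARKER)


def find_new_tokens(base_vocab, custom_vocab):
    """Return surface forms present in custom_vocab but absent from base_vocab.

    Both arguments are token -> id dicts. Results follow the id order of
    custom_vocab and contain no duplicates.
    """
    known = {surface_form(tok) for tok in base_vocab}
    new_tokens = []
    seen = set()
    for tok in sorted(custom_vocab, key=custom_vocab.get):
        form = surface_form(tok)
        if form and form not in known and form not in seen:
            seen.add(form)
            new_tokens.append(form)
    return new_tokens


# Toy stand-ins (hypothetical) for the pretrained and the freshly trained vocab.
base_vocab = {"<s>": 0, "</s>": 2, "the": 5, "Ġthe": 6, "Ġcell": 3551}
custom_vocab = {
    "<s>": 0,
    "</s>": 1,
    "Ġthe": 2,
    "Ġcell": 3,
    "Ġmitochondria": 4,
    "ribosome": 5,
    "Ġribosome": 6,
}

list_of_new_tokens = find_new_tokens(base_vocab, custom_vocab)
print(list_of_new_tokens)  # ['mitochondria', 'ribosome']
```

With real tokenizers, the two dicts would presumably come from each tokenizer's get_vocab(). After calling add_tokens, the model's embedding matrix also has to grow to match, e.g. model.resize_token_embeddings(len(tokenizer)), and the new rows start untrained. A vocab trained from scratch will also contain many subword fragments, so the resulting list likely needs filtering (for example by frequency) before use.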
Does this approach make sense, or am I missing something? Is there a better way to approach this issue?