I stumbled on the same issue some time ago. I'm no Hugging Face expert, but here is what I dug up.
The bad news is that a BPE tokenizer "learns" how to split text into tokens (a token may correspond to a full word or only part of one), and I don't think there is any clean way to add vocabulary after that training is done.
Unfortunately, the proper way would therefore be to train a new tokenizer, which makes transfer learning almost useless.
Now, on to the hacks!
Why can't we just add new words at the end of the vocab file? Because that changes the output shape of your RoBERTa model, so fine-tuning then requires loading the whole pretrained model except for the last layer. Not a trivial task, but nothing outrageous: load the full model, delete the last layer, add the same layer with your new vocab size, and save the model.
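For what it's worth, here is a rough sketch of that surgery on a RoBERTa masked-LM checkpoint. Attribute names like `lm_head.decoder` are transformers internals that may differ across versions, and `NEW_TOKEN_COUNT` is a placeholder I made up; in practice `model.resize_token_embeddings()` automates all of this.

```python
# Rough sketch: grow the embedding matrix and the MLM head of a pretrained
# RoBERTa to a larger vocab, keeping the pretrained rows. Not an official
# recipe; internals may differ between transformers versions.
import torch
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM.from_pretrained("roberta-base")

old_emb = model.get_input_embeddings()             # Embedding(old_vocab, hidden)
old_vocab, hidden = old_emb.weight.shape
NEW_TOKEN_COUNT = 100                              # e.g. 100 new domain words
new_vocab = old_vocab + NEW_TOKEN_COUNT

# Build a larger embedding matrix and copy the pretrained rows into it.
new_emb = torch.nn.Embedding(new_vocab, hidden)
new_emb.weight.data[:old_vocab] = old_emb.weight.data
model.set_input_embeddings(new_emb)

# The MLM head projects back onto the vocabulary, so its bias must grow too
# (its weight gets re-tied to the new embeddings below).
old_dec = model.lm_head.decoder                    # Linear(hidden, old_vocab)
new_dec = torch.nn.Linear(hidden, new_vocab)
new_dec.bias.data[:old_vocab] = old_dec.bias.data
model.lm_head.decoder = new_dec
model.lm_head.bias = new_dec.bias                  # keep the head's bias in sync

model.config.vocab_size = new_vocab
model.tie_weights()                                # re-tie decoder weight to embeddings
model.save_pretrained("./roberta-extended-vocab")
```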
Another possible hack would be to keep the same vocab size but replace unused tokens (some Chinese characters, accents you don't use, etc.) with your additional vocabulary. It is a bit harder, as you first have to locate those unused tokens.
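If you want to go hunting for such slots, here is a hedged illustration using bert-base-uncased, whose vocab ships with explicit `[unusedN]` placeholder tokens; for RoBERTa you would have to look for tokens from scripts or accents you never use instead.

```python
# List vocab entries that are safe candidates to overwrite with domain words.
# bert-base-uncased is used here only because it has explicit [unusedN] slots.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
vocab = tokenizer.get_vocab()                      # token string -> id

unused = {tok: idx for tok, idx in vocab.items() if tok.startswith("[unused")}
print(f"{len(unused)} reusable slots, first few ids:", sorted(unused.values())[:5])

# To repurpose a slot, you would edit vocab.txt so that e.g. the line holding
# "[unused0]" becomes your domain word, then reload the tokenizer from that file.
```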
As for me, I just trained a small BERT from scratch, without transfer learning. I would advise doing that only if your domain-specific English is restricted in vocabulary and grammar and distinctly different from everyday English.
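For that route, here is a minimal sketch of the first step using the tokenizers library; the corpus path, vocab size and model dimensions are placeholders I picked for a small domain model, not values from my actual setup.

```python
# Train a small WordPiece tokenizer on your own corpus, then build a small
# BERT with a matching vocab size to pretrain with MLM.
from tokenizers import BertWordPieceTokenizer
from transformers import BertConfig, BertForMaskedLM

tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["my_domain_corpus.txt"],   # placeholder path to your raw text
    vocab_size=16000,                 # small vocab for a restrained domain
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("./my-small-bert-tokenizer")

config = BertConfig(vocab_size=16000, hidden_size=256,
                    num_hidden_layers=4, num_attention_heads=4,
                    intermediate_size=1024)
model = BertForMaskedLM(config)       # pretrain with your usual MLM pipeline
```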
=> Well, isn't that exactly what we do when we run tokenizer.add_tokens(['word1', 'word2']) and then model.resize_token_embeddings(len(tokenizer))? That is, only update the shape of the last layer?
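That does look like the same idea packaged into two calls. A quick sketch (the words are placeholders; note the new embedding rows are randomly initialized, so they still need fine-tuning on in-domain text before they carry any useful signal):

```python
# Add new tokens to the tokenizer and grow the model's embeddings/LM head to match.
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

num_added = tokenizer.add_tokens(["word1", "word2"])
model.resize_token_embeddings(len(tokenizer))      # grows embeddings and output layer

print(num_added, "tokens added; new vocab size:", len(tokenizer))
print(tokenizer.tokenize("word1"))                 # now a single token
```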