How to properly add new vocabulary to BPE tokenizers (like Roberta)?

Thanks a lot, Dan!

=> Well, isn’t that exactly what we do when we run tokenizer.add_tokens(['word1', 'word2']) and then model.resize_token_embeddings(len(tokenizer))? Does that only update the shape of the last layer?
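
Just to make sure we are talking about the same thing, here is a minimal sketch of what I mean (roberta-base and the two dummy words are only placeholders, not my real setup):

```python
# Minimal sketch: add new whole-word tokens to a RoBERTa tokenizer,
# then resize the model's embedding matrix to match the new vocab size.
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

# 'word1'/'word2' are placeholder domain terms.
num_added = tokenizer.add_tokens(["word1", "word2"])
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")

# This only grows the (tied) embedding matrix; the new rows are
# randomly initialized and still need fine-tuning on domain data.
model.resize_token_embeddings(len(tokenizer))
```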

=> As BERT uses a WordPiece tokenization (instead of RoBERTa's BPE), why didn't you / couldn't we run a spaCy tokenization over the domain corpus first, before adding the new tokens to the vocabulary? I found this example (see Annex at the bottom) where that is what they do, so I am considering switching from RoBERTa to BERT for this: NLP | How to add a domain-specific vocabulary (new tokens) to a subword tokenizer already trained like BERT WordPiece | by Pierre Guillou | Medium
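
If it helps, here is a rough sketch of how I understand that article's idea (the corpus, the frequency thresholds and the en_core_web_sm spaCy model are my own assumptions for illustration, not taken from the article):

```python
# Rough sketch: use spaCy to find frequent domain terms, then add the ones
# that BERT's WordPiece vocabulary would otherwise split into sub-tokens.
from collections import Counter

import spacy
from transformers import BertModel, BertTokenizer

nlp = spacy.load("en_core_web_sm")  # assumed spaCy model
corpus = ["placeholder domain document one", "placeholder document two"]

# Count candidate terms, skipping stop words and non-alphabetic tokens.
counts = Counter(
    tok.text.lower()
    for doc in nlp.pipe(corpus)
    for tok in doc
    if tok.is_alpha and not tok.is_stop
)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Keep only frequent terms that the current WordPiece vocab splits up.
new_tokens = [
    term for term, count in counts.most_common(1000)
    if count >= 5 and len(tokenizer.tokenize(term)) > 1
]

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
```

The same add_tokens / resize_token_embeddings step applies either way; the spaCy pass is only there to select which domain terms are worth adding.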