Thanks a lot Dan!
=> Well, isn't that exactly what we do when we run tokenizer.add_tokens(['word1', 'word2'])
and then model.resize_token_embeddings(len(tokenizer))?
Doesn't that only update the shape of the last layer?
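To make sure we are talking about the same thing, here is a minimal sketch of that two-step recipe (model name and the 'word1'/'word2' placeholders are just examples, not from your setup):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Placeholder domain words -- replace with the real new vocabulary
new_tokens = ["word1", "word2"]

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# add_tokens only inserts tokens that are not already in the vocabulary
num_added = tokenizer.add_tokens(new_tokens)
print(f"added {num_added} tokens")

# grow the embedding matrix so the new token ids get (randomly initialized) rows
model.resize_token_embeddings(len(tokenizer))
```

As far as I understand, this only reshapes the embedding matrix; the new rows still have to be learned during further pre-training / fine-tuning.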
=> Since BERT uses WordPiece tokenization (instead of RoBERTa's BPE), why didn't you / couldn't we run a spaCy tokenization over the domain corpus first, and only then add the resulting words to the vocabulary? I found an example (see Annex at the bottom) where this is what they do, and I am considering switching from RoBERTa to BERT for this: NLP | How to add a domain-specific vocabulary (new tokens) to a subword tokenizer already trained like BERT WordPiece | by Pierre Guillou | Medium
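For what it's worth, here is a rough sketch of what I understand that approach to be: use spaCy to tokenize the domain corpus, count frequent words that the BERT WordPiece tokenizer currently splits into several pieces, and add those to the vocabulary. The corpus, model name, and frequency thresholds below are assumptions for illustration, not taken from the article:

```python
from collections import Counter

import spacy
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hypothetical domain corpus -- replace with your own documents
corpus = ["first domain document ...", "second domain document ..."]

# spaCy is only used to split the raw text into words and count them
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
counts = Counter()
for doc in nlp.pipe(corpus):
    counts.update(tok.text.lower() for tok in doc if tok.is_alpha)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# keep frequent words that WordPiece currently breaks into several subwords
candidates = [
    word for word, count in counts.most_common(1000)
    if count >= 5 and len(tokenizer.tokenize(word)) > 1
]

tokenizer.add_tokens(candidates)
model.resize_token_embeddings(len(tokenizer))
```

So in the end it still comes back to add_tokens + resize_token_embeddings; spaCy is only there to decide which whole words are worth adding.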