Thanks a lot Dan!
=> Well, isn't that exactly what we do when we run tokenizer.add_tokens(['word1', 'word2'])
and then model.resize_token_embeddings(len(tokenizer))?
Doesn't that only update the shape of the last layer?
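To make sure we are talking about the same thing, here is a minimal sketch of that two-step recipe (model name and the 'word1'/'word2' placeholders are just examples, not from your setup):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Placeholder domain words -- replace with the real new vocabulary
new_tokens = ["word1", "word2"]

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# add_tokens only inserts tokens that are not already in the vocabulary
num_added = tokenizer.add_tokens(new_tokens)
print(f"added {num_added} tokens")

# grow the embedding matrix so the new token ids get (randomly initialized) rows
model.resize_token_embeddings(len(tokenizer))
```

As far as I understand, this only reshapes the embedding matrix; the new rows still have to be learned during further pre-training / fine-tuning.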
=> Since BERT uses WordPiece tokenization (instead of RoBERTa's BPE), why didn't you / couldn't we run a spaCy tokenization over the domain corpus first, and only then add the resulting words to the vocabulary? I found an example (see Annex at the bottom) where this is what they do, and I am considering switching from RoBERTa to BERT for this: NLP | How to add a domain-specific vocabulary (new tokens) to a subword tokenizer already trained like BERT WordPiece | by Pierre Guillou | Medium
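For what it's worth, here is a rough sketch of what I understand that approach to be: use spaCy to tokenize the domain corpus, count frequent words that the BERT WordPiece tokenizer currently splits into several pieces, and add those to the vocabulary. The corpus, model name, and frequency thresholds below are assumptions for illustration, not taken from the article:

```python
from collections import Counter

import spacy
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hypothetical domain corpus -- replace with your own documents
corpus = ["first domain document ...", "second domain document ..."]

# spaCy is only used to split the raw text into words and count them
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
counts = Counter()
for doc in nlp.pipe(corpus):
    counts.update(tok.text.lower() for tok in doc if tok.is_alpha)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# keep frequent words that WordPiece currently breaks into several subwords
candidates = [
    word for word, count in counts.most_common(1000)
    if count >= 5 and len(tokenizer.tokenize(word)) > 1
]

tokenizer.add_tokens(candidates)
model.resize_token_embeddings(len(tokenizer))
```

So in the end it still comes back to add_tokens + resize_token_embeddings; spaCy is only there to decide which whole words are worth adding.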