How to properly add new vocabulary to BPE tokenizers (like RoBERTa)?

I would like to fine-tune RoBERTa on a domain-specific English-based vocabulary.

For that, I ran TF-IDF on a corpus of mine and extracted 500 words that are not yet in the RoBERTa tokenizer's vocabulary.

As they represent only about 1 percent of the tokenizer's vocabulary size, I don't want to train the tokenizer from scratch.

So I just did:

tokenizer.add_tokens(['word1', 'word2'])
model.resize_token_embeddings(len(tokenizer))
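
For context, here is a minimal runnable version of that flow (a sketch only; roberta-base is just an example checkpoint and the added words are placeholders):

from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# add_tokens returns how many of the given tokens were actually new to the vocabulary
num_added = tokenizer.add_tokens(["word1", "word2"])

# grow the embedding matrix to match the new vocabulary size
model.resize_token_embeddings(len(tokenizer))
print(num_added, len(tokenizer))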

But I see two problems related to BPE:

  1. My added words are not split into sub-words. Is that fine?
  2. There are no " Ġ " (\u0120) prefixes in my list of added words. Should I add them manually? (See the sketch just below for what Ġ encodes.)
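
To illustrate what the Ġ encodes: in RoBERTa's byte-level BPE, Ġ is the byte-to-unicode mapping of a leading space, so it marks tokens that start a new word rather than continue one. A small sketch (roberta-base again as the example checkpoint):

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# Ġ marks a token preceded by a space; the first word of the string gets none by default
print(tokenizer.tokenize("hello world"))   # e.g. ['hello', 'Ġworld']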

I should add that I could not find any precise answer to this question (which many of us have): see Usage of Ġ in BPE tokenizer · Issue #4786 · huggingface/transformers · GitHub

Hello Pataleros,

I stumbled on the same issue some time ago. I am no Hugging Face expert, but here is what I dug up.

The bad news is that a BPE tokenizer "learns" how to split text into tokens (a token may correspond to a full word or only part of one), and I don't think there is any clean way to add vocabulary after the training is done.

Therefore, unfortunately, the proper way would be to train a new tokenizer, which makes transfer learning almost useless.

Now to the hacks!
Why can't we just add some words at the end of the vocab file? Because that changes the output shape of your RoBERTa model, and fine-tuning then requires loading all of your pretrained model except for the last layer. Not a trivial task, but nothing outrageous (load the full model, delete the last layer, add the same layer with your new vocab size, save the model).
Another possible hack would be to keep the same vocab size but replace unused tokens (some Chinese characters, or accented characters you don't use, etc.) with your additional vocabulary. It is a bit harder because you have to locate those unused tokens.

As for me, I just trained a small BERT from scratch, without transfer learning. I would advise doing that only if your domain-specific English is restricted in vocabulary and grammar, and distinctly different from usual English.

Best of luck to you!

Dan


Thanks a lot, Dan!

=> Well, isn't that exactly what we do when we run tokenizer.add_tokens(['word1', 'word2']) and then model.resize_token_embeddings(len(tokenizer))? Only updating the shape of the last layer?
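
From what I can see, resize_token_embeddings keeps the pretrained rows and only appends freshly initialised ones, which is why I thought it was enough. A quick sketch to check that (roberta-base is just the example checkpoint I use):

import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

old = model.get_input_embeddings().weight.detach().clone()

tokenizer.add_tokens(["word1", "word2"])      # placeholder domain words
model.resize_token_embeddings(len(tokenizer))

new = model.get_input_embeddings().weight.detach()
# the pretrained rows are copied over unchanged; only the appended rows are new
print(torch.equal(old, new[: old.shape[0]]))  # expected: True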

=> Since BERT uses WordPiece tokenization (instead of RoBERTa's BPE), why didn't you / couldn't we use a spaCy + WordPiece tokenization before adding to the vocabulary? I found this example (see the annex at the bottom) where this is what they do. I am considering switching from RoBERTa to BERT for this. NLP | How to add a domain-specific vocabulary (new tokens) to a subword tokenizer already trained like BERT WordPiece | by Pierre Guillou | Medium
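
For comparison, here is a small sketch of how BERT's WordPiece marks sub-words (bert-base-uncased as an example checkpoint): continuation pieces get a "##" prefix instead of RoBERTa's leading-space Ġ, and add_tokens works the same way there.

from transformers import BertTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
# WordPiece marks word continuations with "##" instead of a leading-space Ġ
print(bert_tok.tokenize("tokenization"))   # e.g. ['token', '##ization']

# adding whole-word domain tokens works just like with RoBERTa
bert_tok.add_tokens(["word1", "word2"])    # placeholder domain words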

I don't remember any of those arcane functions :rofl: either they didn't exist a year ago or I did a poor job searching for them. I have to give them a try!

I definitely chose BERT because the tokenizing part was easier for me.
