Best way to extend vocabulary of pretrained model?

marton-avrios · October 12, 2020, 12:54pm

What would be the best way to “mix” a SentencePiece vocabulary trained on a corpus of English and German documents with the existing English-only vocabulary of a pretrained transformer? The idea is to take the pretrained model (say English BERT, though I know it uses WordPiece), build a new mixed vocabulary, and then finetune on my mixed-language downstream task.

rgwatwormhill · October 12, 2020, 4:22pm

[I am not an expert]

When BERT “chooses” which word-pieces to use in its vocabulary, it does so when its tokenizer is trained: it takes all the individual characters plus the most common combinations that it finds in its training corpus. (See Chris McCormick’s helpful blog and video.)
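To make the effect of a fixed vocabulary concrete, here is a minimal pure-Python sketch of greedy longest-match-first splitting, which is how BERT's WordPiece tokenizer breaks a word into pieces. The toy vocabulary and the helper name `wordpiece_tokenize` are made up for illustration; this is not the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical English-leaning toy vocabulary (not BERT's real one).
toy_vocab = {"[UNK]", "play", "##ing", "house", "h", "##a", "##u", "##s"}

english = wordpiece_tokenize("playing", toy_vocab)
german = wordpiece_tokenize("haus", toy_vocab)
german_extended = wordpiece_tokenize("haus", toy_vocab | {"haus"})

print(english)          # ['play', '##ing']
print(german)           # ['h', '##a', '##u', '##s']
print(german_extended)  # ['haus']
```

A word the vocabulary has never seen still gets tokenized, just into many small pieces, which is why an English-only vocabulary tends to fragment German text.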

I’m not entirely sure how to trigger this vocabulary-building, but I expect that your new data wouldn’t be big enough to make your new words count as the “most common”.
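One approach that avoids rebuilding the vocabulary is to append the missing pieces to the existing one and grow the embedding matrix to match. In 🤗 Transformers this is usually done with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. The sketch below shows the idea in plain Python, with no library calls. The helper names, the toy data, and the choice to initialize new rows near the mean of the existing embeddings are all assumptions for illustration, not something the library prescribes.

```python
import random


def extend_vocab(old_vocab, new_pieces):
    """Append unseen pieces at the end so every existing ID stays unchanged.

    Assumes old_vocab maps pieces to contiguous IDs 0..n-1.
    """
    merged = dict(old_vocab)
    for piece in new_pieces:
        if piece not in merged:
            merged[piece] = len(merged)
    return merged


def extend_embeddings(old_rows, new_size, seed=0):
    """Keep the pretrained rows and add new rows near the mean embedding."""
    rng = random.Random(seed)
    dim = len(old_rows[0])
    mean = [sum(row[d] for row in old_rows) / len(old_rows) for d in range(dim)]
    rows = [list(row) for row in old_rows]
    while len(rows) < new_size:
        # Small noise keeps the new rows from being exactly identical.
        rows.append([m + rng.gauss(0.0, 0.02) for m in mean])
    return rows


# Hypothetical toy data: a 3-piece "pretrained" vocabulary with 2-d embeddings.
old_vocab = {"[UNK]": 0, "the": 1, "house": 2}
old_embeddings = [[0.0, 0.0], [0.1, 0.2], [0.3, -0.1]]

# Pieces from a separately trained English+German vocabulary.
mixed_pieces = ["the", "haus", "und"]

merged_vocab = extend_vocab(old_vocab, mixed_pieces)
new_embeddings = extend_embeddings(old_embeddings, len(merged_vocab))

print(merged_vocab)         # existing IDs kept, 'haus' and 'und' appended
print(len(new_embeddings))  # one row per piece in the merged vocabulary
```

The appended rows start out untrained, so the model only learns what the new pieces mean during finetuning, and how well that works depends on how much German data there is.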

Owos · March 28, 2024, 1:46pm

@marton-avrios it’s 2024, found a solution yet?

Deema · February 9, 2025, 2:05am

It is 2025, any solution?
