What would be the best way to somehow “mix” a SentencePiece vocabulary trained on a corpus of English and German documents with the existing English-only vocabulary of a pretrained transformer? So I take the pretrained model (let’s say English BERT, though it’s WordPiece, I know), somehow create a new mixed vocabulary, and then finetune on my mixed-language downstream task.
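To make the question concrete, here is a rough sketch of what I mean by “mixing”: take BERT’s WordPiece vocab and append the SentencePiece tokens it doesn’t already contain, translating between the two marking conventions (SentencePiece prefixes word-initial pieces with `▁`, while WordPiece prefixes word *continuations* with `##`). The token lists are toy examples, not real vocab files:

```python
def merge_vocabs(wordpiece_tokens, sentencepiece_tokens):
    """Append SentencePiece tokens that the WordPiece vocab doesn't already have.

    SentencePiece marks word starts with '\u2581' (the "lower one eighth block"
    character); WordPiece instead marks word continuations with '##', so each
    token is translated before checking for membership.
    """
    merged = list(wordpiece_tokens)
    known = set(wordpiece_tokens)
    for tok in sentencepiece_tokens:
        if tok.startswith("\u2581"):      # word-initial piece in SentencePiece
            candidate = tok[1:]           # plain token in WordPiece
        else:
            candidate = "##" + tok        # word-continuation piece in WordPiece
        if candidate and candidate not in known:
            merged.append(candidate)
            known.add(candidate)
    return merged

wp = ["[PAD]", "[UNK]", "the", "##ing"]
sp = ["\u2581the", "\u2581Haus", "lich"]
print(merge_vocabs(wp, sp))
# → ['[PAD]', '[UNK]', 'the', '##ing', 'Haus', '##lich']
```

Even if this merge is straightforward, the new tokens would have no pretrained embeddings, which is part of what I’m asking about.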
[I am not an expert]
When BERT “chooses” which word-pieces to use in its vocabulary, it does so by taking all the individual characters plus the most common combinations that it finds in its training corpus. (See Chris McCormick’s helpful blog and video).
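The “most common combinations” idea can be illustrated with a toy pair-counting step like the one below. Note this is a BPE-style simplification for intuition only — real WordPiece scores candidate merges by likelihood, not raw frequency, and the corpus here is made up:

```python
from collections import Counter

def most_common_pair(word_freqs):
    """Count adjacent character pairs across a toy corpus and return the
    single most frequent pair, i.e. the next candidate merge."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0]

corpus = {"low": 5, "lower": 2, "newest": 6}
print(most_common_pair(corpus))
# → (('w', 'e'), 8)
```

Repeating this counting-and-merging loop is (roughly) how the subword vocabulary grows beyond single characters.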
I’m not entirely sure how you would re-run this vocabulary-building step on top of a pretrained model, but I expect your new data wouldn’t be big enough for your new words to count as the “most common” anyway.