I have some basic questions that I have tried hard to answer on my own but never found a clear answer to.
- I have been continuing the pretraining of pretrained models like BERT on my own language. When doing so, if I retrain the tokenizer on the new language's dataset, won't that essentially mean pretraining BERT from scratch on the new language, because of the new vocabulary? Or is there still some benefit to starting from a pretrained model rather than creating a new one?
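  My vague understanding of why the checkpoint might still help is that tokens shared between the old and new vocabularies could keep their pretrained embeddings. Here is a toy sketch of that idea (all names are hypothetical, not from any library):

  ```python
  import random

  def init_new_embeddings(old_vocab, old_embeddings, new_vocab, dim):
      """Reuse pretrained embedding rows for tokens present in both
      vocabularies; randomly initialize embeddings for new tokens."""
      new_embeddings = []
      for token in new_vocab:
          if token in old_vocab:
              # Token survived the vocabulary swap: keep its pretrained vector.
              new_embeddings.append(old_embeddings[old_vocab[token]])
          else:
              # Genuinely new token: small random init, trained during MLM.
              new_embeddings.append([random.gauss(0.0, 0.02) for _ in range(dim)])
      return new_embeddings

  # Tiny illustrative vocabularies and 4-dimensional embeddings.
  old_vocab = {"[CLS]": 0, "[SEP]": 1, "the": 2, "##ing": 3}
  old_emb = [[float(i)] * 4 for i in range(4)]
  new_vocab = ["[CLS]", "[SEP]", "namaste", "##ing"]

  emb = init_new_embeddings(old_vocab, old_emb, new_vocab, dim=4)
  # "[CLS]", "[SEP]" and "##ing" keep their old vectors; "namaste" is fresh.
  ```

  (The non-embedding transformer layers would be reused unchanged, which is presumably where the remaining benefit lies.)
  
  
  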
- The MLM pretraining guide in the course uses a WordPiece (subword) tokenizer and splits the corpus into equal 128-token chunks to create fixed-length examples. When I pretrain a model that uses a BPE tokenizer, like roberta-base, is the whole process still the same as for BERT pretraining: chunking, then using word_ids to generate the [MASK] tokens and labels? If so, why does the word_ids column show many more words for the same sentence with the BPE tokenizer than with the WordPiece one?
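  For reference, the chunking step I mean is roughly the following, independent of which tokenizer produced the ids (a simplified sketch of the course's `group_texts` helper; the name and `chunk_size` default are illustrative):

  ```python
  def group_texts(examples, chunk_size=128):
      """Concatenate a batch of tokenized examples into one long sequence,
      then split it into equal chunk_size pieces, dropping the final
      partial chunk, as the course does."""
      concatenated = {k: sum(examples[k], []) for k in examples}
      total_len = (len(concatenated["input_ids"]) // chunk_size) * chunk_size
      result = {
          k: [v[i : i + chunk_size] for i in range(0, total_len, chunk_size)]
          for k, v in concatenated.items()
      }
      # Labels start as a copy of the inputs; masking replaces some later.
      result["labels"] = [chunk[:] for chunk in result["input_ids"]]
      return result

  # Two "documents" of 100 fake token ids each -> 200 tokens total.
  batch = {"input_ids": [list(range(100)), list(range(100, 200))]}
  chunks = group_texts(batch, chunk_size=128)
  # 200 tokens yield one 128-token chunk; the 72 leftover tokens are dropped.
  ```
  
  
  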
- A BPE tokenizer, even one trained on another language, can tokenize and decode text without any UNK tokens, as far as I have seen. Is there any benefit, then, in retraining the tokenizer on the new dataset?
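  My rough mental model of why byte-level BPE never needs UNK is that, in the worst case, it can always fall back to one token per UTF-8 byte. This is only the degenerate no-merges case, not the real algorithm, but it shows the coverage guarantee (function names are made up for illustration):

  ```python
  def byte_fallback_encode(text):
      """Worst-case byte-level BPE: one token id per UTF-8 byte.
      Every possible byte has a token, so nothing is ever unknown."""
      return list(text.encode("utf-8"))

  def byte_fallback_decode(ids):
      """Reassemble the bytes and decode back to the original string."""
      return bytes(ids).decode("utf-8")

  hindi = "नमस्ते"  # out-of-vocabulary for an English-trained tokenizer
  ids = byte_fallback_encode(hindi)
  assert byte_fallback_decode(ids) == hindi  # round-trips with no UNK
  ```

  Presumably the benefit of retraining is not coverage but efficiency: a tokenizer trained on the new language learns merges for its frequent words, so sentences become far fewer tokens than this byte-per-token worst case.
  
  
  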
I have vague answers to these questions, but I would like clearer answers to these queries and to the follow-ups that will come after. Thank you.