Questions about the connection between tokenizer and the model

arbitropy · September 19, 2023, 10:54am

I have some basic questions that I have tried hard to understand but have never found clear answers to.

1. I have been continuing the pretraining of pretrained models like BERT in my own language. If I retrain the tokenizer on a dataset in the new language, doesn't that essentially mean pretraining BERT from scratch, because of the new vocabulary? Or is there still some benefit to starting from a pretrained model rather than creating a new one?
2. The MLM pretraining guide in the course uses a WordPiece tokenizer and groups the tokenized text into equal-length chunks of 128 tokens. When I pretrain a model that uses a BPE tokenizer, like roberta-base, is the whole process still the same as for BERT: chunking, then using word_ids to generate the mask tokens and labels? If so, why does the word_ids column show many more words for the same sentence with BPE than with WordPiece?
3. As far as I have seen, a BPE tokenizer, even one trained on another language, can tokenize and decode without producing UNK tokens. Is there any benefit to retraining the tokenizer on the new dataset?
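To make the chunking and masking steps I am asking about concrete, here is a minimal sketch in plain Python. It is not the course's exact code: the function names `group_texts` and `whole_word_mask`, the `chunk_size` default, the masking probability, and the token ids are illustrative assumptions, and in real code `input_ids` and `word_ids` would come from the tokenizer's output.

```python
import random


def group_texts(examples, chunk_size=128):
    """Concatenate tokenized examples and split them into equal-length chunks.

    `examples` maps a column name (e.g. "input_ids", "word_ids") to a list of
    per-sentence lists. The trailing remainder shorter than chunk_size is dropped.
    """
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [v[i:i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, v in concatenated.items()
    }
    # For MLM the labels start out as a copy of the inputs.
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result


def whole_word_mask(input_ids, word_ids, mask_token_id, prob=0.2, seed=0):
    """Mask every token of a randomly chosen word; other labels become -100."""
    rng = random.Random(seed)
    # Group token positions into words. A None word id (special token)
    # ends the current word, so words from adjacent sentences never merge.
    words = []
    current_word = None
    for pos, word_id in enumerate(word_ids):
        if word_id is None:
            current_word = None
            continue
        if word_id != current_word:
            current_word = word_id
            words.append([])
        words[-1].append(pos)

    masked = list(input_ids)
    labels = [-100] * len(input_ids)
    for positions in words:
        if rng.random() < prob:
            for pos in positions:
                labels[pos] = input_ids[pos]
                masked[pos] = mask_token_id
    return masked, labels
```

My understanding is that nothing in this sketch depends on whether the tokenizer is WordPiece or BPE, as long as it provides word ids; only the grouping of tokens into words would differ.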

I have vague answers to these questions, but I would like clearer ones. Thank you.
