From what I have understood from the HF tutorial, we should use a pretrained model with its own tokenizer for good performance. My doubt is that BERT uses WordPiece, but RoBERTa (again a BERT-style architecture) uses BPE as its tokenization approach. Can we mix and match any model and tokenizer if we are pretraining the model from scratch, as in the case of RoBERTa? In that case, can I pretrain a BERT/DistilBERT model from scratch using a BPE/Unigram tokenizer?
Is the rule of using the same model with the same tokenizer applicable only for fine-tuning or inference? Or is the architecture of each model itself tied to the tokenization approach?
I am trying to pretrain a DistilBERT model from scratch using the Unigram approach.
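Here is roughly what I am attempting, in case it makes the question clearer. This is just a minimal sketch: `corpus.txt`, the vocab size, and the special tokens are placeholders for my actual setup.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import DistilBertConfig, DistilBertForMaskedLM, PreTrainedTokenizerFast

# Train a Unigram tokenizer on my own corpus (corpus.txt is a placeholder)
tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.UnigramTrainer(
    vocab_size=30000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    unk_token="[UNK]",
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Wrap it so it can be used with transformers
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    pad_token="[PAD]",
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Initialize DistilBERT from scratch with a matching vocab size,
# instead of loading pretrained weights
config = DistilBertConfig(vocab_size=fast_tokenizer.vocab_size)
model = DistilBertForMaskedLM(config)
```

So the question boils down to: is there anything in the DistilBERT architecture itself that would make this combination invalid, or is it only the pretrained checkpoints that are tied to their original tokenizers?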