I pretrained two models for the Nepali language, which is written in the Devanagari script and is similar to Hindi. The first was a DistilBERT model with a WordPiece tokenizer, and the second was a RoBERTa model with a ByteLevelBPE tokenizer. For the first model I used the OSCAR Nepali dataset, which is relatively small. Even within the first 50,000 optimization steps it was performing really well and could predict masked words from context. For the second model, with the ByteLevelBPE tokenizer, I used a bigger dataset of almost 700k lines, but it is not performing as well as the DistilBERT model. I have a few questions about the tokenizers and the models:
- The WordPiece tokenizer normalized the words in the sentence. Why did this happen?
- Is DistilBERT a better model for pretraining than RoBERTa, or are the bad results from the RoBERTa model due to the tokenizer? From what I have read, RoBERTa has more parameters and is supposed to be a better model than DistilBERT.
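
To make the first question concrete, here is a minimal sketch of one possible cause, assuming the WordPiece tokenizer was built with a BERT-style normalizer that has accent stripping enabled (this is an assumption; the actual tokenizer settings are not shown above). BERT-style accent stripping decomposes text with NFD and drops every character in the Unicode category `Mn` (nonspacing mark). Several Devanagari vowel signs fall in that category, so they would be removed from Nepali words. The helper name `strip_accents` is hypothetical and only uses the standard library:

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Mimic BERT-style accent stripping (assumed behaviour, not the
    tokenizer's own code): NFD-decompose, then drop nonspacing marks (Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# "नेपाली" written with escapes so the code points are unambiguous.
word = "\u0928\u0947\u092a\u093e\u0932\u0940"
stripped = strip_accents(word)

print(word, "->", stripped)

# Show which characters are nonspacing marks (Mn) and therefore dropped.
for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```

In this sketch the vowel sign E (U+0947, category `Mn`) is dropped, while the spacing vowel signs (category `Mc`) survive, so the word changes. If this is what is happening, building the WordPiece tokenizer with `strip_accents=False` (and possibly `lowercase=False`) might be worth trying.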