Apologies in advance, this is a two-part question.
Part A) I have a trained tokenizer and a trained LLM (from roberta-base). The LLM was trained using that tokenizer; however, the tokenizer was trained from a BERT tokenizer. Will the BERT/RoBERTa mismatch cause issues or a loss of accuracy down the line?
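For context on what I mean by "mismatch": the immediate failure mode I can rule out is the tokenizer producing ids outside the model's embedding matrix. Here is a minimal sketch of that check; the function is plain Python (so it runs without downloading weights), and the comments show the Hugging Face calls I assume would supply the real numbers:

```python
def embeddings_cover_vocab(tokenizer_vocab_size: int, embedding_rows: int) -> bool:
    """Every token id the tokenizer can emit must index a row in the
    model's input embedding matrix, i.e. ids range over [0, embedding_rows)."""
    return tokenizer_vocab_size <= embedding_rows

# With real objects the two numbers would come from, e.g.:
#   len(tokenizer)                                    # tokenizer's vocab size
#   model.get_input_embeddings().weight.shape[0]      # embedding rows
# and on a mismatch one would call
#   model.resize_token_embeddings(len(tokenizer))
# before training.

# Illustrative values (assumed, not from my actual run):
print(embeddings_cover_vocab(30522, 50265))  # BERT-sized vocab, RoBERTa-sized embeddings
print(embeddings_cover_vocab(50266, 50265))  # one id too many: would crash or corrupt
```

Since my training ran without index errors, I assume the sizes line up, so the question is really about subtler effects (different special tokens, different pre-tokenization) rather than a hard failure.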
Part B) My trained LLM is slightly underperforming the out-of-the-box RoBERTa model on val loss (1.25 vs. 1.30) after 3 epochs of training on ~15 million examples. However, anecdotal evidence from MLM examples suggests the trained model has learned the domain-specific language and is performing better. Why would the out-of-the-box model still have a lower loss through trainer.evaluate()? Is this simply not enough epochs for a big dataset, or could it be related to Part A? Could catastrophic forgetting be to blame, which would explain why the custom LLM outperforms RoBERTa on the examples I care about while RoBERTa has a lower val loss?
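One thing I want to sanity-check about this comparison: since the two models use different tokenizers, their per-token cross-entropies are averaged over different token sequences, so the raw val losses may not be directly comparable. A small sketch of how I am reading the numbers (the 1.25/1.30 values are the ones from my run; everything else is generic math):

```python
import math

def perplexity(mean_ce_loss: float) -> float:
    """Convert a mean cross-entropy in nats/token into perplexity.
    Only comparable across models if both were scored over the SAME
    tokenization of the same eval text."""
    return math.exp(mean_ce_loss)

stock_ppl = perplexity(1.25)   # out-of-the-box RoBERTa val loss
custom_ppl = perplexity(1.30)  # my domain-trained model's val loss

# The absolute gap in loss (0.05 nats) is a ~5% difference in perplexity,
# which different tokenizers could easily introduce on their own: a
# tokenizer that splits domain terms into more, more-predictable pieces
# will score a lower per-token loss on the same text.
print(stock_ppl, custom_ppl)
```

So my suspicion is that trainer.evaluate() may be an apples-to-oranges comparison here unless both models are evaluated with their own matching tokenizer on an identical raw-text val set, with identical masking. Does that reasoning hold?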