I have run into this with two different tasks now, and my setup looks roughly the same both times:
A smaller dataset for fine-tuning (500k samples), and a larger version of the same dataset (2 million samples).
Hyperparameters are 1000 warmup iterations, a training duration of 3 epochs, and otherwise defaults.
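One thing worth noting about this setup: with a fixed 1000-step warmup, the warmup covers a much smaller fraction of training once the dataset grows 4x. The sketch below illustrates this with a linear warmup/linear decay multiplier (a common default schedule; the batch size of 32 is a hypothetical value for illustration, not from my actual runs):

```python
def linear_warmup_linear_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: ramps linearly from 0 to 1 over warmup_steps,
    then decays linearly back to 0 at total_steps."""
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

batch = 32  # hypothetical batch size, only to make the step counts concrete
warmup = 1000

small_total = 500_000 // batch * 3    # 46,875 steps over 3 epochs
large_total = 2_000_000 // batch * 3  # 187,500 steps over 3 epochs

print(warmup / small_total)  # warmup spans roughly 2% of training on the small set
print(warmup / large_total)  # but only roughly 0.5% on the large set
```

So under otherwise identical settings, the large-dataset run spends proportionally far less time in warmup, which could plausibly interact with the stability difference between the two models.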
Previous runs on the small dataset gave decent results with BERT, and slightly better results with RoBERTa. However, once I train on the larger dataset, the RoBERTa models no longer show any sign of convergence and just predict nonsense. Note that the BERT model still performs fine, and does so consistently.
This happens across all six random seeds I tried!
My question is whether anyone else has observed similar behavior, or whether there are caveats to these parameters that allow only some models to reach a stable training state. Since RoBERTa gave the better results on the smaller dataset, I would obviously like to get a stable run on the larger dataset as well.