Continuing training a masked LM: loss going up, performance going down

A few months ago, we trained a RoBERTa-style language model from scratch on Bulgarian/Macedonian. Our TPU credits ran out, so we saved the final checkpoint and uploaded the model. We no longer have access to that exact environment.

We have now decided to train the model for longer. We set up a similar environment with the exact same text dataset, and we use the same version of the Transformers library as before (a GitHub fork). We then loaded the complete saved model (including tokenizer and optimizer state) to continue training, with the same training configuration (arguments) as before. Everything seems to work just fine: the logs tell us we tokenize using our saved tokenizer, the checkpoint weights are loaded, and training continues with the correct optimizer state (the learning rate is correct).
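For reference, our resume setup looks roughly like this. This is a sketch, not our exact launch script: the checkpoint path, `train_dataset`, and the elided hyperparameters are placeholders; we actually run our forked training script with the same arguments as the original run.

```python
from transformers import (
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Placeholder path: our actual final checkpoint directory from the first run.
ckpt = "checkpoint-300000"

# Tokenizer and model weights are loaded from the saved checkpoint.
tokenizer = RobertaTokenizerFast.from_pretrained(ckpt)
model = RobertaForMaskedLM.from_pretrained(ckpt)

# Same training arguments as the original run (values elided here).
args = TrainingArguments(
    output_dir="continued-run",
    per_device_train_batch_size=...,  # same effective batch size (2048)
    learning_rate=...,                # same schedule as before
    max_steps=...,                    # original 300k steps + the extra steps
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)

# Passing the checkpoint path also restores optimizer/scheduler state,
# which is what the logs confirm (learning rate picks up where it left off).
trainer.train(resume_from_checkpoint=ckpt)
```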

Except… it doesn’t work. The loss simply keeps increasing slightly every time we check, and the model’s downstream performance is slightly worse on every task we try. This happens after just 500 extra steps (batch size 2048), whereas the model was previously trained for 300k steps. To be specific, the loss went from 2.05 to 2.15 in those 500 steps, while during the initial training it took over 50k steps to get from 2.15 down to 2.05.
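To put those numbers in perspective, here is the back-of-the-envelope comparison we did (a rough per-step average; the variable names are ours, not from any training code):

```python
# Numbers from our runs: the same 0.10 loss gap was covered in
# 50k steps during original training, but undone in only 500 steps
# after resuming.
delta = 2.15 - 2.05
improve_rate = delta / 50_000  # avg improvement per step, original run
degrade_rate = delta / 500     # avg degradation per step, after resuming

# The loss is climbing roughly 100x faster than it originally fell.
print(round(degrade_rate / improve_rate))
```

So this is not subtle noise around the converged loss; it is a rapid, consistent regression.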

We are simply unable to come up with a reason why this is happening. I know it might be hard to speculate about what is wrong without specific output and examples, but any advice or hunches would be helpful to us. Thanks in advance.