I have come across this for two different tasks now, where my setup basically looks as follows:
A smaller dataset for fine-tuning (500k samples), as well as a larger version of this data set (2 million samples).
Hyperparameters are 1000 iterations warmup, 3 epochs training duration, and otherwise default.
Previous runs with the small dataset gave decent results for BERT, and slightly better results with RoBERTa. However, once I go and train with the larger dataset, RoBERTa models no longer show any signs of convergence and instead just predict nonsense. Note that the BERT model still (consistently) performs fine.
This is a problem across several (6) random seeds!
My question now is whether someone else has observed a similar behavior, or whether there are some caveats to the parameters that only let selective models reach a stable training state. Generally the RoBERTa results were better on the smaller data, so obviously I’d like to go with a stable run on the larger data as well.
hey @dennlinger when you say:
do you mean that both the training and validation loss don’t decrease or something else?
the two main parameters i’ve needed to tune XLM-R (not RoBERTa exactly, but close enough) effectively for text classification have been the learning rate and number of warmup steps, with the former generally needing to be in the 10^-6 to 10^-5 range.
in the RoBERTa paper they used 6% of the total training steps for the warmup, so perhaps you could see whether increasing this to say 100,000 steps helps in your 2M samples training set.
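To put the 6% figure in terms of steps, here is a rough sketch of the arithmetic. The batch size of 32 is purely an assumption (it is not stated anywhere in this thread), and `warmup_steps` / `lr_multiplier` are hypothetical helper names, not library functions. The second function shows the usual linear warmup followed by linear decay, expressed as a multiplier on the base learning rate.

```python
import math


def warmup_steps(num_samples, batch_size, epochs, warmup_ratio=0.06):
    """Return (total_steps, warmup_steps), assuming one optimizer step per batch."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    return total_steps, round(total_steps * warmup_ratio)


def lr_multiplier(step, num_warmup, total_steps):
    """Linear warmup from 0 to 1, then linear decay back to 0."""
    if step < num_warmup:
        return step / max(1, num_warmup)
    return max(0.0, (total_steps - step) / max(1, total_steps - num_warmup))


# batch_size=32 is an assumption; the thread does not say what was used.
for name, n in [("500k", 500_000), ("2M", 2_000_000)]:
    total, warm = warmup_steps(n, batch_size=32, epochs=3)
    print(f"{name}: total steps = {total}, warmup steps = {warm}")
```

With a different batch size or gradient accumulation the numbers change accordingly, so it is probably worth recomputing this for the actual setup before picking a warmup value.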
what task are you working on?
@lewtun Thank you for your response. The loss does decrease, but the model does not learn anything useful. The task is a sequence-to-sequence task, where the encoder and decoder are both RoBERTa models. For the “bad” models the generated output is usually something like:
the="="="="="="="="="="="="="="="="="="="="="="="="="="="="="="="=" The the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the....
(Despite the fact that we set no_repeat_ngram_size=3.) The point about the warmup steps is valid, and maybe we should increase them. I will try that. What is strange is that the training setup for the smaller sample (500K) and the larger one (2M) is exactly the same; the only difference is the data. That makes such instability very strange.
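Since `no_repeat_ngram_size` is applied to token ids rather than to the decoded text, one quick sanity check is to look for repeated trigrams directly in the generated ids. The sketch below is only a diagnostic under that assumption; `repeated_ngrams` is a hypothetical helper and the token ids are made up for illustration.

```python
from collections import Counter


def repeated_ngrams(token_ids, n=3):
    """Return the n-grams that occur more than once in token_ids, with their counts."""
    ngrams = (tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1))
    counts = Counter(ngrams)
    return {gram: c for gram, c in counts.items() if c > 1}


# Made-up token ids: 5 stands in for a token such as " the".
looping = [7, 5, 5, 5, 5, 5]
clean = [7, 5, 9, 5, 7, 9]

print(repeated_ngrams(looping))  # the trigram (5, 5, 5) shows up three times
print(repeated_ngrams(clean))    # empty dict, no trigram repeats
```

If a check like this finds repeats in the raw output ids, that would suggest the setting is not reaching the generation call at all, which is a separate problem from the training instability.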
oh this sounds very peculiar indeed! if i understand correctly, you’re using the EncoderDecoderModel (docs) to create your Roberta2Roberta model right?
one thing i notice in the code snippet example is that you need to specify the decoder_start_token_id when generating:
generated = model.generate(input_ids, decoder_start_token_id=model.config.decoder.pad_token_id)
this wouldn’t explain why you get good generations with the smaller dataset, but is worth double-checking. another idea would be to swap the Roberta decoder for a genuine autoregressive model like GPT-2 - if the effect persists, this would suggest something is strange in your larger corpus
@dennlinger @satyaalmasian Have you figured out how to make RoBERTa converge? I'm facing the same issue. Thanks
@brgsk Unfortunately not; we decided to go with training initialized from BERT-base checkpoints instead, which worked much better.
Oh I see, that’s strange. Thanks a lot!
I have observed similar instability while using RoBERTa for classification and sequence labelling tasks. It gives really good results in one run and in the next run, it starts giving random results. BERT seems to give much more consistent results.