The loss value is not decreasing when training the Roberta model

Hi, I load the pre-trained Roberta model from the transformers library and use it for a sentence-pair classification task. The loss used to decrease every epoch during training until last week, but now, even though all of the parameters, including the batch size and the learning rate, have the same values, the loss is not decreasing when I train my model. I am a little confused: I have trained my model with various parameters and also tried another implementation in PyTorch, but the loss still does not decrease. Can anyone help me figure out the problem?

here is the link to my code:

and the dataset:
https://drive.google.com/drive/folders/1CUH_z_HI31-yfj8hOmRfJBKRKe_BNkku
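
Roughly, the setup is the following (a simplified sketch of the idea, not the exact notebook code; it assumes a binary sentence-pair label and a sigmoid head on top of the first-token embedding):

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

# Simplified sketch: encode a sentence pair and classify it with a sigmoid head
# on top of the first-token embedding of roberta-base.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = TFAutoModel.from_pretrained("roberta-base")

def build_model(max_len=512):
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")
    hidden = encoder(input_ids, attention_mask=attention_mask)[0]  # last hidden states
    first_token = hidden[:, 0, :]                                  # embedding of the first (<s>) token
    output = tf.keras.layers.Dense(1, activation="sigmoid")(first_token)
    model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)
    model.compile(optimizer=tf.keras.optimizers.Adam(2e-5), loss="binary_crossentropy")
    return model

model = build_model()

# The tokenizer inserts the separator tokens between the two sentences.
enc = tokenizer("first sentence", "second sentence",
                padding="max_length", truncation=True, max_length=512,
                return_tensors="tf")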

I can’t give you an answer, but just a few questions:

Are you sure you are running exactly the same code that previously worked?
If so:
are you getting exactly the same output, including that warning about not using all the roberta parameters?
(That’s a lot of layers not being used.)
has your data been changed?
has the colab environment changed - for example, is it the same version of transformers?
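
(For the version question, these standard calls will print what is currently installed in the Colab runtime, so you can compare against the run that used to work:)

import transformers
import tensorflow as tf

# Print the library versions in the current runtime.
print("transformers:", transformers.__version__)
print("tensorflow:", tf.__version__)

# To pin a specific version in Colab you could run, for example
# (the exact version number here is just a placeholder):
# !pip install transformers==3.1.0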

What is the loss function value before you start training?
What would you expect the loss to be showing as?
Could it possibly be training completely within the first epoch?
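
(A quick way to check the starting loss, in plain Keras: here model, train_dataset and val_dataset are placeholders for whatever your notebook builds.)

# Evaluate before any fine-tuning to see where the loss starts from.
initial_loss = model.evaluate(val_dataset, verbose=0)
print("loss before training:", initial_loss)

# Then train a single epoch and compare, to see whether most of the
# learning happens almost immediately.
model.fit(train_dataset, validation_data=val_dataset, epochs=1)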

Do you still have a notebook (with output) that shows what used to happen when it was working?

Is your Colab Runtime set to GPU or CPU?
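
(You can confirm the accelerator from code with the standard TensorFlow call:)

import tensorflow as tf

# An empty list means the Colab runtime is CPU-only.
print(tf.config.list_physical_devices("GPU"))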

Finally: exactly when did it stop working?

Hi @zahraabbasian ,

I just tried running your code, and I think the loss does indeed decrease. (It fluctuates a lot [e.g. at .2x] and looks like it is increasing for some time, but if you wait long enough you can see it slowly decreasing [e.g. to .18x].) In the interactive log I saw the progress bar stop quite early, so maybe you can wait for 1 epoch to finish.

Moreover, since you use TF, it’s pretty straightforward to use a TPU, which can give at least a 4x speed boost in Colab. This Kaggle notebook shows a very concise way to efficiently train/predict with Huggingface’s XLMRoberta (which has the same format as Roberta). Hope it helps!
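
For reference, the usual Colab TPU initialization in TF looks roughly like this (standard boilerplate, with build_model standing in for whatever builds your Keras model):

import tensorflow as tf

# Connect to the Colab TPU, then build and compile the model inside the
# strategy scope so its variables are placed on the TPU.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)  # tf.distribute.experimental.TPUStrategy on older TF

with strategy.scope():
    model = build_model()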

thanks for your answer,
the code is the same

  • actually no, the warning messages have changed in two cases:
  1. when converting the dataset to features I used to get the message below multiple times, whereas now I receive it only once (a generic truncation snippet is included at the end of this reply):

"Token indices sequence length is longer than the specified maximum sequence length for this model (513 > 512). Running this sequence through the model will result in indexing errors"

  2. regarding the parameters, the previous warning was this:
    "Some weights of the model checkpoint at roberta-base were not used when initializing ROBERTA: ['lm_head'] - This IS expected if you are initializing ROBERTA from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model). - This IS NOT expected if you are initializing ROBERTA from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). All the weights of ROBERTA were initialized from the model checkpoint at roberta-base."

(but when I searched the new warning I noticed that, as long as I am using the embedding of the first token, this warning is not important.)

  • data is the same

  • given the changes in the warning messages it seems that the version of transformers has changed, but I do not know what the version was before, and when I try a lower version now I still get the same output

  • I do not know, but I can check

  • the loss function used to start from 0.20 and decrease by at least 0.02 per epoch, and the model used to converge to a loss of zero

  • yes, I have a notebook which was trained on October 15th and everything was fine with that
    (I can share it if you need it)

  • The Colab environment in both cases is GPU

  • I do not know the exact time, but I am sure that it was working well until October 15
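
For completeness, the standard way to avoid the 513 > 512 warning is to truncate at tokenization time (generic tokenizer arguments, not the notebook's exact code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# truncation=True caps every encoded pair at max_length, which avoids the
# "513 > 512" indexing warning at the cost of dropping the extra tokens.
enc = tokenizer("first sentence", "second sentence",
                truncation=True, max_length=512, padding="max_length")
print(len(enc["input_ids"]))  # 512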

thank you for your response,
actually, I trained the model for more than 5 epochs and observed no real decrease in the loss value; it just kept fluctuating near 0.2, while previously it used to decrease by at least 0.02 per epoch.

That is strange, since in my run the loss decreases to 0.18 within 1/5 of an epoch (fewer than 300 steps). Do you observe the same?
(You can see it from the interactive log.)

yes, it fluctuates during the epoch, but when the epoch finishes it finally converges to around 0.2.


Okay, I don’t have time to run more than 1 epoch today, so maybe I can try tomorrow.
Nevertheless, I agree with your belief that the WARNING is not important, since all of the warned layers seem to be the output head of the pretraining Roberta, and we are not using that head in finetuning.

I also think the version should be irrelevant, since Roberta is a relatively old model in HF and has been very stable.

I still suggest you use a TPU though, since otherwise it takes too much time to experiment with more than 1 epoch.

I changed your code to use a TPU and made a few other changes, e.g. setting Each_seq_length = 128 and removing dropout, to simplify the experiment.

For the first few epochs the loss indeed looks 0.2x-ish, but it is still decreasing. You can clearly see "overfitting" after epoch 8. So everything looks fine to me.

Epoch 1/18
319/319 [==============================] - 148s 465ms/step - loss: 0.2118 - val_loss: 0.2075
Epoch 2/18
319/319 [==============================] - 66s 207ms/step - loss: 0.2062 - val_loss: 0.2024
Epoch 3/18
319/319 [==============================] - 66s 208ms/step - loss: 0.2046 - val_loss: 0.1902
Epoch 4/18
319/319 [==============================] - 67s 209ms/step - loss: 0.2022 - val_loss: 0.1905
Epoch 5/18
319/319 [==============================] - 67s 209ms/step - loss: 0.2004 - val_loss: 0.2246
Epoch 6/18
319/319 [==============================] - 67s 209ms/step - loss: 0.1994 - val_loss: 0.1829
Epoch 7/18
319/319 [==============================] - 67s 209ms/step - loss: 0.1866 - val_loss: 0.1857
Epoch 8/18
319/319 [==============================] - 67s 209ms/step - loss: 0.1639 - val_loss: 0.1929
Epoch 9/18
319/319 [==============================] - 67s 210ms/step - loss: 0.1417 - val_loss: 0.5679
Epoch 10/18
319/319 [==============================] - 67s 209ms/step - loss: 0.1123 - val_loss: 0.4180
Epoch 11/18
270/319 [========================>.....] - ETA: 9s - loss: 0.0942
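
Given the val_loss pattern above, a standard way to stop around the best epoch (plain Keras, with model, train_dataset and val_dataset as placeholders for the notebook's objects) would be an EarlyStopping callback:

import tensorflow as tf

# Stop once val_loss has not improved for a couple of epochs and restore the
# best weights, so the run ends near where overfitting begins.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=2, restore_best_weights=True)

model.fit(train_dataset, validation_data=val_dataset,
          epochs=18, callbacks=[early_stop])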

You can see my modified Colab here.

Hope it helps!


Thank you very much for your helpful guide, it works fine after 6-7 epochs. I think this delay in learning is due to the mentioned warning, because I used TFAutoModel.from_pretrained('roberta-base') instead of TFRobertaModel.from_pretrained('roberta-base'), and as it initializes all of the weights, it converges much sooner than the existing model.
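
In case it helps to compare the two loading paths, the standard calls look like this; printing the types shows which class each call actually instantiates and lets you compare the warnings each one produces:

from transformers import TFAutoModel, TFRobertaModel

# Load the same checkpoint through both entry points and compare the
# resulting classes and the warnings they print.
auto_model = TFAutoModel.from_pretrained("roberta-base")
explicit_model = TFRobertaModel.from_pretrained("roberta-base")
print(type(auto_model), type(explicit_model))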