Hi, I load the Roberta pre-trained model from the transformers library and use it for a sentence-pair classification task. Until last week the loss used to decrease every epoch during training, but now, even though all of the parameters (including the batch size and the learning rate) have the same values, the loss does not decrease when I train the model. I am a bit confused: I have trained the model with various parameters and also tried another implementation in PyTorch, but the loss still does not decrease. Can anyone help me figure out the problem?
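(For context, my setup is roughly the following. This is just a minimal sketch assuming TFRobertaModel with a dense head on the embedding of the first token; the real notebook differs in the details.)

```python
import tensorflow as tf
from transformers import RobertaTokenizer, TFRobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = TFRobertaModel.from_pretrained("roberta-base")

# Encode one sentence pair; the tokenizer inserts the separator tokens itself.
inputs = tokenizer("first sentence", "second sentence", return_tensors="tf")

# Use the embedding of the first token as the pair representation, with a
# small dense classification head on top.
sequence_output = encoder(inputs)[0]      # (batch, seq_len, hidden_size)
first_token = sequence_output[:, 0, :]    # (batch, hidden_size)
logits = tf.keras.layers.Dense(2)(first_token)
```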
I can't give you an answer, but just a few questions:
Are you sure you are running exactly the same code that previously worked?
If so:
are you getting exactly the same output, including that warning about not using all the roberta parameters?
(That's a lot of layers not being used.)
has your data been changed?
has the colab environment changed - for example, is it the same version of transformers?
What is the loss function value before you start training?
What would you expect the loss to be showing as?
Could it possibly be training completely within the first epoch?
Do you still have a notebook (with output) that shows what used to happen when it was working?
I just tried running your code, and I think the loss does indeed decrease. (It fluctuates a lot [e.g. at .2x] and looks like it is increasing for some time, but if you wait long enough you can see it is slowly decreasing [e.g. to .18x].) In the interactive log, I saw the progress bar stop quite early, so maybe you can wait for 1 epoch to conclude.
Moreover, since you use TF, it's pretty straightforward to use a TPU, which can give at least a 4x speed boost in Colab. This Kaggle notebook shows a very concise way to efficiently train/predict with Huggingface's XLMRoberta (which has the same format as Roberta). Hope it helps!
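In case it's useful, the usual Colab TPU boilerplate for TF 2.x looks roughly like this (just a sketch; build_model is a placeholder for whatever builds your Keras model):

```python
import os
import tensorflow as tf

# Standard Colab TPU setup: locate the TPU runtime and build a distribution strategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu="grpc://" + os.environ["COLAB_TPU_ADDR"]
)
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

# Create the model (and optimizer) inside the strategy scope so the variables
# are replicated across the TPU cores.
with strategy.scope():
    model = build_model()  # placeholder for your own model-building code
```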
Actually no, the warning messages have changed in two cases:
when converting the dataset to features, I used to get the message below multiple times, while now I receive it only once:
"Token indices sequence length is longer than the specified maximum sequence length for this model (513 > 512). Running this sequence through the model will result in indexing errors"
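(For what it's worth, that particular warning can be avoided by truncating to the model's 512-token limit when encoding the pairs, roughly like this:)

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Truncate the encoded sentence pair to Roberta's maximum length so that no
# sequence is longer than the model can index.
features = tokenizer(
    "first sentence", "second sentence",
    truncation=True, max_length=512,
)
```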
about the parameters, the previous warning was this:
"Some weights of the model checkpoint at roberta-base were not used when initializing ROBERTA: ['lm_head'] - This IS expected if you are initializing ROBERTA from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model). - This IS NOT expected if you are initializing ROBERTA from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). All the weights of ROBERTA were initialized from the model checkpoint at roberta-base."
(but I searched the new warning and noticed that, as long as I am using the embedding of the first token, this warning is not important.)
data is the same
judging by the changes in the warning messages, it seems that the version of transformers has changed, but I do not know what the version was before, and when I now try it with a lower version, I still get the same output
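(This is how I am checking and pinning the version in the Colab runtime, in case that matters:)

```python
import transformers

# Print the transformers version currently installed in the Colab runtime.
print(transformers.__version__)

# To try an older release, reinstall and then restart the runtime, e.g.:
# !pip install transformers==<some older version>
```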
I do not know, but I can check
the loss function used to start from 0.20 and decrease by at least 0.02 per epoch, and the model used to converge to a loss of zero
yes, I have a notebook that was trained on October 15th and everything was OK with it
( I can share it if you need)
The colab env in both cases is GPU
I do not know the exact time, but I am sure that it was working well until October 15
thank you for your response,
Actually, I trained the model for more than 5 epochs and observed no meaningful decrease in the loss value; it was just fluctuating near 0.2, while it used to decrease by at least 0.2 per epoch previously.
That is strange, since in my run the loss decreases to 0.18 within 1/5 of an epoch (less than 300 steps). Do you observe the same?
(you can see from the interactive log)
Okay. I don't have time to run more than 1 epoch today, so maybe I can try tomorrow.
Nevertheless, I agree with your belief that the WARNING is not important, since all of the warned layers seem to be the output head of the pretraining Roberta model, and we are not using that head in fine-tuning.
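For what it's worth, one way to see exactly which checkpoint weights are skipped is to ask from_pretrained for its loading info (PyTorch class shown here; I believe the TF classes accept the same flag):

```python
from transformers import RobertaModel

# Load the encoder and also return a dict describing how the checkpoint
# weights were matched to the model.
model, loading_info = RobertaModel.from_pretrained(
    "roberta-base", output_loading_info=True
)

# Weights present in the checkpoint but not used by this architecture
# (for roberta-base this is where the pretraining LM head shows up).
print(loading_info["unexpected_keys"])

# Weights the architecture needs but the checkpoint did not provide
# (these get freshly initialized).
print(loading_info["missing_keys"])
```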
I also think the version should be irrelevant, since Roberta is a relatively old model in HF and has been very stable.
I still suggest you use a TPU though, since it takes too much time to experiment with more than 1 epoch.
I changed your code to use TPU, plus a few other changes, e.g. setting Each_seq_length = 128 and removing dropout, to simplify the experiment.
In the first few epochs the loss indeed looks 0.2x-ish, but it is still decreasing. You can clearly see "overfitting" after epoch 8. So everything looks fine to me.
Thank you very much for your helpful guide; it works fine after 6-7 epochs. I think the delay in learning was due to the mentioned warning, because I used TFAutoModel.from_pretrained("roberta-base") instead of TFRobertaModel.from_pretrained("roberta-base"), and as the latter initializes all of the weights, it converges much sooner than the existing model.
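(For reference, these are the two loading calls I am comparing; both point at the same roberta-base checkpoint, and the difference in convergence is just what I observed in my own runs:)

```python
from transformers import TFAutoModel, TFRobertaModel

# What the notebook used originally: AutoModel picks the Roberta class based
# on the checkpoint's config.
auto_model = TFAutoModel.from_pretrained("roberta-base")

# What I switched to: loading the Roberta class directly.
roberta_model = TFRobertaModel.from_pretrained("roberta-base")
```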