Why do transformers overfit quickly? How to solve it?


I have a general question and would appreciate your feedback on it.
I am new to transformers. My main problem is that the model overfits very quickly. I am using regularization methods such as augmentation and dropout, but after 2 epochs my validation accuracy starts to drop while the training accuracy reaches its maximum (basically, my model overfits).
Do you have any suggestions?
Interestingly, I never see this behavior when I use convolutions…


My personal thought is that if you have little data, the model will overfit quickly. If you want to avoid that, train for fewer epochs. But the best fix is to gather more data.
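One common way to "reduce epochs" without guessing the number is early stopping: watch the validation metric and stop once it has not improved for a few epochs. Here is a minimal sketch in plain Python. The function name `early_stopping`, the `patience` parameter, and the accuracy values are made up for illustration; in a real run the scores would come from evaluating on your validation set after each epoch.

```python
def early_stopping(val_scores, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation scores.

    Epochs are 0-indexed. Training stops at the first epoch that is
    `patience` epochs past the best one without any improvement.
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            # New best: remember it (this is where you would save a checkpoint).
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here.
            return best_epoch, epoch
    # Ran out of epochs without triggering the stop condition.
    return best_epoch, len(val_scores) - 1


# Hypothetical validation accuracies that peak at the second epoch.
val_acc = [0.70, 0.78, 0.77, 0.76, 0.75]
best_epoch, stop_epoch = early_stopping(val_acc, patience=2)
```

With these made-up numbers the best checkpoint is the one from epoch index 1, and training stops at epoch index 3 instead of running all five epochs.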

Having said that, neither transformers nor neural networks in general suffer too much from overfitting; I believe there are some papers on this. They generalize well most of the time. Remember that transformer-like models have a large number of parameters, which is another reason they overfit. But for downstream tasks, even an overfitted model is still useful, right?

Pre-training is like a person who graduates with a Master's. Fine-tuning is like doing a PhD (except here it is quick 🙂): you use your graduate skills to become an expert in a specific field. So overfitting is okay. Personal opinion only.


Thanks for your reply. I have the same data in both cases.
But your answer actually helps. I think transformers just learn faster.

Does data augmentation help?
Thanks, I ran into the same problem recently: the model reaches almost 99% training accuracy, but test accuracy stays around 79%.


Hint: If you identify overfitting, use your validation set to tune your model's hyperparameters. Once that's done, use your unseen test set for the final testing.
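To make the hint concrete, here is a rough sketch of how the three sets could be carved out of one dataset. The helper name `three_way_split`, the 10%/10% fractions, and the fixed seed are assumptions for illustration, not a recommendation for any particular dataset.

```python
import random


def three_way_split(n_samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle sample indices once and split them into train/val/test."""
    indices = list(range(n_samples))
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * val_frac)
    n_test = int(n_samples * test_frac)
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test


train_idx, val_idx, test_idx = three_way_split(100)
```

The validation indices are used while tuning; the test indices stay untouched until the very end.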

Hey there, sorry for bumping this thread. I found your reply interesting, so I have to ask: how would you use the validation set to tune the model's hyperparameters?

Since you never train on the validation set, you can train multiple models on the same training dataset while adjusting the hyperparameters. You can then select the model that performs best on the validation set (given your metric of choice).
My understanding is that you can then test your model on your test set or, even better, on another test set that you did not create and that contains a fair number of samples that were not in the training set.
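As a toy illustration of that selection loop, here is a sketch that swaps the transformer for a one-parameter ridge regression so it runs in plain Python. The data, the `l2` candidates, and the helper names (`fit_slope`, `mse`, `select_on_validation`) are all made up for the example; the point is only that each candidate is fit on the training set and compared on the validation set.

```python
def fit_slope(xs, ys, l2):
    """Closed-form 1-D ridge regression (no intercept): y ~ w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2)


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_on_validation(candidates, train, val):
    """Fit one model per hyperparameter value; keep the best on validation."""
    best_l2, best_w, best_err = None, None, float("inf")
    for l2 in candidates:
        w = fit_slope(*train, l2)   # train only on the training set
        err = mse(w, *val)          # compare only on the validation set
        if err < best_err:
            best_l2, best_w, best_err = l2, w, err
    return best_l2, best_w, best_err


# Made-up data: noisy training targets, clean validation targets (slope 2).
train = ([1.0, 2.0, 3.0], [1.5, 4.5, 6.5])
val = ([1.0, 2.0], [2.0, 4.0])

best_l2, best_w, best_err = select_on_validation([0.0, 1.0, 10.0], train, val)
```

Here the unregularized fit (`l2 = 0.0`) follows the noisy training data a bit too closely, and `l2 = 10.0` shrinks the slope too much, so the middle value wins on validation. The test set would only be touched once, after this choice is made.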
