Why do transformers overfit quickly? How can I solve it?


I have a general question and appreciate your feedback on it.
I am new to transformers. My main problem is that they overfit very quickly. I am using regularization methods such as augmentation and dropout, but after 2 epochs my validation accuracy starts to drop while the training accuracy reaches its highest (basically my model overfits).
Do you have any suggestions?
Interestingly, I never see this behavior when I use convolutions…

My personal thought is that if you have little data, the model will overfit quickly. If you want to avoid that, reduce the number of epochs. But the best way is to gather more data.
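Rather than hand-picking a fixed number of epochs, you can stop automatically once validation accuracy stops improving. A minimal framework-agnostic sketch (the patience value and the accuracy sequence are just illustrative assumptions):

```python
class EarlyStopping:
    """Stop training once the validation metric stops improving."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience    # epochs to tolerate without improvement
        self.min_delta = min_delta  # smallest change that counts as improvement
        self.best = None
        self.bad_epochs = 0

    def step(self, val_accuracy):
        """Call once per epoch; returns True when training should stop."""
        if self.best is None or val_accuracy > self.best + self.min_delta:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Simulated validation accuracies: they peak at epoch 1, then decline,
# which matches the behavior described in the question.
stopper = EarlyStopping(patience=2)
for epoch, acc in enumerate([0.70, 0.78, 0.76, 0.74, 0.72]):
    if stopper.step(acc):
        print(f"stopping at epoch {epoch}")  # → stopping at epoch 3
        break
```

The same idea is built into most training frameworks (e.g. an early-stopping callback), so in practice you would usually plug in the library version rather than roll your own.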

Having said that, neither transformers nor neural networks in general suffer too much from overfitting; there are papers on this, I believe. They generalize well most of the time. Remember that transformer-like models have a very large number of parameters, which is also one reason for overfitting. But in downstream tasks, even if the model overfits, it is still useful, right?

Pre-training is like a person graduating with a Master's degree. Fine-tuning is like doing a PhD (except here it is quick :slight_smile: ): you use your graduate skills to become an expert in a specific field. So some overfitting is okay. Personal opinion only.


Thanks for your reply. I have the same data in both cases.
But your answer actually helps. I think transformers just learn faster.

Does data augmentation help?
Thanks, I ran into the same problem recently: the model reaches almost 99% training accuracy, but test accuracy stays around 79%.
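For a train/test gap like that, one more regularizer worth trying alongside dropout and augmentation is label smoothing, which softens the one-hot targets so the model is penalized for becoming overconfident on the training set. A minimal sketch (the epsilon value is just an illustrative assumption):

```python
def smooth_labels(one_hot, eps=0.1):
    """Soften a one-hot target: the true class gets 1 - eps + eps/K,
    every other class gets eps/K, where K is the number of classes."""
    k = len(one_hot)
    return [(1 - eps) * p + eps / k for p in one_hot]


# 4-class example: the hard target [0, 0, 1, 0] becomes
# roughly [0.025, 0.025, 0.925, 0.025].
print(smooth_labels([0, 0, 1, 0]))
```

Many loss implementations expose this directly as a parameter (e.g. a `label_smoothing` argument on the cross-entropy loss), so you usually do not need to transform the targets yourself.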