I have a dataset with 7,000 lines that I plan to use to fine-tune GPT-2. I'm unsure how many steps I should train for, however. Does anybody have advice on how to avoid overfitting?
Avoiding overfitting is a big topic; I'll just list some techniques I know of or have heard may help:
data augmentation, layer-wise learning rates, weight decay, gradient clipping
There are lots of things you can explore.
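To make a few of those terms concrete, here is a minimal sketch in plain PyTorch showing weight decay, layer-wise learning rates, and gradient clipping in one training step. The tiny two-layer model is a hypothetical stand-in for GPT-2, just to keep the example self-contained; the same optimizer and clipping calls apply to a real transformer.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for GPT-2: an embedding "body" and a linear "LM head".
model = nn.Sequential(
    nn.Embedding(100, 32),  # vocab of 100, hidden size 32
    nn.Linear(32, 100),     # project back to vocab logits
)

# Layer-wise learning rates: give earlier layers a smaller lr than later ones
# by putting them in separate optimizer parameter groups.
param_groups = [
    {"params": model[0].parameters(), "lr": 1e-5},
    {"params": model[1].parameters(), "lr": 5e-5},
]

# Weight decay is set on the optimizer; AdamW applies it decoupled from the gradient.
optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)

# One fake training step on random token ids (batch of 8, sequence length 16).
inputs = torch.randint(0, 100, (8, 16))
targets = torch.randint(0, 100, (8, 16))
logits = model(inputs)  # shape (8, 16, 100)
loss = nn.functional.cross_entropy(logits.reshape(-1, 100), targets.reshape(-1))
loss.backward()

# Gradient clipping: cap the global gradient norm before the update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```

Data augmentation for text (e.g. paraphrasing or shuffling examples) happens at the dataset level rather than in the training loop, so it isn't shown here.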