Finetuning T5 problems

Hello everyone,

I want to use a finetuning script for a pretrained T5 model to map one sequence of tokens to another. While I get reasonable improvements on smaller subsets (e.g. 80 train, 20 val), training completely breaks when I run it on larger amounts of data (e.g. 400 train, 100 val). I have already experimented with batch sizes, gradient accumulation and weight decay. For the learning rate I tried starting at 3e-4 as well as 5e-5. I have also attached the loss curve for the larger case where training breaks.
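For reference, here is a minimal sketch of the kind of setup I mean (the model name, toy data and exact hyperparameter values are just illustrative, not my actual script), using a plain PyTorch loop so the learning rate, weight decay and gradient accumulation are explicit:

```python
# Minimal sketch of a T5 finetuning loop; model name and data are placeholders.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"  # assumption: any pretrained T5 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy source/target pairs standing in for the real token-mapping data.
train_pairs = [("map tokens: a b c", "x y z")] * 8

def collate(batch):
    sources, targets = zip(*batch)
    enc = tokenizer(list(sources), padding=True, truncation=True, return_tensors="pt")
    labels = tokenizer(list(targets), padding=True, truncation=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(train_pairs, batch_size=4, shuffle=True, collate_fn=collate)

# Hyperparameters in the ranges mentioned above (illustrative values).
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
accum_steps = 2

model.train()
for epoch in range(3):
    for step, batch in enumerate(loader):
        loss = model(**batch).loss / accum_steps
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```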

Does someone have any hints or clues what might be the problem in my setup?

Thank you for your time and help


It’s difficult to pinpoint the issue with only the learning rate and the loss curve to go on, but the common pitfalls are usually along these lines…