Hello everyone,
I am using a finetuning script for a pretrained T5 model to map one sequence of tokens to another. While I get reasonable improvements on smaller subsets (e.g. 80 train, 20 val), training completely breaks when I run it on larger amounts of data (e.g. 400 train, 100 val). I have already experimented with batch sizes, gradient accumulation, and weight decay. For the learning rate I tried starting at 3e-4 as well as at 5e-5. I have also attached the loss curve for the larger case where training breaks.
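To make the two regimes concrete, here is a quick back-of-the-envelope comparison of how many optimizer updates each epoch performs in the small vs. large case. The batch size and gradient accumulation values below are hypothetical (my actual values varied across experiments); only the dataset sizes 80 and 400 are the real ones:

```python
import math

def optimizer_steps_per_epoch(n_train: int, batch_size: int, grad_accum: int) -> int:
    # Batches per epoch, then one optimizer update per `grad_accum` batches.
    batches = math.ceil(n_train / batch_size)
    return math.ceil(batches / grad_accum)

# Hypothetical settings: batch size 8, gradient accumulation 2
small = optimizer_steps_per_epoch(80, 8, 2)   # small subset  -> 5 updates/epoch
large = optimizer_steps_per_epoch(400, 8, 2)  # larger subset -> 25 updates/epoch
print(small, large)
```

So with the same hyperparameters the larger run takes 5x as many updates per epoch, which also changes how far any learning-rate schedule has decayed at a given epoch; maybe that interacts with the breakdown I'm seeing?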
Does anyone have hints or clues as to what might be the problem in my setup?
Thank you for your time and help!