When fine-tuning a model like T5, should we tune the learning rate based on the length of the dataloader?
For example, I am currently fine-tuning T5 for just one epoch on 1k sentence pairs with a batch size of 10, which means the optimizer takes 100 steps. I am using a learning rate of 0.1 at the moment, which gives me the lowest training loss without overfitting.
Now I am increasing the data to 1 million pairs. Should I divide the learning rate by 1000? Otherwise, the optimizer will take 100,000 steps at the previous learning rate, which may cause overfitting.
I’m currently using Adam and Adafactor, which help adjust the step size.
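For reference, here is a minimal sketch of my setup, assuming `train_pairs` and `collate_fn` stand in for my actual dataset and collation logic (they are placeholders, not real objects from my code):

```python
import math
from torch.utils.data import DataLoader
from transformers import T5ForConditionalGeneration, Adafactor

model = T5ForConditionalGeneration.from_pretrained("t5-base")

batch_size = 10
num_epochs = 1
# `train_pairs` / `collate_fn` are placeholders for my real data pipeline
train_loader = DataLoader(train_pairs, batch_size=batch_size, collate_fn=collate_fn)

# Optimizer steps grow linearly with dataset size:
#   1k pairs / batch 10 ->     100 steps per epoch
#   1M pairs / batch 10 -> 100,000 steps per epoch
steps_per_epoch = math.ceil(len(train_pairs) / batch_size)
total_steps = steps_per_epoch * num_epochs

# Fixed learning rate with Adafactor; relative_step must be disabled
# when an explicit lr is passed
optimizer = Adafactor(
    model.parameters(),
    lr=0.1,
    relative_step=False,
    scale_parameter=False,
    warmup_init=False,
)
```

So the question is whether `lr=0.1` should be scaled down (e.g. divided by 1000) when `total_steps` grows from 100 to 100,000.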