I am training mT5 on a large dataset (118,208 sentences), and it's taking upwards of 3,149 hours on Google Colab using a TPU.
I trained on the same dataset with T5, which took 7 days. I assume the difference is due to the vocabulary size (T5 uses 32k wordpieces, while mT5 uses 250k).
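To illustrate where I think the extra cost comes from, here's a minimal sketch (assuming the Hugging Face `transformers` library and the standard `t5-base` / `google/mt5-base` checkpoints; swap in whatever sizes you actually use) that compares how much of each model's parameter count sits in the vocabulary embedding, which grows with vocab size and makes every forward/backward pass heavier:

```python
# Compare vocab size and embedding footprint of T5 vs mT5.
# Assumes: transformers + torch installed, and the t5-base /
# google/mt5-base checkpoints (substitute your own model sizes).
from transformers import T5ForConditionalGeneration, MT5ForConditionalGeneration

t5 = T5ForConditionalGeneration.from_pretrained("t5-base")
mt5 = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

for name, model in [("t5-base", t5), ("mt5-base", mt5)]:
    total = sum(p.numel() for p in model.parameters())
    # Input embedding table: vocab_size x d_model parameters.
    embed = model.get_input_embeddings().weight.numel()
    print(f"{name}: vocab={model.config.vocab_size}, "
          f"embedding params={embed:,} ({embed / total:.0%} of {total:,} total)")
```

If I'm reading the configs right, the ~8x larger vocabulary means the mT5 embedding table (and its output projection) is a much bigger share of the model, but I wouldn't expect that alone to turn 7 days into 3,149 hours.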
I have tried playing around with the learning rate (the highest I've tried being 4 and the lowest 2e-05), but nothing seems to help.
Does anybody have any ideas?