Finetuning ByT5 with a batch size of 1 on T4 GPU

Hey guys!

I am working with the ByT5 model and TydiQA.

Specifically, I am finetuning a ByT5 model on subsets of Tydi-GoldP for a specific language. I first worked with PyTorch on an A100 GPU (80 GB), but now I am moving to a T4 GPU (roughly 16 GB).

On the A100, I used an input sequence length of 2048, an output length of 768, a learning rate of 1e-4 (Adafactor), a batch size of 6, and gradient accumulation over 4 steps.

To fit on the T4, I switched to a batch size of 1 with gradient accumulation over 24 steps, so the effective batch size of 24 is preserved.
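For clarity, the equivalence I am relying on is simply per-device batch size × accumulation steps (the function name below is mine, just to make the arithmetic explicit):

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int) -> int:
    """Number of examples contributing to each optimizer update
    (single-GPU case: per-step batch size times accumulation steps)."""
    return per_device_batch * grad_accum_steps

# A100 setup: batch size 6, accumulation over 4 steps
a100 = effective_batch_size(per_device_batch=6, grad_accum_steps=4)

# T4 setup: batch size 1, accumulation over 24 steps
t4 = effective_batch_size(per_device_batch=1, grad_accum_steps=24)

print(a100, t4)  # both setups give an effective batch size of 24
```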

However, the results are far apart. With the A100 setup I got a 60 F1 score on the English subset, but on the T4 I cannot get past a 30 F1 score. The training loss on the A100 is about 0.01, while on the T4 I get a mean loss of about 0.2 to 0.3.

Do you have any tips or comments about this? I couldn't find any paper or reported results on degraded performance with a batch size of 1.

Thank you!