Specifically, I am fine-tuning a ByT5 model on per-language subsets of TyDi-GoldP. I started out with PyTorch on an A100 GPU (80 GB), but I am now moving to a T4 GPU with roughly 16 GB.
On the A100, I used an input sequence length of 2048, an output length of 768, a learning rate of 1e-4 (Adafactor), a batch size of 6, and gradient accumulation over 4 steps.
To move to the T4, I switched to a batch size of 1 with gradient accumulation over 24 steps, so the effective batch size stays the same (6 × 4 = 1 × 24 = 24).
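As a sanity check on that arithmetic, here is a minimal framework-free sketch. It is not the actual ByT5 training loop: the one-parameter model, the toy data, and the helper names `mse_grad` and `accumulated_grad` are all hypothetical. It only illustrates that 6 × 4 and 1 × 24 give the same accumulated gradient when each micro-batch loss is a mean and is divided by the number of accumulation steps.

```python
# Framework-free sketch of gradient accumulation for the toy model y = w * x.
# Everything here (model, data, helper names) is made up for illustration.

def mse_grad(w, batch):
    """Gradient of mean((w*x - y)^2) over one micro-batch with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, micro_batch_size, accum_steps):
    """Sum micro-batch gradients, scaling each one by 1/accum_steps."""
    assert len(data) == micro_batch_size * accum_steps
    total = 0.0
    for step in range(accum_steps):
        start = step * micro_batch_size
        micro = data[start:start + micro_batch_size]
        # Dividing by accum_steps makes the sum equal the full-batch mean gradient.
        total += mse_grad(w, micro) / accum_steps
    return total

data = [(float(i), 3.0 * i + 1.0) for i in range(1, 25)]  # 24 toy examples
w = 0.5

grad_a100_style = accumulated_grad(w, data, micro_batch_size=6, accum_steps=4)
grad_t4_style = accumulated_grad(w, data, micro_batch_size=1, accum_steps=24)
grad_full = mse_grad(w, data)  # single batch of 24, for reference

print(grad_a100_style, grad_t4_style, grad_full)
```

The three values agree up to floating-point rounding. In a real sequence-to-sequence setup the match may only be approximate, for example if the loss is averaged over tokens and the sequences differ in length, so this sketch checks only the batch arithmetic.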
However, the results are far apart. With the A100 setup I got an F1 score of 60 on the English subset, but on the T4 I cannot get past an F1 score of 30. The loss on the A100 is about 0.01, while on the T4 the mean loss is around 0.2 to 0.3.
Does anyone have tips or comments on this? I couldn't find any paper or reported results on diverging results or poor performance with a batch size of 1.