Batch size vs gradient accumulation

So where does the difference in performance between training with gradient accumulation (GA) and training without it come from, as @sgugger mentioned in his answer?

I am not sure that it comes down to hardware alone.
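To make the comparison concrete, here is a minimal sketch of what I mean by "with GA" versus "without GA". It assumes a toy linear model with a mean-squared-error loss, written in plain Python; the data, the `mse_grad` helper, and the step counts are all made up for illustration and are not taken from any library. It only compares the gradient of one large batch against the accumulated gradient of several micro-batches, and says nothing about other models or about speed.

```python
# Hypothetical toy setup: linear model y = w * x + b with a mean-squared-error loss.
def mse_grad(w, b, xs, ys):
    """Return (dL/dw, dL/db) of the mean squared error over the given samples."""
    n = len(xs)
    gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb

# Made-up data and parameters, for illustration only.
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 0.0, 1.0]
ys = [1.0, -2.0, 4.5, 3.0, -1.5, 6.0, 0.5, 2.0]
w, b = 0.3, -0.1

# Without GA: one gradient over the full batch of 8 samples.
full_grad = mse_grad(w, b, xs, ys)

# With GA: 4 micro-batches of 2 samples, each gradient scaled by 1 / accum_steps
# and summed before a single (not shown) optimizer step.
accum_steps = 4
micro = len(xs) // accum_steps
acc_w, acc_b = 0.0, 0.0
for i in range(accum_steps):
    lo, hi = i * micro, (i + 1) * micro
    gw, gb = mse_grad(w, b, xs[lo:hi], ys[lo:hi])
    acc_w += gw / accum_steps
    acc_b += gb / accum_steps
accum_grad = (acc_w, acc_b)

print("full batch :", full_grad)
print("accumulated:", accum_grad)
```

In this toy case the two gradients agree up to floating-point rounding, because every micro-batch has the same size and the loss is a plain mean over samples. So whatever difference is being referred to would have to come from something this sketch does not cover.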
