How is it possible to get GPU memory errors when increasing the gradient_accumulation steps?

This explanation seems to cover it:

Sometimes, due to odd implementation details, gradient accumulation can add a small amount of memory overhead, even though in principle it shouldn't. So if bs_per_device=8, grad_accum=1 is already maxing out GPU memory, an OOM with grad_accum > 1 is plausible.

On the flip side, suppose you want an effective batch size of 16 with bs_per_device=8, grad_accum=2 (on a single GPU). It would be surprising if bs_per_device=4, grad_accum=4 OOMed, since grad_accum=4 adds very little overhead over grad_accum=2 while halving the per-device batch.
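For reference, here is a minimal PyTorch-style sketch of what gradient accumulation does (the model, data, and hyperparameters are placeholders, not anything from the thread). The point it illustrates is that activation memory scales with bs_per_device, not with the effective batch size, so trading a larger bs_per_device for more grad_accum steps should roughly hold memory constant apart from the small overhead discussed above:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical setup: any model, optimizer, and loss work the same way.
model = torch.nn.Linear(512, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

bs_per_device = 4  # micro-batch size held in GPU memory at once
grad_accum = 4     # micro-batches accumulated per optimizer step
# effective batch size = bs_per_device * grad_accum = 16

# Dummy micro-batches standing in for a dataloader.
batches = [
    (torch.randn(bs_per_device, 512, device=device),
     torch.randint(0, 10, (bs_per_device,), device=device))
    for _ in range(grad_accum)
]

optimizer.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = loss_fn(model(x), y)
    # Scale the loss so the accumulated gradient matches what a
    # single batch of size bs_per_device * grad_accum would produce.
    (loss / grad_accum).backward()
    if (step + 1) % grad_accum == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Only one micro-batch's activations live on the GPU at a time; the gradients accumulate in the existing .grad buffers, which is why grad_accum=4 costs little more than grad_accum=2.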