seems to work:
sometimes, due to weird implementation details i don't fully understand, grad accum adds a little bit of memory overhead (even though it shouldn't). so if bs_per_device=8, grad_accum=1 is already maxing out GPU memory, i think an OOM may show up once grad accum is turned on.

on the flip side, suppose you want the effective BS to be 16 and are running bs_per_device=8, grad_accum=2 (say 1 GPU only). it would be surprising if bs_per_device=4, grad_accum=4 OOMs, since grad_accum=4 shouldn't give much overhead over grad_accum=2.
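
for reference, a rough sketch of the arithmetic behind those settings, assuming the usual relation effective BS = bs_per_device x grad_accum x number of GPUs. the helper name `batch_configs` and its parameters are made up for illustration and don't come from any library. it only enumerates the combinations and says nothing about actual memory use:

```python
def batch_configs(effective_bs, num_gpus=1, max_bs_per_device=None):
    """List (bs_per_device, grad_accum) pairs that reach the target effective batch size.

    Assumes effective_bs = bs_per_device * grad_accum * num_gpus.
    Hypothetical helper for illustration only; it does not measure memory.
    """
    if effective_bs % num_gpus != 0:
        raise ValueError("effective_bs must be divisible by num_gpus")
    per_gpu_total = effective_bs // num_gpus  # samples each GPU must see per optimizer step

    configs = []
    # Walk from the largest per-device batch down to 1.
    for bs in range(per_gpu_total, 0, -1):
        if per_gpu_total % bs != 0:
            continue  # grad_accum has to be a whole number
        if max_bs_per_device is not None and bs > max_bs_per_device:
            continue  # skip per-device batches assumed not to fit in memory
        configs.append((bs, per_gpu_total // bs))
    return configs


# Effective BS 16 on 1 GPU, assuming bs_per_device=8 is the most that fits.
print(batch_configs(16, num_gpus=1, max_bs_per_device=8))
# [(8, 2), (4, 4), (2, 8), (1, 16)]
```

if the first pair OOMs, the next one down the list (smaller per-device batch, more accumulation steps) is the natural thing to try.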