How is it possible to get GPU memory errors when increasing the gradient_accumulation steps?

This explanation seems to cover it:

Sometimes, due to odd implementation details, gradient accumulation can add a small amount of memory overhead, even though in principle it shouldn't. So if bs_per_device=8, grad_accum=1 is already maxing out GPU memory, an OOM with grad_accum > 1 is plausible.

On the flip side, suppose you want an effective batch size of 16 with bs_per_device=8, grad_accum=2 (on a single GPU). It would be surprising if bs_per_device=4, grad_accum=4 OOMed, since grad_accum=4 adds very little overhead over grad_accum=2 while halving the per-device batch.
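For reference, here is a minimal PyTorch-style sketch of what gradient accumulation does (the model, data, and hyperparameters are placeholders, not anything from the thread). The point it illustrates is that activation memory scales with bs_per_device, not with the effective batch size, so trading a larger bs_per_device for more grad_accum steps should roughly hold memory constant apart from the small overhead discussed above:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical setup: any model, optimizer, and loss work the same way.
model = torch.nn.Linear(512, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

bs_per_device = 4  # micro-batch size held in GPU memory at once
grad_accum = 4     # micro-batches accumulated per optimizer step
# effective batch size = bs_per_device * grad_accum = 16

# Dummy micro-batches standing in for a dataloader.
batches = [
    (torch.randn(bs_per_device, 512, device=device),
     torch.randint(0, 10, (bs_per_device,), device=device))
    for _ in range(grad_accum)
]

optimizer.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = loss_fn(model(x), y)
    # Scale the loss so the accumulated gradient matches what a
    # single batch of size bs_per_device * grad_accum would produce.
    (loss / grad_accum).backward()
    if (step + 1) % grad_accum == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Only one micro-batch's activations live on the GPU at a time; the gradients accumulate in the existing .grad buffers, which is why grad_accum=4 costs little more than grad_accum=2.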