CUDA out of memory when running on multiple GPUs

Hey guys,

I’m running some basic tests to prepare a fine-tuning job for mT5-large on AWS SageMaker. With a p3.2xlarge instance (1 Tesla V100 GPU) and the following settings, the job runs successfully:

per_device_train_batch_size=16
gradient_accumulation_steps=8
gradient_checkpointing=True
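
For reference, the relevant part of my training script looks roughly like this (I’m using the Trainer API; the output path is just a placeholder for SageMaker’s model directory, and anything not listed above is left at its default):

```python
from transformers import Seq2SeqTrainingArguments

# Settings that run fine on the single-GPU p3.2xlarge instance.
# output_dir is a placeholder pointing at SageMaker's standard model dir.
training_args = Seq2SeqTrainingArguments(
    output_dir="/opt/ml/model",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
)
```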

But if I use a p3.8xlarge instance (4 GPUs) with these settings:

per_device_train_batch_size=4
gradient_accumulation_steps=8
gradient_checkpointing=True

I run into a CUDA out-of-memory error. Any idea how I should adapt the parameters?
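
If my math is right, the effective batch size is the same in both setups, so I’d have expected the per-GPU memory use to go down, not up:

```python
# effective batch size = per_device_batch_size * gradient_accumulation_steps * num_gpus
effective_single = 16 * 8 * 1  # p3.2xlarge (1 GPU): 128
effective_multi = 4 * 8 * 4    # p3.8xlarge (4 GPUs): 128
print(effective_single, effective_multi)  # 128 128
```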