CUDA out of memory when running on multiple GPUs

Hey guys,

I’m running some basic tests to prepare a fine-tuning job for mT5-large on AWS SageMaker. With a p3.2xlarge instance (1 Tesla V100 GPU) and the following settings, the job runs successfully:

per_device_train_batch_size=16
gradient_accumulation_steps=8
gradient_checkpointing=True
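
For reference, the relevant part of my training script looks roughly like this (I’m using the Trainer API; the output path is just a placeholder for SageMaker’s model directory, and anything not listed above is left at its default):

```python
from transformers import Seq2SeqTrainingArguments

# Settings that run fine on the single-GPU p3.2xlarge instance.
# output_dir is a placeholder pointing at SageMaker's standard model dir.
training_args = Seq2SeqTrainingArguments(
    output_dir="/opt/ml/model",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
)
```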

But if I use a p3.8xlarge instance (4 GPUs) with these settings:

per_device_train_batch_size=4
gradient_accumulation_steps=8
gradient_checkpointing=True

I run into a CUDA out-of-memory error. Any idea how I should adapt the parameters?
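
If my math is right, the effective batch size is the same in both setups, so I’d have expected the per-GPU memory use to go down, not up:

```python
# effective batch size = per_device_batch_size * gradient_accumulation_steps * num_gpus
effective_single = 16 * 8 * 1  # p3.2xlarge (1 GPU): 128
effective_multi = 4 * 8 * 4    # p3.8xlarge (4 GPUs): 128
print(effective_single, effective_multi)  # 128 128
```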