Hey guys,
I’m running some basic tests to prepare a fine-tuning job for mT5-large on AWS SageMaker. With a p3.2xlarge
instance (1 Tesla V100 GPU) and the following settings, the job runs successfully:
per_device_train_batch_size=16
gradient_accumulation_steps=8
gradient_checkpointing=True
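In case it helps, this is roughly how the working run is configured (a minimal sketch, assuming the Hugging Face `Seq2SeqTrainingArguments`; `output_dir` is a placeholder and everything else is left at its default):

```python
from transformers import Seq2SeqTrainingArguments

# Working configuration on p3.2xlarge (1x V100, 16 GB)
training_args = Seq2SeqTrainingArguments(
    output_dir="mt5-large-finetune",  # placeholder path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=8,    # effective batch size: 16 * 8 = 128
    gradient_checkpointing=True,      # trade extra compute for lower memory
)
```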
But if I use a p3.8xlarge instance
(4 GPUs) with the settings
per_device_train_batch_size=4
gradient_accumulation_steps=8
gradient_checkpointing=True
I run into a CUDA out-of-memory error. Any idea how I should adapt the parameters?
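For completeness, the failing run would be the same sketch with only the batch size changed. Note that the effective global batch size is unchanged (4 per GPU × 4 GPUs × 8 accumulation steps = 128), and the per-GPU batch is actually smaller than in the working run, which is what confuses me:

```python
# Failing configuration on p3.8xlarge (4x V100, 16 GB each)
training_args = Seq2SeqTrainingArguments(
    output_dir="mt5-large-finetune",  # placeholder path
    per_device_train_batch_size=4,    # per GPU: 4 * 4 GPUs = 16 samples per step
    gradient_accumulation_steps=8,    # effective batch size: 4 * 4 * 8 = 128
    gradient_checkpointing=True,
)
```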