Distributed Training on SageMaker

I can confirm that the t5/bart_summarization notebook runs as-is on an ml.p3dn.24xlarge. I remember that someone tried to run it on the smaller ml.p3.16xlarge and needed to decrease the batch_size as well (a sketch of how that could look is below).
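As a minimal sketch of what lowering the batch size could look like, assuming the notebook launches training through the SageMaker HuggingFace estimator and that the training script accepts the usual 🤗 Trainer arguments (the script name, model, versions, and hyperparameter values here are examples, not the notebook's actual settings):

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

# Assumes this runs inside a SageMaker notebook with an attached execution role.
role = sagemaker.get_execution_role()

# Hyperparameters forwarded to the training script; the names follow the
# 🤗 Trainer argument style used in the summarization examples.
hyperparameters = {
    "model_name_or_path": "facebook/bart-large-cnn",
    "per_device_train_batch_size": 2,   # lowered to fit on 16 GB V100s
    "gradient_accumulation_steps": 4,   # keep the effective batch size roughly the same
    "fp16": True,
}

huggingface_estimator = HuggingFace(
    entry_point="run_summarization.py",  # hypothetical script name
    source_dir="./scripts",
    instance_type="ml.p3dn.24xlarge",    # 8x V100 with 32 GB each
    instance_count=1,
    role=role,
    transformers_version="4.6",          # example versions, adjust to your setup
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters=hyperparameters,
    # SageMaker data parallel distribution config
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

huggingface_estimator.fit()
```

Pairing a smaller per_device_train_batch_size with gradient_accumulation_steps is the usual way to cut GPU memory without changing the effective batch size.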

The error you attached still shows a CUDA out-of-memory error:

[1,2]<stdout>:RuntimeError: CUDA out of memory. Tried to allocate 376.00 MiB (GPU 2; 15.78 GiB total capacity; 13.72 GiB already allocated; 249.75 MiB free; 14.04 GiB reserved in total by PyTorch)

Could you downsample your dataset a bit and try again, or run a few steps on an ml.p3dn.24xlarge?
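For the downsampling, a minimal sketch using the 🤗 datasets library could look like this (the dataset name and sample count are just placeholders, pick whatever your notebook actually loads):

```python
from datasets import load_dataset

# Take a small, shuffled slice of the training split before tokenization,
# just to confirm the job runs end to end on the smaller instance.
raw = load_dataset("cnn_dailymail", "3.0.0")
small_train = raw["train"].shuffle(seed=42).select(range(5000))
```

If the OOM disappears with the smaller slice and batch size, that points to the memory footprint rather than the distributed setup itself.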