Distributed Training run_summarization.py

Hey @cdwyer1bod,

Thanks for opening the thread, happy to help.
Could you still share the full CloudWatch logs? Sometimes the errors are a bit hidden.

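I saw you changed the instance type from ml.p3dn.24xlarge to ml.p3.16xlarge and kept the batch_size; this could be the issue. The ml.p3.16xlarge has 16 GB of memory per GPU, compared with 32 GB on the ml.p3dn.24xlarge, so the same batch size may no longer fit. Could you reduce the batch_size to 2 or change the instance type?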
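
To make the suggestion concrete, here is a rough sketch of how the per-device batch size could be scaled when moving between the two instance types. The helper `scaled_batch_size` is hypothetical, and the starting batch size of 4 is only an assumption for illustration since I don't know your exact value. `per_device_train_batch_size` is the argument run_summarization.py takes for this. Memory per GPU is 32 GB on ml.p3dn.24xlarge and 16 GB on ml.p3.16xlarge.

```python
# Per-GPU memory in GB for the two instance types mentioned above (V100 GPUs).
GPU_MEMORY_GB = {
    "ml.p3dn.24xlarge": 32,
    "ml.p3.16xlarge": 16,
}


def scaled_batch_size(batch_size, old_instance, new_instance):
    """Hypothetical helper: scale a per-device batch size by the GPU memory ratio.

    This is only a rule of thumb; actual memory use also depends on the model,
    sequence lengths, and mixed-precision settings.
    """
    ratio = GPU_MEMORY_GB[new_instance] / GPU_MEMORY_GB[old_instance]
    return max(1, int(batch_size * ratio))


# Assumed starting value of 4 (illustrative, not taken from the thread).
hyperparameters = {
    "per_device_train_batch_size": scaled_batch_size(
        4, "ml.p3dn.24xlarge", "ml.p3.16xlarge"
    ),
}
```

The resulting dictionary entry could then be merged into the hyperparameters you already pass to the training job. If the job still fails with the smaller value, the CloudWatch logs should show whether it is really a memory problem.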