Hey @cdwyer1bod,
Thanks for opening the thread. Happy to help you.
Could you still share the full CloudWatch logs? Sometimes the errors are a bit hidden.
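I saw you changed the instance type from ml.p3dn.24xlarge to ml.p3.16xlarge but kept the same batch_size. This could be the issue: the V100 GPUs on ml.p3.16xlarge have 16 GB of memory each, half of the 32 GB on ml.p3dn.24xlarge, so the old batch size may no longer fit. Could you reduce the batch_size to 2 or change the instance type?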
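In case it helps, here is a rough sketch of the reasoning. It assumes memory use grows roughly linearly with batch size, which is only a rule of thumb, and the starting batch size of 4 is a made-up example since I don't know your actual value. The helper name `scale_batch_size` is mine, not part of any SDK.

```python
# Per-GPU memory (GB) for the two instance types mentioned above.
GPU_MEMORY_GB = {
    "ml.p3dn.24xlarge": 32,  # 8 x V100 32 GB
    "ml.p3.16xlarge": 16,    # 8 x V100 16 GB
}


def scale_batch_size(batch_size, old_instance, new_instance):
    """Scale a per-device batch size by the ratio of per-GPU memory.

    Rule of thumb only: assumes memory use is roughly linear in batch size.
    """
    ratio = GPU_MEMORY_GB[new_instance] / GPU_MEMORY_GB[old_instance]
    return max(1, int(batch_size * ratio))


# Hypothetical starting point: a batch size of 4 that fit on ml.p3dn.24xlarge.
old_batch_size = 4
new_batch_size = scale_batch_size(old_batch_size, "ml.p3dn.24xlarge", "ml.p3.16xlarge")

# The value you would then pass as the batch_size hyperparameter of the training job.
hyperparameters = {"batch_size": new_batch_size}
print(hyperparameters)
```

If the job still runs out of memory after that, the full logs should show where it fails.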