Runtime error Launch timed out, workload was not healthy after 30 min

Hello,

I’ve built a custom Dockerfile that installs an axolotl environment. After that environment is setup, I have it run a “train.sh” bash script. This script runs axolotl to fine-tune llama2-13b-hf on a 2xA10G instance with a custom dataset, stored on HF. The installation goes fine. I can alter the Dockerfile to execute a bash shell at the end, login to it and run the train.sh manually and that works fine. However, if I have the Dockerfile run “train.sh”, it stops after 30 minutes with the following message:

Runtime error

Launch timed out, workload was not healthy after 30 min

Any advice?