SageMaker instances do not restart after TGI container crashes

I am running 3 instances of Falcon 40B on SageMaker. Three days ago all three instances failed, I assume after a handful of prompts that caused them to crash (which seems like a separate issue). The logs look like this:

2023-07-14T15:26:02.799-04:00 2023-07-14T19:26:02.673608Z ERROR text_generation_launcher: Shard 2 failed:
2023-07-14T15:26:02.799-04:00 You are using a model of type RefinedWeb to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
2023-07-14T15:26:02.799-04:00 [E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=89504560, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 60916 milliseconds before timing out.
2023-07-14T15:26:02.799-04:00 [E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
2023-07-14T15:26:02.799-04:00 [E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
2023-07-14T15:26:02.799-04:00 terminate called after throwing an instance of 'std::runtime_error' what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=89504560, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 60916 milliseconds before timing out.
2023-07-14T15:26:02.799-04:00 2023-07-14T19:26:02.673648Z  INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
2023-07-14T15:26:07.516-04:00 2023-07-14T19:26:02.673788Z  INFO text_generation_router::server: router/src/server.rs:714: signal received, starting graceful shutdown
2023-07-14T15:26:36.648-04:00 2023-07-14T19:26:36.426732Z  INFO text_generation_launcher: Webserver terminated
2023-07-14T15:26:38.403-04:00 2023-07-14T19:26:36.426762Z  INFO text_generation_launcher: Shutting down shards
2023-07-14T15:26:38.403-04:00 2023-07-14T19:26:38.395159Z  INFO text_generation_launcher: Shard 1 terminated
2023-07-14T15:26:42.517-04:00 Error: ShardFailed

Nothing restarted after this. I’m unable to find anything that discusses how these instances are supposed to be reloaded. The AWS console currently says there are 3 instances in service, but the TGI container is obviously not running. Are there specific conditions under which the containers will not be redeployed or new instances will not be brought into service? I’m also wondering, given the size of the instances (12xlarge), whether this instance time is being billed while the TGI container is not running, which seems like a big problem considering the costs.
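
For anyone who wants to compare against what the console reports, here is a minimal sketch of reading the same endpoint state with boto3. The endpoint name is a placeholder, and the helper names (summarize_endpoint, describe) are made up for illustration. It assumes AWS credentials and a region are already configured. Note that this only shows what SageMaker believes about the endpoint, not whether the TGI process inside the container is actually alive.

```python
from typing import Any, Dict, List


def summarize_endpoint(desc: Dict[str, Any]) -> List[str]:
    """Turn a DescribeEndpoint response dict into short readable lines."""
    lines = [f"endpoint status: {desc.get('EndpointStatus', 'unknown')}"]
    for variant in desc.get("ProductionVariants", []):
        name = variant.get("VariantName", "?")
        current = variant.get("CurrentInstanceCount")
        desired = variant.get("DesiredInstanceCount")
        lines.append(
            f"variant {name}: {current} current / {desired} desired instances"
        )
    return lines


def describe(endpoint_name: str) -> List[str]:
    """Fetch the endpoint description from SageMaker and summarize it."""
    # Imported lazily so summarize_endpoint can be used without boto3 installed.
    import boto3

    client = boto3.client("sagemaker")
    return summarize_endpoint(client.describe_endpoint(EndpointName=endpoint_name))
```

Calling describe("my-falcon-endpoint") (placeholder name) prints nothing by itself; it returns the lines so they can be logged or compared against the console.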