SageMaker instances do not restart after TGI container crashes

jtmtech · July 17, 2023, 3:10pm

I am running 3 instances of Falcon 40B on SageMaker. Three days ago all three instances failed, after what I assume were just a few prompts that caused them to crash (which seems like a separate issue). The logs look like this:

	2023-07-14T15:26:02.799-04:00	2023-07-14T19:26:02.673608Z ERROR text_generation_launcher: Shard 2 failed:
	2023-07-14T15:26:02.799-04:00	You are using a model of type RefinedWeb to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
	2023-07-14T15:26:02.799-04:00	[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=89504560, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 60916 milliseconds before timing out.
	2023-07-14T15:26:02.799-04:00	[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
	2023-07-14T15:26:02.799-04:00	[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
	2023-07-14T15:26:02.799-04:00	terminate called after throwing an instance of 'std::runtime_error' what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=89504560, OpType=ALLREDUCE, Timeout(ms)=60000) ran for 60916 milliseconds before timing out.
	2023-07-14T15:26:02.799-04:00	2023-07-14T19:26:02.673648Z  INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
	2023-07-14T15:26:07.516-04:00	2023-07-14T19:26:02.673788Z  INFO text_generation_router::server: router/src/server.rs:714: signal received, starting graceful shutdown
	2023-07-14T15:26:36.648-04:00	2023-07-14T19:26:36.426732Z  INFO text_generation_launcher: Webserver terminated
	2023-07-14T15:26:38.403-04:00	2023-07-14T19:26:36.426762Z  INFO text_generation_launcher: Shutting down shards
	2023-07-14T15:26:38.403-04:00	2023-07-14T19:26:38.395159Z  INFO text_generation_launcher: Shard 1 terminated
	2023-07-14T15:26:42.517-04:00	Error: ShardFailed

Nothing restarted after this. I'm unable to find anything that discusses how these instances are supposed to be reloaded. Currently the AWS console says there are 3 instances in service, but the TGI container is obviously not running. Are there specific conditions under which the containers will not be redeployed, or new instances will not be brought into service? I'm also wondering, given the size of the instances (12xlarge), whether this instance time is still being billed while the TGI container is not restarted, which seems like a big problem considering the costs.
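
For anyone who wants to compare what the console shows with what the API reports, here is a minimal sketch that summarizes the endpoint status and per-variant instance counts. It assumes a boto3 SageMaker client is passed in; the helper name `summarize_endpoint` and the endpoint name in the usage comment are hypothetical, and the fields read are the ones `describe_endpoint` returns.

```python
from typing import Any, Dict, List


def summarize_endpoint(client: Any, endpoint_name: str) -> Dict[str, Any]:
    """Return the endpoint status and instance counts for each variant.

    `client` is expected to behave like boto3.client("sagemaker").
    """
    resp = client.describe_endpoint(EndpointName=endpoint_name)
    variants: List[Dict[str, Any]] = []
    for variant in resp.get("ProductionVariants", []):
        variants.append({
            "name": variant.get("VariantName"),
            "current": variant.get("CurrentInstanceCount"),
            "desired": variant.get("DesiredInstanceCount"),
        })
    return {"status": resp.get("EndpointStatus"), "variants": variants}


# Usage (needs boto3 and AWS credentials; the endpoint name is hypothetical):
#   import boto3
#   client = boto3.client("sagemaker")
#   print(summarize_endpoint(client, "falcon-40b-endpoint"))
```

A result of `InService` with 3 current instances would only confirm what the console already shows. It would not by itself say whether the TGI process inside each container is still alive.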
