Loading a large model - endpoint gets killed by ping health check

Hi,

I am trying to deploy a large model (25 GB). Every time I start up an endpoint, it gets killed by SageMaker for not passing the ping health check.

The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.

The log shows that the container is still downloading the model; it is usually about 80% of the way through the largest file when SageMaker decides the endpoint is unhealthy. The download simply takes longer than the health checker allows.

I pulled the image and ran it locally in Docker, and verified that it only starts responding to /ping once the model has been loaded.
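
Roughly what that local check looked like (a minimal sketch; it assumes the container is already running locally with port 8080 published):

```python
import time
import requests

# SageMaker inference containers answer health checks on /ping (port 8080).
# Locally, /ping only starts returning 200 once the 25 GB model has finished loading.
while True:
    try:
        status = requests.get("http://localhost:8080/ping", timeout=2).status_code
    except requests.RequestException:
        status = None
    print(time.strftime("%H:%M:%S"), "/ping ->", status)
    if status == 200:
        break
    time.sleep(10)
```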

What are my options?

  • Is it possible to disable the health check on SageMaker?
  • Is it possible to configure the timeout of the health check on SageMaker?
  • Do I need to build my own image with a hacked version of sagemaker_huggingface_inference_toolkit that runs a rudimentary HTTP server while the model is loading, and then figure out how to run that instead of the huggingface-pytorch-inference image?
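
For context, this is roughly what my deployment call looks like (a minimal sketch; the S3 URI, role, framework versions, and instance type are placeholders, not my real values). The two commented-out timeout kwargs are what I'm hoping answer the second bullet, assuming the installed SageMaker Python SDK version supports them:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # placeholder: any role with SageMaker + S3 access

# Placeholder model artifact (~25 GB model.tar.gz in S3) and framework versions.
model = HuggingFaceModel(
    model_data="s3://my-bucket/my-large-model/model.tar.gz",
    role=role,
    transformers_version="4.26.0",
    pytorch_version="1.13.1",
    py_version="py39",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # placeholder instance type
    # If the SDK version in use supports them, these are the knobs I'm asking about:
    # container_startup_health_check_timeout=600,  # seconds allowed to pass /ping
    # model_data_download_timeout=1800,            # seconds allowed to pull the model from S3
)
```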

I’d be thankful if someone could share their experience with this.

I’m having the same issue. Any solutions?

Same here:

UnexpectedStatusException: Error hosting endpoint huggingface-pytorch-inference-2023-08-01-14-29-42-558: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.

Same here.

UnexpectedStatusException: Error hosting endpoint huggingface-pytorch-tgi-inference-2023-08-31-09-39-06-613: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint…

CloudWatch: RuntimeError: weight lm_head.weight does not exist

@philschmid penny for your thoughts!
