Hi,
I am trying to deploy a large model (25 GB). Every time I start up an endpoint, it gets killed by SageMaker for failing the ping health check:
The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.
The CloudWatch logs show the container is still downloading the model; it is usually around 80% of the way through the largest file when SageMaker decides the endpoint is unhealthy. The download simply takes longer than the health checker allows.
I pulled the image and ran it in local Docker, and verified that it only starts responding to /ping once the model has been loaded.
What are my options?
- Is it possible to disable the health check on SageMaker?
- Is it possible to configure the timeout of the health check on SageMaker?
- Do I need to build my own image with a patched version of sagemaker_huggingface_inference_toolkit that runs a rudimentary HTTP server while the model is loading, and then figure out how to run that instead of the huggingface-pytorch-inference image?
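For context, the third option I have in mind is roughly the following sketch: a tiny HTTP server that answers /ping with 200 immediately while the model loads in a background thread, and reports "not ready" for inference until loading finishes. All names here are my own placeholders, not actual toolkit code, and the sleep stands in for the real multi-minute download:

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

model = None  # set once the (slow) load completes

def load_model():
    # Placeholder for the real model download/load, which takes minutes.
    global model
    time.sleep(0.1)
    model = object()

class PingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ping":
            # Report healthy even while the model is still loading,
            # so SageMaker's health checker does not kill the container.
            self.send_response(200)
            self.end_headers()
        elif model is None:
            # Any other request before the model is ready: ask callers to retry.
            self.send_response(503)
            self.end_headers()
        else:
            # In a real container this would hand off to the actual
            # inference server (e.g. /invocations) once loading is done.
            self.send_response(200)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet

def serve(port=8080):
    # Kick off the slow load in the background, then start answering pings.
    threading.Thread(target=load_model, daemon=True).start()
    server = HTTPServer(("127.0.0.1", port), PingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The open question for me is how to wire something like this in front of (or instead of) the toolkit's own server without forking the whole image.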
I’d be thankful if someone could share their experience with this.