Inference Endpoint not stable

Hi. I have a model running on an AWS T4 instance. I have scale-to-zero set to never and autoscaling to 2, so I was expecting to be able to provide a service that is up 24/7, but sadly that is not the case. From time to time, the instance just dies and I get HTTP 503 (Service Unavailable) responses. The only fix is to restart the instance manually.

This is of course a deal breaker for someone who wants to run a stable service.

Has anyone else experienced this?


Could it be from errors running the model? I have experienced this with oobabooga crashing when you try to load models it did not like, such as multimodal ones. If it's a single-model inference endpoint, I would check the logs. In any case, T-series instances are pretty low-powered; I think you should try a g4 or g5 to start with. Also, the T series does not have any GPUs, and some models require one.

Instance instability despite that setup is odd. Have you checked the logs, or can you post some so we can help? I also always like to set up some automated monitoring, so when shit hits the fan it tells me what's going on and pinpoints it. I'd guess it has to be a CPU, GPU, or memory issue, given the error you're getting.
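For what it's worth, the monitoring I mean can be very simple. A minimal sketch, assuming you wrap your actual HTTP call to the endpoint (URL and alerting hook are placeholders you'd supply yourself):

```python
import time


def probe_endpoint(send_request, on_alert, interval_s=30, max_checks=3):
    """Poll a health probe and alert on persistent 503s.

    send_request: callable returning an HTTP status code (e.g. a
        requests.get wrapper around your endpoint URL -- placeholder).
    on_alert: callable invoked with a message when failures persist
        (e.g. send an email or a Slack webhook -- placeholder).
    """
    failures = 0
    for _ in range(max_checks):
        status = send_request()
        if status == 503:
            failures += 1
            # Two consecutive 503s is more than a blip -- raise an alert.
            if failures >= 2:
                on_alert(f"endpoint unhealthy: {failures} consecutive 503s")
        else:
            failures = 0
        time.sleep(interval_s)
```

Run it from a cron job or a small sidecar process; the timestamps of the alerts alone will tell you whether the outages correlate with load or happen at random.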

Having a similar problem. The endpoint mostly works fine - but several times daily we encounter this issue.

Our instance is configured as:

  • AWS eu-west-1
  • Intel Ice Lake 8 vCPU (no GPU)

Some additional details:

  • The 503 Service Unavailable responses last for about ten seconds and then the endpoint recovers.
  • The endpoint shows as active while erroring, and there are no errors in the logs.
  • We are using a custom Docker image that works fine on every other hosting service we have tried.
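Since the 503 windows only last about ten seconds, a client-side retry with backoff can paper over them while the underlying issue is investigated. A minimal sketch, where `send_request` is a stand-in for your actual HTTP call:

```python
import time


def call_with_retry(send_request, retries=4, base_delay_s=2.0):
    """Retry a request on 503, with exponential backoff between attempts.

    send_request: callable returning (status_code, body).
    With the defaults, the total wait is 2 + 4 + 8 = 14 seconds across
    four attempts -- enough to outlast a ~10-second 503 window.
    """
    delay = base_delay_s
    for attempt in range(retries):
        status, body = send_request()
        if status != 503:
            return status, body
        if attempt < retries - 1:
            time.sleep(delay)
            delay *= 2  # exponential backoff
    return status, body
```

This is a workaround, not a fix; it just keeps callers from seeing the blips while the root cause is chased down.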

We would appreciate a fix - HF Inference Endpoints are very developer-friendly and we would like to keep using them.

EDIT: Here is an analytics overview of a completely fresh instance and some requests. The upticks in 503 status errors should be pretty obvious.