I have an inference endpoint that I have been using for quite some time now. I normally let it scale to 0 when not using it. However today I am unable to get it to start at all. I keep getting the following error:
[Server message]Endpoint failed to start. Scheduling failure: not enough hardware capacity
The model I am running is meta-llama/Llama-2-13b-chat-hf, on a: GPU [medium] · 1x Nvidia A10G
I’ve tried reaching out to Hugging Face to ask whether it’s possible to reserve a GPU instance or something similar, because I need it right now for a product launch.
Does anyone know of another way to get something like this up and running, or alternatively how long these periods of unavailable hardware typically last?
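In the meantime, I'm considering just retrying the resume on a backoff loop until capacity frees up. A minimal sketch of what I mean is below; the huggingface_hub calls in the comment are my assumption about the client API, and the endpoint name is made up:

```python
import time

def retry_with_backoff(fn, attempts=6, base_delay=30.0):
    """Call fn() until it succeeds, sleeping base_delay * 2**i between failures."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts, surface the last error
            time.sleep(base_delay * 2 ** i)

# Hypothetical usage (API names assumed, endpoint name is a placeholder):
# from huggingface_hub import get_inference_endpoint
# endpoint = get_inference_endpoint("my-llama2-endpoint")
# retry_with_backoff(lambda: endpoint.resume())
```

This obviously doesn't solve the capacity problem itself, but it at least grabs a slot as soon as one opens up instead of me retrying by hand.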