Last year, I was using T4 (small and medium) instances on Hugging Face to fine-tune models on some datasets. Back then, I remember a T4 instance taking maybe 10 minutes to start up.
Today, I tried doing the same in my Space. I tried starting both a T4 Medium and a T4 Small instance, but either one now takes nearly an hour to start up.
Or, after 30 minutes I’ll get this error:
runtime error
Scheduling failure: unable to schedule
I haven’t seen anything related on the status page today.
Is Hugging Face just resource constrained right now due to all of the recent DeepSeek and other releases this past week? Or is there something else going on?
Is there a page that I can check to know the current load on Hugging Face?
This is occurring for me on A100 instances as well, so it’s not limited to specific machines.
I’m receiving the same “Scheduling failure: unable to schedule” error after my endpoints have been stuck in the Initializing state for quite a few minutes. The behavior is the same on multiple endpoints today.
I’m continuing to retry and hoping to hear some guidance; otherwise I’ll need to shop around.
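For anyone else retrying by hand, here is a rough sketch of how the polling could be automated with exponential backoff instead of refreshing the page. It is only an illustration: `wait_until_running`, its parameters, and the status strings `"running"` and `"failed"` are my own assumptions, not something confirmed by Hugging Face, and `fetch_status` is a placeholder for whatever call returns the current state of your Space or endpoint.

```python
import time
from typing import Callable, Iterable


def wait_until_running(
    fetch_status: Callable[[], str],
    ready: Iterable[str] = ("running",),   # assumed name of the "up" state
    failed: Iterable[str] = ("failed",),   # assumed name of the "gave up" state
    max_wait_s: float = 1800.0,            # stop after 30 minutes by default
    first_delay_s: float = 10.0,
    max_delay_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() with exponential backoff.

    Returns the first status found in `ready` or `failed`.
    Raises TimeoutError if neither shows up within max_wait_s.
    """
    ready_set, failed_set = set(ready), set(failed)
    waited, delay = 0.0, first_delay_s
    while True:
        status = fetch_status()
        if status in ready_set or status in failed_set:
            return status
        if waited >= max_wait_s:
            raise TimeoutError(f"still {status!r} after {waited:.0f}s")
        # Never sleep past the overall deadline.
        step = min(delay, max_wait_s - waited)
        sleep(step)
        waited += step
        # Double the delay each round, up to the cap.
        delay = min(delay * 2, max_delay_s)
```

With the `huggingface_hub` client, I believe `fetch_status` could be something like `lambda: get_inference_endpoint("my-endpoint").status`, where `"my-endpoint"` is a made-up name; check the status values your endpoint actually reports before relying on the defaults above. Backing off like this at least avoids hammering the API while the scheduler is struggling.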
Looking at the symptoms, it seems that this case and the series of problems below are connected. If so, once any one of them is fixed, the rest should be fixed too.
I’m confirming that my AWS us-east-1 endpoint is now running. I was able to kick it off this morning, and it initialized in the more typical timeframe I was accustomed to.
If anyone else had similar issues on AWS machines, they may have been caused by the same provisioning problem.