The documentation is rather vague on the limits of the Free Inference API, and similarly vague about what subscribing to a ‘Pro’ account would change in those limits.
Could somebody comment from their experience on what the limits of the Inference API are? In particular:
Does moving to Pro change the limit for the model size which can be used? (Free has a limit of 10GB)
Are there any hourly / monthly character (or token?) limits for queries or responses?
Is there any rate limiting (requests per minute)?
Does Pro change anything regarding the time until a model is loaded / unloaded?
Bonus question:
Is there a way to use quantized models with Free Inference API?
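Since the rate limits aren't documented, one way to cope with them empirically is to retry on HTTP 429 with exponential backoff. A minimal sketch (the helper name and parameters are my own, not an official API; the callable would typically wrap a `requests.post` to `https://api-inference.huggingface.co/models/<model-id>`):

```python
import time


def call_with_backoff(send_request, max_retries=5, base_delay=1.0):
    """Retry a request while the server answers HTTP 429 (rate limited).

    `send_request` is any zero-argument callable returning an object with
    a `status_code` attribute (e.g. a bound `requests.post` call). Waits
    double after each 429; the last response is returned either way.
    """
    resp = None
    for attempt in range(max_retries):
        resp = send_request()
        if resp.status_code != 429:
            return resp
        # Back off: 1s, 2s, 4s, ... before retrying.
        time.sleep(base_delay * (2 ** attempt))
    return resp
```

Logging how many calls succeed before the first 429 appears is also a cheap way to measure the effective requests-per-minute limit on your own account.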
Hi @GuusBouwensNL, could you tell us what API rate limit you are able to achieve with the Pro plan?
After how many requests do you hit the rate limit?
It is still vague and fluid, even for Pro account users.
With Pro status the restrictions are relaxed, but the exact numbers aren’t published anywhere I know of. That’s all I can say.
Plus, unlike in 2023, even the smaller models don’t perform well enough now. The Free Serverless Inference API and the widgets are virtually obsolete for all but the very best models.
Is there a way to use quantized models with Free Inference API?
So far, this is still not possible. Now that quantization is becoming more and more commonplace, and there is no particular reason not to support it, HF will have to deal with it eventually…
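For background (this is a toy illustration, not how HF or any real library implements it): quantization just stores weights at reduced precision plus a scale factor, which is why quantized checkpoints fit under size limits that full-precision ones exceed. A minimal symmetric int8 round-trip:

```python
def quantize_int8(values):
    """Toy symmetric int8 quantization: map floats onto [-127, 127].

    Real schemes (GPTQ, bitsandbytes, GGUF) work block-wise and are far
    more sophisticated; this only shows the basic idea.
    """
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid scale == 0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from int8 codes and the stored scale."""
    return [q * scale for q in quantized]
```

Each value is stored in one byte instead of four (float32), so a model’s weight file shrinks roughly 4x, at the cost of a small per-value rounding error bounded by the scale.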