The documentation is rather vague about the limits of the Free Inference API, and similarly vague about what subscribing to a ‘Pro’ account would change in those limits.
Could somebody share, from their own experience, what the limits of the Inference API are? In particular:
- Does moving to Pro change the limit on the model size that can be used? (Free has a limit of 10 GB.)
- Are there any hourly / monthly character (or token?) limits for queries or responses?
- Is there any rate limiting (requests per minute)?
- Does Pro change anything regarding the time until a model is loaded / unloaded?
Bonus question:
- Is there a way to use quantized models with the Free Inference API?
Thanks!