Serverless Inference API

Is the Serverless Inference API basically my own LLM engine?

When it works stably, you could say so, but no one knows the conditions under which it stays stable, and there is no explanation or guideline anywhere. The only way to find out is to measure it yourself, like an experiment in a natural science class.
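As one way to do that measuring, here is a minimal sketch that probes the API for a minute and counts successes versus 429 (rate-limit) responses. The model id and token are placeholders, and whatever numbers you get are only a snapshot; they may well change tomorrow.

```python
# A minimal sketch for probing the Serverless Inference API yourself.
# The model id and token below are placeholders; actual limits are
# undocumented, so treat the measured numbers as a snapshot only.
import time
import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"  # any small model
HEADERS = {"Authorization": "Bearer hf_xxx"}  # your own token here

ok, rate_limited = 0, 0
start = time.time()
while time.time() - start < 60:  # probe for one minute
    r = requests.post(API_URL, headers=HEADERS, json={"inputs": "Hello"})
    if r.status_code == 200:
        ok += 1
    elif r.status_code == 429:  # rate limit reached
        rate_limited += 1
        time.sleep(5)  # back off before retrying
    else:
        print(f"unexpected status: {r.status_code} {r.text[:100]}")
    time.sleep(1)

print(f"succeeded: {ok}, rate-limited: {rate_limited} in one minute")
```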

And if I pay $10 a month to HuggingFace, I get 300 queries per hour?

The Pro subscription allows relatively stable, regular use of Llama 70B, for example, but again there is no numerical guide to exactly how much you can use it. Even if we did measure it, it might change tomorrow…
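For reference, calling a 70B model through the API with a Pro token looks roughly like the sketch below. The model id is just one example of a 70B model; whether a given model is actually served serverlessly at any moment is exactly the undocumented part.

```python
# A minimal sketch of calling a 70B model via the Serverless Inference API.
# The model id is illustrative; availability depends on what HF currently
# serves, and the token must belong to an account with access.
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    token="hf_xxx",  # your Pro account token
)

response = client.chat_completion(
    messages=[{"role": "user", "content": "Summarize what serverless inference is."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```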

In general, think of the Pro subscription as a service that is somewhat more comfortable, though to what extent no one knows; apparently the $20 Enterprise plan is much the same.
I'm also a subscriber, and the ZeroGPU Spaces are useful, though buggy.
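For context, a ZeroGPU Space attaches a GPU only while a decorated function runs. A minimal sketch, assuming the documented `spaces` package and an illustrative small model (this only runs inside a Space on ZeroGPU hardware):

```python
# A minimal ZeroGPU sketch: the GPU is allocated only for the duration
# of the decorated call. Model choice here is purely illustrative.
import spaces
import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="gpt2",
                torch_dtype=torch.float16, device="cuda")

@spaces.GPU  # GPU attached only while this function executes
def generate(prompt: str) -> str:
    return pipe(prompt, max_new_tokens=50)[0]["generated_text"]
```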

P.S.

If you have a question about ZeroGPU, there is a dedicated community on HF, so you can reliably ask there; for the Serverless Inference API, however, there is no stable place to ask.
There is a GitHub repository for extending its functionality, but questions about server-side limits are probably outside its maintainers' scope.