Inference API Rate Limits

Hello!

I’ve been reading the documentation and a few articles on the Inference API, and I’d like to confirm whether my understanding is correct and clarify a few remaining points.

:white_check_mark: My understanding so far:

  1. To use the Inference API with a pay-as-you-go model, I first need to subscribe to the Pro plan ($9/month). This includes $2 of usage credits.

  2. After exceeding the $2 monthly credit, each request is billed according to usage time × machine cost per second.

  3. For example, as stated in the docs:

    “A request to black-forest-labs/FLUX.1-dev that takes 10 seconds to complete on a GPU machine that costs $0.00012/second will be billed $0.0012.”

I’ve been testing the sentiment analysis model cardiffnlp/twitter-xlm-roberta-base-sentiment, and here’s what I observed:

  • Sending 200 comments as a batch takes ~3 seconds
  • Sending 1000 comments as a batch takes ~13.2 seconds
  • Since batching is possible, a single request containing an array of texts counts as one request, which is great! (A minimal sketch of how I’m batching is below.)
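
For reference, here is a minimal sketch of how I’m sending the batched requests. It assumes the serverless endpoint URL for this model and an `HF_TOKEN` environment variable; my actual script differs slightly:

```python
import os

import requests

# Serverless Inference API endpoint for the model I'm testing
API_URL = "https://api-inference.huggingface.co/models/cardiffnlp/twitter-xlm-roberta-base-sentiment"
HEADERS = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}  # assumes HF_TOKEN is set


def classify_batch(texts):
    # "inputs" accepts a list of strings, so the whole batch is one HTTP request
    response = requests.post(API_URL, headers=HEADERS, json={"inputs": texts})
    response.raise_for_status()
    return response.json()


comments = ["I love this!", "This is terrible...", "It's okay, I guess."]
print(classify_batch(comments))
```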

Based on the $0.00012/sec GPU rate:

Cost per comment = (0.00012 × 13.2) / 1000 ≈ $0.00000158
1 million comments ≈ $1.58
To hit the $2 credit limit → ~1.26 million comments
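
Here is the same arithmetic as a quick script, in case I’ve slipped somewhere. Note the rate is taken from the FLUX.1-dev example in the docs, so treating it as this model’s actual hardware rate is an assumption:

```python
GPU_RATE_PER_SEC = 0.00012  # USD/sec, from the FLUX.1-dev docs example (assumed here)
BATCH_SECONDS = 13.2        # measured for a 1000-comment batch
BATCH_SIZE = 1000
CREDIT = 2.00               # monthly Pro usage credit in USD

cost_per_comment = GPU_RATE_PER_SEC * BATCH_SECONDS / BATCH_SIZE
print(f"Cost per comment: ${cost_per_comment:.8f}")              # ~$0.00000158
print(f"1M comments:      ${cost_per_comment * 1_000_000:.2f}")  # ~$1.58
print(f"Comments per $2:  {CREDIT / cost_per_comment:,.0f}")     # ~1.26 million
```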

Can you please confirm if this interpretation is correct?


:red_question_mark: Additional questions:

  1. Rate limits
    I couldn’t find specific information on rate limits for the Inference API.

    • What are the request-per-second or concurrent request limits under the Pro plan?
    • Are there soft/hard limits or burst allowances?
  2. Machine cost per second

    • Where can I find a full pricing table for the types of hardware (CPU/GPU) used in the Inference API?
    • I’d like to estimate more precisely how much my use case would cost based on hardware type and request duration.

Thanks a lot in advance! Let me know if any part of my understanding needs correction.

Understanding 1, 2, 3

That’s probably correct.

Batching

Great! I didn’t know that…

Rate limits

The limits seem to change depending on the current situation, so there is no clear published information, but my personal impression is that they are relatively strict on the Free plan. Even on the Pro plan, they do not appear to be unlimited.

If you want unlimited usage, you will probably have to consider a Dedicated Endpoint.
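
In the meantime, client-side backoff is probably the safest way to cope. A minimal sketch, assuming the API signals throttling with HTTP 429 and a cold model with HTTP 503; the exact behavior isn’t clearly documented, so treat those status codes as assumptions:

```python
import time

import requests


def post_with_backoff(url, headers, payload, max_retries=5):
    delay = 1.0
    for _ in range(max_retries):
        response = requests.post(url, headers=headers, json=payload)
        if response.status_code not in (429, 503):  # not throttled / not loading
            response.raise_for_status()
            return response.json()
        # Honor Retry-After if the server sends one, otherwise back off exponentially
        wait = float(response.headers.get("Retry-After", delay))
        time.sleep(wait)
        delay *= 2
    raise RuntimeError(f"Still throttled after {max_retries} retries")
```

You would call this in place of a bare requests.post and tune max_retries to your tolerance.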

Machine cost per second

Could this be it…?

I have never seen any definitively correct information on this point.

When the Inference Provider is HF, is it safe to assume that which machine a given model is actually hosted on is fluid? @meganariley