Hello!
I’ve been reading the documentation and a few articles on the Inference API, and I’d like to confirm if my understanding is correct — and also clarify a few remaining points.
My understanding so far:
- To use the Inference API with a pay-as-you-go model, I first need to subscribe to the Pro plan ($9/month), which includes $2 of usage credits.
- After exceeding the $2 monthly credit, each request is billed according to usage time × machine cost per second.
- For example, as stated in the docs: “A request to black-forest-labs/FLUX.1-dev that takes 10 seconds to complete on a GPU machine that costs $0.00012/second will be billed $0.0012.”
I’ve been testing the sentiment analysis model cardiffnlp/twitter-xlm-roberta-base-sentiment, and here’s what I observed (a sketch of my batching setup follows this list):
- Sending 200 comments as a batch takes ~3 seconds
- Sending 1000 comments as a batch takes ~13.2 seconds
- Since batching is possible, a single request containing an array of texts counts as one request, which is great!
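
For reference, here’s roughly how I’m sending the batches. This is a minimal sketch: the endpoint URL and `{"inputs": [...]}` payload follow the standard Inference API pattern, and `HF_TOKEN` is just my own environment variable name for the access token.

```python
import os
import requests

# Endpoint for the model I'm testing; payload shape follows the
# standard Inference API pattern ({"inputs": [...]}).
API_URL = "https://api-inference.huggingface.co/models/cardiffnlp/twitter-xlm-roberta-base-sentiment"
# HF_TOKEN is my own environment variable name for the access token.
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

comments = ["I love this!", "This is terrible.", "It's fine, I guess."]

# One POST with a list of inputs -- this is what I'm counting as a single request.
response = requests.post(API_URL, headers=headers, json={"inputs": comments})
response.raise_for_status()

# The API returns one list of {label, score} dicts per input text.
for comment, scores in zip(comments, response.json()):
    top = max(scores, key=lambda s: s["score"])
    print(f"{top['label']:>8}  {top['score']:.3f}  {comment}")
```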
Based on the $0.00012/sec GPU rate:
- Cost per comment = (0.00012 × 13.2) / 1000 ≈ $0.00000158
- 1 million comments ≈ $1.58
- To hit the $2 credit limit → ~1.26 million comments
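
To double-check the arithmetic, here is the same estimate in a few lines of Python; the only assumption is that the $0.00012/sec GPU rate quoted in the docs applies to this model.

```python
# Reproducing my estimate; assumes the $0.00012/sec GPU rate from the
# docs applies to cardiffnlp/twitter-xlm-roberta-base-sentiment.
gpu_rate = 0.00012        # $ per second of GPU time
batch_seconds = 13.2      # observed time for a 1000-comment batch
batch_size = 1000
monthly_credit = 2.00     # $ of usage included with the Pro plan

cost_per_comment = gpu_rate * batch_seconds / batch_size
print(f"per comment: ${cost_per_comment:.8f}")                  # ~$0.00000158
print(f"per 1M comments: ${cost_per_comment * 1_000_000:.2f}")  # ~$1.58
print(f"comments within credit: {monthly_credit / cost_per_comment:,.0f}")  # ~1.26M
```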
Can you please confirm if this interpretation is correct?
Additional questions:
- Rate limits
  - I couldn’t find specific information on rate limits for the Inference API. What are the request-per-second or concurrent request limits under the Pro plan?
  - Are there soft/hard limits or burst allowances?
- Machine cost per second
  - Where can I find a full pricing table for the types of hardware (CPU/GPU) used in the Inference API?
  - I’d like to estimate more precisely how much my use case would cost based on hardware type and request duration.
Thanks a lot in advance! Let me know if any part of my understanding needs correction.