Hi @priyanshu26, the free Inference API (serverless) is our solution for easily exploring and evaluating models, and it is subject to rate limiting. We don't publish exact rate limit numbers because they vary with overall traffic volume. PRO / Enterprise organization accounts will get priority.
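Since the limits aren't published, the practical approach is to handle HTTP 429 responses gracefully. Here is a minimal sketch of calling the serverless Inference API with exponential backoff; the model ID and the `hf_xxx` token are placeholders you'd swap for your own:

```python
import time
import requests

# Serverless Inference API URL pattern; "gpt2" is just an example model.
API_URL = "https://api-inference.huggingface.co/models/gpt2"
HEADERS = {"Authorization": "Bearer hf_xxx"}  # your User Access Token

def query(payload, max_retries=5):
    """POST to the serverless API, backing off when rate-limited (HTTP 429)."""
    for attempt in range(max_retries):
        response = requests.post(API_URL, headers=HEADERS, json=payload)
        if response.status_code == 429:
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... before retrying
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError("Still rate-limited after retries")

print(query({"inputs": "The answer to life is"}))
```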
For larger request volumes, or if you need guaranteed latency/performance, we recommend Inference Endpoints (dedicated) to easily deploy your models on dedicated, fully managed infrastructure. Inference Endpoints give you the flexibility to quickly create endpoints on CPU or GPU resources, and they are billed by compute uptime rather than by usage. Further pricing information can be found here.
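Calling a dedicated endpoint looks much the same, except each deployment gets its own URL (shown in the Endpoints dashboard) and isn't subject to the shared rate limits. A quick sketch, with a placeholder endpoint URL and token:

```python
import requests

# Each dedicated Inference Endpoint exposes its own URL after deployment;
# the one below is a placeholder, not a real endpoint.
ENDPOINT_URL = "https://xxxxxx.us-east-1.aws.endpoints.huggingface.cloud"
HEADERS = {"Authorization": "Bearer hf_xxx"}  # token with access to the endpoint

response = requests.post(
    ENDPOINT_URL, headers=HEADERS, json={"inputs": "The answer to life is"}
)
response.raise_for_status()
print(response.json())
```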