My name is Grant, and I’m reaching out regarding the integration of a Hugging Face inference endpoint into our environment.
As an example, we are considering the following endpoint: https://ui.endpoints.huggingface.co/grantdozier/new?repository=bigscience%2Fbloom
Our primary question concerns the capacity of a single dedicated endpoint:
1. How many concurrent users can one dedicated endpoint serve?
2. What is the recommended ratio of endpoints to users for optimal performance?
This information is crucial for our planning, as we anticipate scaling to approximately 1,000 users within the next two months. Our goal is to determine the smallest number of dedicated endpoints that can support that user base without degrading latency.
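For context on how we are thinking about sizing, here is a rough back-of-envelope estimate of steady-state concurrency based on Little's law (concurrent requests ≈ request rate × average latency). All of the numbers below (requests per user per minute, average generation latency) are hypothetical placeholders, not measured values:

```python
def concurrent_requests(active_users: int,
                        requests_per_user_per_min: float,
                        avg_latency_s: float) -> float:
    """Estimate steady-state concurrent requests via Little's law:
    concurrency = arrival rate (req/s) * average latency (s)."""
    rate_per_s = active_users * requests_per_user_per_min / 60.0
    return rate_per_s * avg_latency_s

# Hypothetical example: 1,000 users, each sending 2 requests/min,
# with a 5 s average generation latency.
est = concurrent_requests(1000, 2.0, 5.0)
print(round(est, 1))  # ~166.7 concurrent requests
```

Knowing how many concurrent requests a single dedicated endpoint (with a given instance type and replica count) can sustain for a model like BLOOM would let us translate an estimate like this into an endpoint count.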
Additionally, we are evaluating several models on your platform for potential dedicated endpoint deployment. Any guidance you can offer on best practices for scaling (instance types, autoscaling, replica counts) and performance optimization would be greatly appreciated.