Integration and Scale

My name is Grant, and I’m reaching out regarding the integration of a Hugging Face inference endpoint into our environment.

As a concrete example, we are considering the following endpoint: https://ui.endpoints.huggingface.co/grantdozier/new?repository=bigscience%2Fbloom

Our primary question concerns the capacity of a single dedicated endpoint. Specifically:

1. How many concurrent users can one dedicated endpoint serve?
2. What is the recommended ratio of endpoints to users for optimal performance?

This information is crucial for our planning, as we anticipate scaling to approximately 1,000 users within the next two months. Our goal is to determine the most efficient number of dedicated endpoints needed to support our user base.
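For our own planning we have been using a rough back-of-envelope calculation along the lines of the sketch below. Every number in it is a placeholder assumption (peak activity, request rate, latency), not a measurement, so we would gladly replace them with figures you consider realistic:

```python
# Back-of-envelope sketch; every number here is a placeholder assumption, not a measurement.
import math

total_users = 1000                # projected user base in ~2 months
peak_active_fraction = 0.10       # assumption: ~10% of users active at peak
requests_per_active_user_min = 2  # assumption: requests per active user per minute
avg_latency_s = 4.0               # assumption: per-request latency on one replica

peak_rps = total_users * peak_active_fraction * requests_per_active_user_min / 60
throughput_per_replica = 1 / avg_latency_s   # assumption: one request at a time per replica

replicas_needed = math.ceil(peak_rps / throughput_per_replica)
print(f"peak ~{peak_rps:.1f} req/s -> ~{replicas_needed} replicas")
```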

Additionally, we are evaluating several models on your platform for potential dedicated endpoint deployment. Any insights you can provide on best practices for scaling and performance optimization would be greatly appreciated.

Hi @grantdozier,

It depends on your configuration and on your users' usage patterns.

Hugging Face suggests an NVIDIA A100 instance (8 GPUs · 640 GB GPU memory · 88 vCPUs · 1160 GB RAM) for bigscience/bloom, but of course you can try smaller instances first.
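In case it helps, here is a minimal sketch of creating a dedicated endpoint programmatically with the `huggingface_hub` client, so you can start small and adjust the replica range later. The instance identifiers and cloud settings below are placeholders; take the exact `instance_type` / `instance_size` values from the Endpoints UI for your account:

```python
# Minimal sketch (not a definitive setup): create a dedicated endpoint via huggingface_hub.
# Instance identifiers below are placeholders; copy the exact values from the Endpoints UI.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "bloom-demo",                  # hypothetical endpoint name
    repository="bigscience/bloom",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",                  # assumption: adjust to your cloud provider
    region="us-east-1",            # assumption: adjust to your region
    type="protected",
    instance_type="nvidia-a100",   # placeholder identifier; check the UI for the real one
    instance_size="x8",            # placeholder: the 8x A100 size suggested above
    min_replica=1,
    max_replica=4,                 # autoscaling headroom for peak traffic
)

endpoint.wait()                    # block until the endpoint is running
print(endpoint.url)
```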

You might be interested in this article:

And this one is about scaling:

Thank you so much for taking the time to respond. Hugging Face support is really awesome! I liked the article about "Why we’re switching …", and I found the part about having a multi-model endpoint extremely interesting. We are looking to run about five different models; can you tell me a bit about how we would go about implementing this, or point us to an example of it being implemented successfully?
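To make the question concrete, here is the rough shape we imagine, if I understand the custom handler mechanism correctly (a `handler.py` defining an `EndpointHandler` class at the root of the endpoint repository). The model names and the `"model"` routing key below are purely illustrative assumptions on our side:

```python
# handler.py — rough sketch of a multi-model custom handler.
# Assumptions: model names are placeholders, the "model" routing key is our own
# convention, and all models fit into memory on a single instance.
from typing import Any, Dict, List

from transformers import pipeline


class EndpointHandler:
    def __init__(self, path: str = ""):
        # Load each model once at startup; keys are the names clients will send.
        self.pipelines = {
            "sentiment": pipeline(
                "text-classification",
                model="distilbert-base-uncased-finetuned-sst-2-english",
            ),
            "summarize": pipeline("summarization", model="sshleifer/distilbart-cnn-12-6"),
            # ... up to ~5 models, memory permitting
        }

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        # Expected payload: {"inputs": "...", "model": "sentiment"}
        model_key = data.get("model", "sentiment")
        inputs = data["inputs"]
        return self.pipelines[model_key](inputs)
```

Does this look like a reasonable direction, or would you recommend separate endpoints per model instead?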