Integration and Scale

My name is Grant, and I’m reaching out regarding the integration of a Hugging Face inference endpoint into our environment.

As a concrete example, we are considering the following endpoint: https://ui.endpoints.huggingface.co/grantdozier/new?repository=bigscience%2Fbloom

Our primary question concerns the capacity of a single dedicated endpoint. Specifically:

1. How many concurrent users can one dedicated endpoint serve?
2. What is the recommended ratio of endpoints to users for optimal performance?

This information is crucial for our planning, as we anticipate scaling to approximately 1,000 users within the next two months. Our goal is to determine the most efficient number of dedicated endpoints needed to support our user base.
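For our own planning we have been using a rough back-of-envelope calculation along the lines of the sketch below. Every number in it is a placeholder assumption (peak activity, request rate, latency), not a measurement, so we would gladly replace them with figures you consider realistic:

```python
# Back-of-envelope sketch; every number here is a placeholder assumption, not a measurement.
import math

total_users = 1000                # projected user base in ~2 months
peak_active_fraction = 0.10       # assumption: ~10% of users active at peak
requests_per_active_user_min = 2  # assumption: requests per active user per minute
avg_latency_s = 4.0               # assumption: per-request latency on one replica

peak_rps = total_users * peak_active_fraction * requests_per_active_user_min / 60
throughput_per_replica = 1 / avg_latency_s   # assumption: one request at a time per replica

replicas_needed = math.ceil(peak_rps / throughput_per_replica)
print(f"peak ~{peak_rps:.1f} req/s -> ~{replicas_needed} replicas")
```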

Additionally, we are evaluating several models on your platform for potential dedicated endpoint deployment. Any insights you can provide on best practices for scaling and performance optimization would be greatly appreciated.

Hi @grantdozier,

It depends on your configuration and on your users' usage patterns.

Hugging Face suggests an NVIDIA A100 instance (8 GPUs · 640 GB GPU memory · 88 vCPUs · 1160 GB RAM) for bigscience/bloom, but of course you can try smaller instances first.
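In case it helps, here is a minimal sketch of creating a dedicated endpoint programmatically with the `huggingface_hub` client, so you can start small and adjust the replica range later. The instance identifiers and cloud settings below are placeholders; take the exact `instance_type` / `instance_size` values from the Endpoints UI for your account:

```python
# Minimal sketch (not a definitive setup): create a dedicated endpoint via huggingface_hub.
# Instance identifiers below are placeholders; copy the exact values from the Endpoints UI.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "bloom-demo",                  # hypothetical endpoint name
    repository="bigscience/bloom",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",                  # assumption: adjust to your cloud provider
    region="us-east-1",            # assumption: adjust to your region
    type="protected",
    instance_type="nvidia-a100",   # placeholder identifier; check the UI for the real one
    instance_size="x8",            # placeholder: the 8x A100 size suggested above
    min_replica=1,
    max_replica=4,                 # autoscaling headroom for peak traffic
)

endpoint.wait()                    # block until the endpoint is running
print(endpoint.url)
```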

You might be interested in this article:

And this one is about scaling:

Thank you so much for taking the time to respond. Hugging Face support is really awesome! I liked the article about "Why we’re switching …", and I found the part about having a multi-model endpoint extremely interesting. We are looking to run about five different models; can you tell me a bit about how we would go about implementing this, or point us to an example of it being implemented successfully?
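To make the question concrete, here is the rough shape we imagine, if I understand the custom handler mechanism correctly (a `handler.py` defining an `EndpointHandler` class at the root of the endpoint repository). The model names and the `"model"` routing key below are purely illustrative assumptions on our side:

```python
# handler.py — rough sketch of a multi-model custom handler.
# Assumptions: model names are placeholders, the "model" routing key is our own
# convention, and all models fit into memory on a single instance.
from typing import Any, Dict, List

from transformers import pipeline


class EndpointHandler:
    def __init__(self, path: str = ""):
        # Load each model once at startup; keys are the names clients will send.
        self.pipelines = {
            "sentiment": pipeline(
                "text-classification",
                model="distilbert-base-uncased-finetuned-sst-2-english",
            ),
            "summarize": pipeline("summarization", model="sshleifer/distilbart-cnn-12-6"),
            # ... up to ~5 models, memory permitting
        }

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        # Expected payload: {"inputs": "...", "model": "sentiment"}
        model_key = data.get("model", "sentiment")
        inputs = data["inputs"]
        return self.pipelines[model_key](inputs)
```

Does this look like a reasonable direction, or would you recommend separate endpoints per model instead?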