API scaling algorithms and configuration

I’m hoping you can share a bit more about the Hugging Face API scaling algorithm.

Is there any hysteresis on scaling?

What are the signals used for scaling?

I’m curious how the system will react in a circumstance such as:

What happens when the GPU is pegged by a single call to a largish model for, say, 5-10 seconds? nvidia-smi will typically report >95% GPU utilization without actually saturating the GPU's SMs (visible in Nsight) or VRAM, which is why I ask.
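For concreteness, this is the coarse signal I mean; a rough Python sketch of the same query, keeping in mind that utilization.gpu measures the fraction of the sample window in which any kernel was running, not SM occupancy:

```python
import subprocess

# Poll the same coarse utilization signal nvidia-smi exposes.
# utilization.gpu is "% of time a kernel was resident", not SM
# occupancy, so a single lightweight kernel can peg it near 100%.
out = subprocess.check_output(
    [
        "nvidia-smi",
        "--query-gpu=utilization.gpu,utilization.memory,memory.used",
        "--format=csv,noheader,nounits",
    ],
    text=True,
)
gpu_util, mem_util, mem_used = (v.strip() for v in out.splitlines()[0].split(","))
print(f"GPU util: {gpu_util}%  mem util: {mem_util}%  VRAM used: {mem_used} MiB")
```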

For instance, EC2 autoscaling typically has a policy configured with attached signals such as CPU %, but not everything depends on CPU, and often even "low load" calls will peg GPU % for a short period. The CloudWatch agent offers "nvidia_smi_utilization_gpu" and "nvidia_smi_utilization_memory" metrics as of early this year, with statistics like "Average" and units like "Percent". Can you share how your scaling policies are configured, and have you done any profiling to give example scenarios for shorter and longer call scaling behaviors?
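To make that concrete, here's roughly how I'd wire a target-tracking policy to that GPU metric on the EC2 side. Names are illustrative, and it assumes the CloudWatch agent is publishing the nvidia_smi_* metrics under the CWAgent namespace with the ASG name appended as a dimension:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Sketch: track average GPU utilization published by the CloudWatch
# agent. The ASG name is hypothetical, and the dimension assumes the
# agent's append_dimensions includes AutoScalingGroupName.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-gpu-asg",
    PolicyName="gpu-util-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "CustomizedMetricSpecification": {
            "MetricName": "nvidia_smi_utilization_gpu",
            "Namespace": "CWAgent",
            "Dimensions": [
                {"Name": "AutoScalingGroupName", "Value": "my-gpu-asg"},
            ],
            "Statistic": "Average",
            "Unit": "Percent",
        },
        "TargetValue": 70.0,  # hold average GPU util near 70%
    },
)
```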

I’m assuming, given the presentation, that these are more or less single-threaded/synchronous at the instance level, without your API consumers modifying/customizing containers? Yes, I understand you’re using a load balancer in front of the raw instances.

To give context, I’m familiar with AWS EC2 and ECS scaling policies and the more typical use cases, e.g. “Step and simple scaling policies for Amazon EC2 Auto Scaling” in the AWS docs.

Or, as another follow-on question: I’m wondering whether you have any synthetic metrics or are using out-of-the-box signals such as nvidia_smi_utilization_gpu, whether you see any amplitude ringing on the scaling signals, and what behavior looks like upon deployment.
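To illustrate the ringing I mean: with a single threshold and no hysteresis, a noisy utilization signal can make the replica count oscillate, while a separate up/down band plus an averaging window damps it. A toy sketch (the thresholds and load model are entirely made up):

```python
import random

def scale_decision(util_history, replicas, up=80, down=40, window=3):
    """Toy autoscaler: separate up/down thresholds (a hysteresis band)
    averaged over a short window to damp amplitude ringing."""
    if len(util_history) < window:
        return replicas
    avg = sum(util_history[-window:]) / window
    if avg > up:
        return replicas + 1
    if avg < down and replicas > 1:
        return replicas - 1
    return replicas  # inside the band: hold steady

replicas = 1
history = []
for step in range(30):
    # Fake a noisy, bursty load that is shared across replicas.
    load = 150 + 60 * random.random()
    history.append(min(100.0, load / replicas))
    replicas = scale_decision(history, replicas)
    print(f"t={step:2d} util={history[-1]:5.1f}% replicas={replicas}")
```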

And finally, do you offer any way to tune scaling? What visibility is there into instance count, in real time and historically?

Thanks.

-Victor
Panopstor

Hello Victor,

We kept the autoscaling pretty simple and straightforward for now, to avoid complexity. The endpoint is scaled between the min and max replica counts based on CPU/GPU utilization. The threshold is around 70-80% over a window of 1-3 minutes. You can see the current replica count on the analytics page or by using the API to get the endpoint's information.
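For example, something along these lines with the Endpoints API; the namespace/endpoint name are placeholders, and the exact fields in the status payload may change, so check the API docs for the current schema:

```python
import requests

# Sketch: fetch an endpoint's current status, including replica info.
# NAMESPACE and ENDPOINT_NAME are placeholders; authenticate with a
# Hugging Face token that has access to the endpoint.
API = "https://api.endpoints.huggingface.cloud/v2/endpoint"
resp = requests.get(
    f"{API}/NAMESPACE/ENDPOINT_NAME",
    headers={"Authorization": "Bearer hf_xxx"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json().get("status", {}))  # replica counts are reported here
```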

If there is interest from your side or from other customers, we can make this feature more customizable.

Is there a specific use case/reason why you need those details?

Thanks for the response. That answer satisfies my curiosity; I was just looking for a sort of baseline expected behavior. I'll take a closer look at the API for endpoint info.
