I’m hoping you can share a bit more about the Hugging Face API scaling algorithm.
Is there any hysteresis on scaling?
What are the signals used for scaling?
I’m curious how the system will react in a circumstance such as:
What happens when a single call to a largish model pegs the GPU for, say, 5-10 seconds? I ask because nvidia-smi will typically report >95% GPU utilization without the GPU’s SMs (visible in Nsight) or VRAM actually being saturated.
For instance, EC2 autoscaling typically has a policy configured with attached signals such as CPU %, but not everything depends on CPU, and even “low load” calls will often peg GPU % for a short period. EC2 offers “nvidia_smi_utilization_gpu” and “nvidia_smi_utilization_memory” metrics as of early this year, with statistic values like “average”, “percent”, etc. Can you share how your scaling policies are configured, and have you done any profiling that could give example scenarios of scaling behavior for shorter and longer calls?
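To make the scenario concrete, here is a minimal Python sketch of the effect I have in mind. It is not based on any knowledge of how the Hugging Face autoscaler works. The 1-second sampling interval, the 60-second averaging window, the 70% threshold, the utilization trace, and the function name are all assumptions chosen for illustration.

```python
from collections import deque


def windowed_average(samples, window):
    """Yield the trailing mean of `samples` over the last `window` points."""
    buf = deque(maxlen=window)
    for s in samples:
        buf.append(s)
        yield sum(buf) / len(buf)


# Hypothetical trace, one sample per second over two minutes:
# the GPU idles at 5%, then a single 8-second call reports 97%.
trace = [5.0] * 30 + [97.0] * 8 + [5.0] * 82

THRESHOLD = 70.0  # assumed scale-out threshold, in percent

# Seconds during which the raw (instantaneous) signal exceeds the threshold.
instant_breaches = sum(1 for s in trace if s > THRESHOLD)

# Seconds during which the 60-second trailing average exceeds the threshold.
averaged = list(windowed_average(trace, window=60))
averaged_breaches = sum(1 for a in averaged if a > THRESHOLD)

print(instant_breaches)
print(averaged_breaches)
print(round(max(averaged), 1))
```

In this toy trace, a policy keyed to the raw signal sees a breach on every sample of the burst, while one keyed to the windowed average never crosses the threshold. That difference is what I am trying to understand about your setup.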
I’m assuming, given the presentation, that these are more or less single-threaded/synchronous at the instance level without your API consumers modifying/customizing containers? Yes, I understand you’re using a load balancer in front of the raw instances.
To give context, I’m familiar with AWS EC2 and ECS scaling policies and more typical use cases, e.g. “Step and simple scaling policies for Amazon EC2 Auto Scaling” in the Amazon EC2 Auto Scaling documentation.
Or, as perhaps another follow-on question: do you have any synthetic metrics, or are you using out-of-the-box signals such as nvidia_smi_utilization_gpu? Do you see any amplitude ringing on scaling signals, and what does behavior look like upon deployment?
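By hysteresis and ringing I mean something like the following toy Python simulation. It is only a sketch of a generic threshold scaler, not a description of your system. The `simulate` and `changes` helpers, the thresholds, the cooldown parameter, and the demand values are all hypothetical.

```python
def simulate(load, scale_out_at, scale_in_at, cooldown=0, start=1):
    """Return the instance count after each step of a simple threshold scaler.

    `load` is total demand per step in arbitrary units, and per-instance
    utilization is modeled as demand / instances. All values are hypothetical.
    """
    instances, wait, history = start, 0, []
    for demand in load:
        util = demand / instances
        if wait > 0:
            wait -= 1  # still cooling down: take no scaling action
        elif util > scale_out_at:
            instances += 1
            wait = cooldown
        elif util < scale_in_at and instances > 1:
            instances -= 1
            wait = cooldown
        history.append(instances)
    return history


def changes(history):
    """Count how many times the instance count changes between steps."""
    return sum(1 for a, b in zip(history, history[1:]) if a != b)


steady = [1.5] * 20  # constant demand worth 1.5 instances

# Narrow band: scale-in threshold sits just below the scale-out threshold.
ringing = simulate(steady, scale_out_at=0.8, scale_in_at=0.78)

# Wide band: a large gap between the two thresholds acts as hysteresis.
damped = simulate(steady, scale_out_at=0.8, scale_in_at=0.4)

print(changes(ringing), changes(damped))
```

With the narrow band, the instance count flips between 1 and 2 on every step even though demand is constant. With the wide band it settles at 2 after the first step. I’m wondering which of these your deployments look more like in practice.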
And finally, do you offer any way to tune scaling? What visibility is there into instance count, both in real time and historically?