How to set MMS default_workers_per_model via Hugging Face SageMaker Hosting?

Hi,

I have a model too small to fully occupy my GPU memory. I’m curious about packing multiple model serving replicas onto the same GPU to maximize concurrent serving capacity. What environment variable should be set through the SageMaker SDK to control the MMS setting default_workers_per_model?

You can use the model_server_workers parameter to control the number of HTTP workers.
But AFAIK this shouldn’t have any effect on GPU, since when using a GPU the endpoint always defaults to the number of GPUs available.
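
For reference, here is a minimal sketch of passing model_server_workers through the SDK. The role ARN, S3 path, framework versions and instance type are placeholders I made up for illustration, and as far as I know the SDK forwards this value to the container as the SAGEMAKER_MODEL_SERVER_WORKERS environment variable, so treat that as an assumption to verify rather than a guarantee.

```python
def build_model_kwargs(role_arn, model_data, workers):
    """Build constructor arguments for sagemaker.huggingface.HuggingFaceModel."""
    return {
        "role": role_arn,
        "model_data": model_data,
        # Assumed version combination; pick one that matches an available container.
        "transformers_version": "4.26",
        "pytorch_version": "1.13",
        "py_version": "py39",
        # Requested number of model server workers.
        "model_server_workers": workers,
    }


def deploy(model_kwargs, instance_type="ml.g4dn.xlarge"):
    """Create the model and deploy it (needs the sagemaker package and AWS credentials)."""
    # Imported lazily so the kwargs can be inspected without the SDK installed.
    from sagemaker.huggingface import HuggingFaceModel

    model = HuggingFaceModel(**model_kwargs)
    return model.deploy(initial_instance_count=1, instance_type=instance_type)


# Hypothetical placeholder values, not real resources.
model_kwargs = build_model_kwargs(
    role_arn="arn:aws:iam::111122223333:role/ExampleSageMakerRole",
    model_data="s3://example-bucket/model.tar.gz",
    workers=4,
)
# predictor = deploy(model_kwargs)  # uncomment to actually create the endpoint
```

Whether the value is honored on a GPU instance is exactly the open question here, so it is worth checking the worker count in the endpoint's CloudWatch logs after deploying.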

Oh nice, I didn’t know the control was available through the SDK! But what does it do? When I look in GitHub it brings me here, and I don’t see how HF hosting uses it.

I think this is the MMS env variable I was looking for:

  • MMS_DEFAULT_WORKERS_PER_MODEL

What I’d like to achieve is to load multiple copies of the same model on the GPU. Even though the GPU compute won’t run fully concurrently, the CPU part will, and in some instances I have seen this increase the throughput of GPU endpoints.
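
A rough sketch of how I would try this, assuming the Hugging Face container passes MMS_DEFAULT_WORKERS_PER_MODEL through to MMS (I have not confirmed that it does). The env argument is the standard way to set container environment variables on a SageMaker model; the role ARN, S3 path, versions and instance type below are made-up placeholders.

```python
def build_mms_env(workers_per_model):
    """Return container environment variables requesting N workers per model."""
    if workers_per_model < 1:
        raise ValueError("workers_per_model must be at least 1")
    # SageMaker expects environment variable values to be strings.
    return {"MMS_DEFAULT_WORKERS_PER_MODEL": str(workers_per_model)}


def deploy_with_env(env, instance_type="ml.g4dn.xlarge"):
    """Deploy a Hugging Face model with the given environment (needs the sagemaker package)."""
    # Imported lazily so build_mms_env can be used without the SDK installed.
    from sagemaker.huggingface import HuggingFaceModel

    model = HuggingFaceModel(
        role="arn:aws:iam::111122223333:role/ExampleSageMakerRole",  # placeholder
        model_data="s3://example-bucket/model.tar.gz",  # placeholder
        transformers_version="4.26",  # assumed version combination
        pytorch_version="1.13",
        py_version="py39",
        env=env,
    )
    return model.deploy(initial_instance_count=1, instance_type=instance_type)


# Ask for three copies of the model on a single-GPU instance.
env = build_mms_env(3)
# predictor = deploy_with_env(env)  # uncomment to actually create the endpoint
```

Each worker loads its own copy of the model, so GPU memory use should grow roughly linearly with the worker count; I would start small and watch memory before raising it.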