How to set MMS default_workers_per_model via Hugging Face SageMaker Hosting?

OlivierCR · May 3, 2022, 3:31pm

Hi,

I have a model that is too small to fully occupy my GPU memory. I’m curious about packing multiple model-serving replicas onto the same GPU to maximize concurrent serving capacity. Which environment variable should be set through the SageMaker SDK to control the MMS setting default_workers_per_model?

philschmid · May 3, 2022, 4:11pm

You can use model_server_workers REF to control the number of HTTP workers.
But AFAIK this shouldn’t work for GPU, since on GPU instances the worker count always defaults to the number of GPUs available.
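For reference, a minimal sketch of how model_server_workers could be passed through the SageMaker Python SDK. The S3 URI, role ARN, framework versions, and instance type are placeholders (assumptions, not values from this thread), and as noted above the setting may have no effect on GPU instances.

```python
from typing import Any, Dict


def build_model_kwargs(model_data: str, role: str, workers: int) -> Dict[str, Any]:
    """Collect constructor arguments for sagemaker.huggingface.HuggingFaceModel."""
    if workers < 1:
        raise ValueError("workers must be a positive integer")
    return {
        "model_data": model_data,        # hypothetical S3 URI of a model.tar.gz
        "role": role,                    # hypothetical IAM execution role ARN
        "transformers_version": "4.17",  # placeholder versions, adjust to your container
        "pytorch_version": "1.10",
        "py_version": "py38",
        "model_server_workers": workers,  # the SDK parameter mentioned above
    }


def deploy(kwargs: Dict[str, Any], instance_type: str = "ml.g4dn.xlarge"):
    """Create the endpoint. Needs the sagemaker package and AWS credentials."""
    from sagemaker.huggingface import HuggingFaceModel  # imported lazily on purpose

    model = HuggingFaceModel(**kwargs)
    return model.deploy(initial_instance_count=1, instance_type=instance_type)


kwargs = build_model_kwargs(
    "s3://my-bucket/model.tar.gz",                       # placeholder
    "arn:aws:iam::111122223333:role/my-sagemaker-role",  # placeholder
    workers=2,
)
# predictor = deploy(kwargs)  # uncomment to actually create an endpoint
```

Building the arguments separately from the deploy call makes it easy to inspect what will be sent before any AWS resources are created.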

OlivierCR · May 4, 2022, 9:35am

Oh nice, I didn’t know the control was available through the SDK! But what does it do? When I look in GitHub it brings me here, and I don’t see how HF hosting uses it.

I think this is the MMS env variable I was looking for:

MMS_DEFAULT_WORKERS_PER_MODEL

What I’d like to achieve is to load multiple copies of the same model on the GPU; even though the GPU compute won’t run fully concurrently, the CPU part will, and in some instances I have seen this increase the throughput of GPU endpoints.
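A rough sketch of how that variable could be handed to the container through the SDK's env argument. Whether the Hugging Face inference container actually honors MMS_DEFAULT_WORKERS_PER_MODEL is not confirmed in this thread, so treat it as an experiment; the S3 URI, role ARN, versions, and instance type are placeholders.

```python
from typing import Dict


def mms_worker_env(workers_per_model: int) -> Dict[str, str]:
    """Build the container environment for the MMS worker setting.

    SageMaker passes environment values as strings, so the count is converted.
    """
    if workers_per_model < 1:
        raise ValueError("workers_per_model must be a positive integer")
    return {"MMS_DEFAULT_WORKERS_PER_MODEL": str(workers_per_model)}


def deploy_with_env(env: Dict[str, str], instance_type: str = "ml.g4dn.xlarge"):
    """Create the endpoint. Needs the sagemaker package and AWS credentials."""
    from sagemaker.huggingface import HuggingFaceModel  # imported lazily on purpose

    model = HuggingFaceModel(
        model_data="s3://my-bucket/model.tar.gz",                 # placeholder
        role="arn:aws:iam::111122223333:role/my-sagemaker-role",  # placeholder
        transformers_version="4.17",                              # placeholder versions
        pytorch_version="1.10",
        py_version="py38",
        env=env,  # forwarded to the container as environment variables
    )
    return model.deploy(initial_instance_count=1, instance_type=instance_type)


env = mms_worker_env(4)  # ask for four copies of the model on the one GPU
# predictor = deploy_with_env(env)  # uncomment to actually create an endpoint
```

If the variable is picked up, each worker loads its own copy of the model, so GPU memory use should grow roughly with the worker count; checking the endpoint's CloudWatch logs for the number of started workers is one way to verify it.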
