I have a model that is too small to fully occupy my GPU memory. I’m curious about packing multiple model serving replicas onto the same GPU to maximize concurrent serving capacity. What environment variable should be set through the SageMaker SDK to set the MMS control
You can use model_server_workers (REF) to control the number of HTTP workers.
But AFAIK this shouldn’t work for GPU, since when using GPU the endpoint always defaults the number of workers to the number of GPUs available.
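For reference, here is a minimal sketch of how the parameter might be passed through the SDK. The role ARN, S3 path, version strings and instance type are placeholders, and `build_model_kwargs` and `deploy` are hypothetical helper names, not part of the SDK. As far as I know the SDK forwards the value to the container as the SAGEMAKER_MODEL_SERVER_WORKERS environment variable; whether the Hugging Face container honours it on GPU instances is exactly the open question above.

```python
def build_model_kwargs(workers, role, model_data):
    """Collect keyword arguments for sagemaker.huggingface.HuggingFaceModel.

    Hypothetical helper: the version strings below are placeholders and
    should be replaced with a combination your account actually supports.
    """
    if workers < 1:
        raise ValueError("workers must be >= 1")
    return {
        "role": role,
        "model_data": model_data,
        "transformers_version": "4.26",  # placeholder
        "pytorch_version": "1.13",       # placeholder
        "py_version": "py39",            # placeholder
        # Number of model server workers requested per endpoint instance.
        "model_server_workers": workers,
    }


def deploy(model_kwargs, instance_type="ml.g4dn.xlarge"):
    """Create the model and deploy it (needs the sagemaker package and AWS credentials)."""
    # Imported lazily so the kwargs can be built and inspected without the SDK.
    from sagemaker.huggingface import HuggingFaceModel

    model = HuggingFaceModel(**model_kwargs)
    return model.deploy(initial_instance_count=1, instance_type=instance_type)


kwargs = build_model_kwargs(
    workers=4,
    role="<execution-role-arn>",             # placeholder
    model_data="s3://<bucket>/model.tar.gz",  # placeholder
)
# predictor = deploy(kwargs)  # uncomment once the placeholders are filled in
```

Checking the endpoint logs after deployment is probably the quickest way to see how many workers were actually started.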
Oh nice, I didn’t know the control was available through the SDK! But what does it do? When I look in GitHub it brings me here, and I don’t see how HF hosting uses it.
I think these are the MMS env variables I was looking for:
What I’d like to achieve is to load multiple copies of the same model on the GPU. Even though the GPU compute won’t run fully concurrently, the CPU part will, and in some instances I have seen this increase the throughput capacity of GPU endpoints.
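To pick a worker count for this, a rough back-of-the-envelope estimate can help. The sketch below is only an assumption-laden starting point: `max_replicas` is a hypothetical helper, the per-replica footprint has to be measured (it should include the CUDA context and activation memory, not just the weights), and the 10% headroom is an arbitrary default.

```python
import math


def max_replicas(gpu_mem_gib, per_replica_gib, headroom=0.1):
    """Estimate how many model copies fit on one GPU.

    gpu_mem_gib:     total GPU memory in GiB
    per_replica_gib: measured memory footprint of one loaded copy in GiB
    headroom:        fraction of GPU memory left unused as a safety margin
    """
    if per_replica_gib <= 0:
        raise ValueError("per_replica_gib must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable = gpu_mem_gib * (1 - headroom)
    # Always return at least one replica, even if the estimate says none fit.
    return max(1, math.floor(usable / per_replica_gib))


# Example: a 16 GiB GPU and a copy that takes about 3 GiB once loaded.
print(max_replicas(16, 3.0))  # 4
```

The result is only an upper bound on memory grounds; the actual throughput gain depends on how much of each request is CPU-bound, so it is worth load testing a few worker counts below that bound.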