Improve the throughput of HF Inference DLC

Hi Team,

Good day!!

We have deployed an NER BERT model on the HF Inference DLC (“”) using the boto3 API.

Instance type: p2.xlarge (4 CPU, 1 GPU)
Instance count: 2
Invocations per minute: 60
Invocations per instance: 60
GPUUtilization: 15-60 % per minute
CPUUtilization: Max 100%
CloudWatch logs:
Default workers per model: 1
W-model-1-stdout - Preprocess time - 0.06318092346191406 ms
W-9000-model - Backend response time: 1256
W-model-1-stdout - Predict time - 1253.3233165740967 ms
W-model-1-stdout - Postprocess time - 0.0054836273193359375 ms

Judging by the CloudWatch logs and CPUUtilization, the HF Inference DLC is using only one worker, i.e. one CPU core.

It looks like we are under-utilizing the resources: our two p2.xlarge instances together have 8 CPU cores and 2 GPUs. When we increase the batch size we see more failures, possibly because the single core in use runs out of resources.
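A rough back-of-the-envelope check of the figures above, as a sketch only. It assumes the "Backend response time" value is in milliseconds (the log line gives no unit, but it matches the predict time) and that each worker handles one request at a time. The function name is made up for illustration.

```python
# Figures copied from the CloudWatch log lines quoted above.
backend_response_ms = 1256    # "Backend response time" (unit assumed to be ms)
workers_per_instance = 1      # "Default workers per model: 1"
instance_count = 2            # "Instance count: 2"


def max_requests_per_minute(latency_ms, workers):
    """Upper bound on throughput if every worker processes requests serially."""
    return workers * 60_000 / latency_ms


per_instance = max_requests_per_minute(backend_response_ms, workers_per_instance)
total = per_instance * instance_count

print(f"per instance: {per_instance:.1f} requests/min")
print(f"endpoint:     {total:.1f} requests/min")
```

Under these assumptions a single worker tops out at roughly 48 requests per minute, so an instance receiving 60 invocations per minute would already be at or past its ceiling, regardless of how idle the GPU looks.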

How can we increase the number of workers when deploying the BERT model on a SageMaker real-time endpoint using the boto3 API?

Please share any references on utilizing all available workers and implementing multi-threading on HF Inference DLCs.


Hello @Vinayaks117,

When using a GPU instance, the DLC will start as many workers as there are GPUs available.

Thanks @philschmid

But why are the CPU resources not being utilized when using a GPU-based instance?

Please share if there is any documentation on resource utilization.

Because by default, without any complex setup, you can only load one model on the GPU.

OK. Surprisingly, GPUMemoryUtilization is just 10% and GPUUtilization is between 15% and 60% per minute. Whenever we try to increase the batch size beyond 60 RPM, we get more failures. It looks like the SM endpoint can’t handle such load even though there are enough GPU resources available.

FYI: We are directly invoking the SM real-time endpoint within the application.

Any ideas on how we can improve the throughput?

The memory utilization and GPU utilization are expected when your model is quite small compared to the input and model size.

What failure are you seeing?
Ways to improve would be to optimize your model, scale the instance, or add batching to it.
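As a minimal sketch of the batching option on the client side: the helper below groups texts and sends each group in one request. It assumes the endpoint accepts a JSON body of the form `{"inputs": [...]}` and returns a JSON list with one entry per input. The names `chunked` and `invoke_in_batches` are hypothetical, and the client is passed in so the function can be tried without AWS access.

```python
import json


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def invoke_in_batches(runtime_client, endpoint_name, texts, batch_size=8):
    """Send `texts` in groups of `batch_size` and collect all predictions."""
    predictions = []
    for batch in chunked(texts, batch_size):
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps({"inputs": batch}),
        )
        # Assumption: the response body is a JSON list, one entry per input.
        predictions.extend(json.loads(response["Body"].read()))
    return predictions


# Usage (needs boto3 and AWS credentials; "my-ner-endpoint" is a placeholder):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   results = invoke_in_batches(runtime, "my-ner-endpoint", ["text one", "text two"])
```

Fewer, larger requests reduce per-request overhead, but each request takes longer, so the batch size is worth tuning against the observed latency and failure rate.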

I think we are not utilizing all the GPU resources, so it makes sense to increase the batch size, but when we try that we get the following error. So we started scaling out the instances to handle the workload better and to avoid failures.

400 Client Error: Bad Request for url: Post Request

We know what a 400 Bad Request means, but if we send the same requests individually they work. I think SM itself can’t handle such a workload and utilize all the GPU resources.


Let’s say we want to deploy a DL model on an EC2 instance with a Flask app, a gunicorn server and nginx. There we can increase the number of workers and add more threads to each worker to handle the workload well.

How can we add more threads to each worker while using HF DLC?

Hello @philschmid

We have deployed on a CPU-based instance with 36 CPU cores, but the HF DLC starts only one worker. Please let us know how we can increase the number of workers explicitly.

As you mentioned earlier, while using GPU instances the DLC will start as many workers as GPUs are available and I believe it’s the same for CPU instances as well.


Could you please share exactly how you deployed the model, the instance type, and the initial logs? There you should see how many workers are registered.

Hi @philschmid

We deployed it using the boto3 API. We can see all available workers, and they are being utilized after the load test.

As per my understanding, the suggested number of workers = (2*CPU)+1

If we are using the “ml.c5.xlarge” instance type, which has 4 CPU cores, we could have up to 9 workers, but the HF DLC creates only 4.

Is there any way we can increase the number of workers or is there any reason behind creating 4 workers only?


@Vinayaks117 without really seeing the code I cannot try to reproduce or help you. In our examples, we are starting with the correct number of workers.

Using the SageMaker SDK you can set model_server_workers=int
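For a boto3 deployment, a minimal sketch of the equivalent is below. It relies on an assumption: that the SageMaker SDK passes `model_server_workers` to the container as the environment variable `SAGEMAKER_MODEL_SERVER_WORKERS`, and that the DLC honors it. Please verify this against your container version. The function name, the image URI, the S3 path and the role ARN are placeholders.

```python
def build_create_model_request(model_name, image_uri, model_data_url, role_arn, workers):
    """Build keyword arguments for the boto3 SageMaker `create_model` call."""
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {
            "Image": image_uri,
            "ModelDataUrl": model_data_url,
            "Environment": {
                # Task served by the HF inference toolkit (NER in this thread).
                "HF_TASK": "token-classification",
                # Assumption: env var behind the SDK's model_server_workers.
                "SAGEMAKER_MODEL_SERVER_WORKERS": str(workers),
            },
        },
    }


request = build_create_model_request(
    model_name="ner-bert",                                  # placeholder
    image_uri="<hf-inference-dlc-image-uri>",               # placeholder
    model_data_url="s3://<bucket>/<prefix>/model.tar.gz",   # placeholder
    role_arn="<execution-role-arn>",                        # placeholder
    workers=4,
)

# To create the model (needs boto3 and AWS credentials):
#   import boto3
#   boto3.client("sagemaker").create_model(**request)
```

More workers means more copies of the model in memory, so the value should stay within what the instance can hold.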

Thanks for sharing the details. It makes sense to go with the default number of workers; otherwise we will run into memory issues.

How can we implement real-time dynamic batch inference using HF DLC?

The HF DLC is based on multi-model-server, which supports dynamic batching through custom handlers and custom configurations. [Documentation].
To add this to your HF DLC, you would need to fork the original DLC and change the configurations.
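To illustrate what such a custom handler roughly looks like, here is a generic multi-model-server style sketch, not the HF DLC's actual handler. With dynamic batching enabled (in multi-model-server, via the `batch_size` and `max_batch_delay` settings at model registration), the server passes the handler a list with one entry per collected request and expects a list of the same length back. `predict_batch` is a hypothetical stand-in for the real model call, and the `{"inputs": ...}` body format is assumed.

```python
import json


def predict_batch(texts):
    """Hypothetical stand-in for a model call; returns one result per text."""
    return [{"text": text, "num_tokens": len(text.split())} for text in texts]


def handle(data, context):
    """Entry point: `data` holds one dict per request collected into the batch."""
    if data is None:
        return None  # nothing to process

    texts = []
    for request in data:
        body = request.get("body") or request.get("data")
        if isinstance(body, (bytes, bytearray)):
            body = json.loads(body.decode("utf-8"))
        texts.append(body["inputs"])

    # One forward pass for the whole batch instead of one per request.
    results = predict_batch(texts)

    # The response list must line up one-to-one with the incoming requests.
    return [json.dumps(result) for result in results]
```

The server does the waiting and grouping; the handler only has to process the whole list in one pass and keep the output order aligned with the input order.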