Improve the throughput of HF Inference DLC

Hi Team,

Good day!!

We have deployed a NER BERT model on the HF Inference DLC (“763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04”) using the boto3 API.
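For reference, a minimal sketch of the kind of boto3 deployment we are using (the role ARN, S3 path, and resource names below are placeholders, and the HF_TASK environment variable is shown only as an assumption about our setup):

```python
import boto3

sm = boto3.client("sagemaker", region_name="us-east-2")

# Model backed by the HF Inference DLC image and our model artifact.
sm.create_model(
    ModelName="ner-bert",
    ExecutionRoleArn="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder
    PrimaryContainer={
        "Image": (
            "763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-inference:"
            "1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04"
        ),
        "ModelDataUrl": "s3://my-bucket/ner-bert/model.tar.gz",  # placeholder
        "Environment": {"HF_TASK": "token-classification"},
    },
)

# Endpoint config with two p2.xlarge instances behind one variant.
sm.create_endpoint_config(
    EndpointConfigName="ner-bert-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "ner-bert",
        "InstanceType": "ml.p2.xlarge",
        "InitialInstanceCount": 2,
    }],
)

sm.create_endpoint(EndpointName="ner-bert", EndpointConfigName="ner-bert-config")
```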

Instance type: p2.xlarge (4 CPU, 1 GPU)
Instance count: 2
Invocations per minute: 60
Invocations per instance: 60
GPUUtilization: 15-60 % per minute
CPUUtilization: Max 100%
CloudWatch logs:
Default workers per model: 1
W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Preprocess time - 0.06318092346191406 ms
W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 1256
W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Predict time - 1253.3233165740967 ms
W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Postprocess time - 0.0054836273193359375 ms

Looking at the CloudWatch logs and CPUUtilization, the HF Inference DLC is utilizing only one worker, i.e. one CPU core.

It looks like we are under-utilizing the resources, because the two p2.xlarge instances give us 8 CPU cores and 2 GPUs in total. If we increase the batch size we see more failures, which might be because no resources are left on the single core that is being utilized.

How can we increase the number of workers when deploying the BERT model on a SM real-time endpoint using the boto3 API?

Please share any references on utilizing all available workers and implementing multi-threading on HF Inference DLCs.

Thanks,
Vinayak

Hello @Vinayaks117,

When using a GPU instance, the DLC will start as many workers as there are GPUs available.

Thanks @philschmid

But why are CPU resources not being utilized when using a GPU-based instance?

Please share if there is any documentation on resource utilization.

Because by default, without any complex setup, you can only load one copy of the model onto the GPU.

Ok. Surprisingly, GPUMemoryUtilization is just 10% and GPUUtilization is between 15-60% per minute. Whenever we try to push the load beyond 60 RPM by increasing the batch size, we get more failures; it looks like the SM endpoint can’t handle that load even though there are enough GPU resources available.

FYI: We are directly invoking the SM real-time endpoint within the application.
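For completeness, a minimal sketch of how we call it (the endpoint name is a placeholder, and the payload shape assumes the standard “inputs” JSON the HF DLC expects):

```python
import json

import boto3

smr = boto3.client("sagemaker-runtime", region_name="us-east-2")


def predict(texts):
    """Send a list of sentences to the NER endpoint and return the parsed result."""
    response = smr.invoke_endpoint(
        EndpointName="ner-bert",  # placeholder name
        ContentType="application/json",
        Body=json.dumps({"inputs": texts}),
    )
    return json.loads(response["Body"].read())
```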

Any ideas on how we can improve the throughput?

Those memory and GPU utilization numbers are expected when your model and inputs are quite small.

What failures are you seeing?
Ways to improve throughput would be to optimize your model, scale the instance, or add batching.
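If you go the scaling route, a rough sketch of target-tracking autoscaling on the endpoint variant with boto3 could look like this (the endpoint/variant names, capacities, and target value are placeholders you would need to tune for your workload):

```python
import boto3

aas = boto3.client("application-autoscaling")

# Placeholder endpoint and variant names.
resource_id = "endpoint/ner-bert/variant/AllTraffic"

aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,
    MaxCapacity=4,
)

aas.put_scaling_policy(
    PolicyName="ner-bert-invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Scale out once a single instance handles more than ~60 invocations/minute.
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 300,
        "ScaleInCooldown": 300,
    },
)
```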

I think we are not utilizing all the GPU resources, so it made sense to increase the batch size, but when we try that we get the following errors. So we started scaling out the instances to handle the workload better and to avoid failures as well.

400 Client Error: Bad Request for url: Post Request

We know what a 400 Bad Request is, but if we send the same requests individually they work. I think SM itself can’t handle such a workload and utilize all the GPU resources.


Let’s say we want to deploy a DL model on an EC2 instance with a Flask app, a gunicorn server, and nginx; there we can increase the number of workers and add more threads to each worker to handle the workload well.
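With gunicorn, for example, that is just a config setting; a minimal sketch of a gunicorn.conf.py with illustrative values:

```python
# gunicorn.conf.py -- illustrative values, not tuned for any specific workload
import multiprocessing

# Common rule of thumb: (2 * CPU) + 1 worker processes.
workers = (2 * multiprocessing.cpu_count()) + 1

# Threaded workers so each process can serve several requests concurrently.
worker_class = "gthread"
threads = 2

bind = "0.0.0.0:8000"  # nginx proxies requests to this port
timeout = 120
```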

How can we add more threads to each worker while using the HF DLC?

Hello @philschmid

We have deployed using a CPU-based instance which has 36 CPU cores, but the HF DLC is starting only one worker. Please let us know how we can increase the number of workers explicitly.

As you mentioned earlier, when using GPU instances the DLC will start as many workers as there are GPUs available, and I believe it’s the same for CPU instances as well.

Thanks

Could you please share exactly how you deployed the model and why, the instance type, as well as the initial logs? There you should see how many workers are registered.

Hi @philschmid

We have deployed it using the boto3 API, and after a load test we are able to see that all available workers are being utilized.

As per my understanding, the suggested number of workers = (2 * CPU) + 1.

If we use the “ml.c5.xlarge” instance type, which has 4 CPU cores, that formula says we could have a maximum of 9 workers, but the HF DLC creates only 4 workers.

Is there any way we can increase the number of workers, or is there a reason behind creating only 4 workers?

Thanks

@Vinayaks117 without actually seeing the code I cannot try to reproduce this or help you. In our examples, we are starting with the correct number of workers.

Using the SageMaker SDK you can set model_server_workers=int
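A minimal sketch of that with the SageMaker SDK (the model data, role, and worker count are placeholders; if you stay on plain boto3, my understanding is that the same knob is exposed through the SAGEMAKER_MODEL_SERVER_WORKERS environment variable on the model, but please verify that against the inference toolkit documentation):

```python
from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/ner-bert/model.tar.gz",      # placeholder
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder
    transformers_version="4.17.0",
    pytorch_version="1.10.2",
    py_version="py38",
    model_server_workers=4,  # e.g. one MMS worker per vCPU on ml.c5.xlarge
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge",
)
```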

Thanks for sharing the details. It makes sense to go with the default number of workers, otherwise we would run into memory issues.

How can we implement real-time dynamic batch inference using HF DLC?

The HF DLC is based on multi-model-server, which supports dynamic batching through custom handlers and custom configurations. [Documentation].
To add this to your HF DLC you would need to fork the original DLC and change the configuration.
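As a very rough illustration only (not the official handler), a batch-aware MMS custom handler could look something like the sketch below. It assumes the model has been registered with a batch size greater than 1 so that the server hands the handler a list of requests per call; the request parsing and all names are assumptions:

```python
# model_handler.py -- sketch of an MMS custom handler that processes a whole batch at once
import json


class BatchNerHandler:
    def __init__(self):
        self.initialized = False
        self.pipeline = None

    def initialize(self, context):
        # Load the model once per worker; MMS exposes the model directory here.
        from transformers import pipeline

        model_dir = context.system_properties.get("model_dir")
        # device=0 assumes a GPU worker; use device=-1 on CPU instances.
        self.pipeline = pipeline("ner", model=model_dir, device=0)
        self.initialized = True

    def handle(self, data, context):
        if data is None:
            return None
        # With dynamic batching enabled, `data` is a list with one entry per request.
        texts = []
        for item in data:
            payload = item.get("body") or item.get("data")
            texts.append(json.loads(payload)["inputs"])
        # One forward pass over the whole batch instead of one pass per request.
        predictions = self.pipeline(texts)
        # MMS expects one response per incoming request, in the same order.
        # `default=float` coerces numpy scalars in the pipeline output to JSON numbers.
        return [json.dumps(pred, default=float) for pred in predictions]


_service = BatchNerHandler()


def handle(data, context):
    if not _service.initialized:
        _service.initialize(context)
    return _service.handle(data, context)
```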