Hi,
We have a situation where an HF text2text model served from a Python Quart server on an EC2 m5.2xlarge is roughly 3x faster (~350 ms per request) than the same model on a SageMaker HF ml.m5.2xlarge endpoint (~1100 ms). The PT and HF versions differ slightly (HF 4.7.1 / PT 1.10 on SageMaker; HF 4.8 / PT 1.11 on EC2).
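For context, this is roughly how we time both paths (a minimal sketch; the Quart URL, endpoint name, and payload below are placeholders, not our real values):

```python
import time

import requests
from sagemaker.huggingface import HuggingFacePredictor

payload = {"inputs": "some representative input text"}

# EC2 / Quart path: plain HTTP POST to the Quart server (placeholder URL)
t0 = time.perf_counter()
requests.post("http://ec2-host:5000/predict", json=payload, timeout=30)
print(f"EC2/Quart: {(time.perf_counter() - t0) * 1000:.0f} ms")

# SageMaker path: same payload through the SDK predictor (placeholder name)
predictor = HuggingFacePredictor(endpoint_name="my-hf-endpoint")
t0 = time.perf_counter()
predictor.predict(payload)
print(f"SageMaker: {(time.perf_counter() - t0) * 1000:.0f} ms")
```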
What could explain such a big difference?
- a big perf difference between HF 4.7.1 and 4.8?
- some serving overhead in the SageMaker HF container (MMS) that needs to be tuned?
- PyTorch CPU parallelism (OMP_NUM_THREADS? anything else?) in the SageMaker HF container that needs to be tuned? (see the sketch after this list)
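In case it helps, this is how we'd plan to test the thread/worker tuning hypothesis: redeploy with explicit environment variables via the SageMaker Python SDK's HuggingFaceModel (a sketch; `model_data`, `role`, and the version strings are placeholders and would need to match an available HF DLC combination):

```python
from sagemaker.huggingface import HuggingFaceModel

model = HuggingFaceModel(
    model_data="s3://my-bucket/model.tar.gz",              # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder
    transformers_version="4.7.1",  # as on our current endpoint; adjust to
    pytorch_version="1.10",        # an available DLC combination
    py_version="py38",
    env={
        # m5.2xlarge has 8 vCPUs (4 physical cores); try e.g. 4 or 8
        "OMP_NUM_THREADS": "4",
        "MKL_NUM_THREADS": "4",
        # "SAGEMAKER_MODEL_SERVER_WORKERS": "1",  # MMS worker count,
        # if the inference toolkit version supports it
    },
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.2xlarge",
)
```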