Why does an HF text2text model take 1s on SageMaker but 300ms on EC2?


We have a situation where an HF text2text model served from a Python Quart server on an EC2 m5.2xlarge is about 3x faster (~350 ms) than the same model on a SageMaker HF ml.m5.2xlarge endpoint (~1100 ms). There are slight version differences (HF 4.7.1 and PT 1.10 on SageMaker vs. HF 4.8 and PT 1.11 on EC2). What could explain such a big difference?

  • a big performance difference between HF 4.7.1 and 4.8?
  • serving overhead in the SageMaker HF container (MMS) that needs tuning?
  • PyTorch parallelism (OMP_NUM_THREADS? anything else?) in the SageMaker HF container that needs tuning? (see the deploy-time sketch after this list)

How many HTTP workers does Quart use, and how many CPU cores does the model get in each setup?
By default, SageMaker starts one model server worker per vCPU, which means each worker's copy of the model might be limited to a single CPU core.