Hi,
We have a situation where an HF text2text model served from a Python Quart server on an EC2 m5.2xlarge is roughly 3x faster (~350 ms per request) than the same model on a SageMaker HF ml.m5.2xlarge endpoint (~1100 ms). The PT and HF versions differ slightly (HF 4.7.1 / PT 1.10 on SageMaker; HF 4.8 / PT 1.11 on EC2).
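For context, this is roughly how we time both paths (a minimal sketch; the Quart URL, endpoint name, and payload below are placeholders, not our real values):

```python
import time

import requests
from sagemaker.huggingface import HuggingFacePredictor

payload = {"inputs": "some representative input text"}

# EC2 / Quart path: plain HTTP POST to the Quart server (placeholder URL)
t0 = time.perf_counter()
requests.post("http://ec2-host:5000/predict", json=payload, timeout=30)
print(f"EC2/Quart: {(time.perf_counter() - t0) * 1000:.0f} ms")

# SageMaker path: same payload through the SDK predictor (placeholder name)
predictor = HuggingFacePredictor(endpoint_name="my-hf-endpoint")
t0 = time.perf_counter()
predictor.predict(payload)
print(f"SageMaker: {(time.perf_counter() - t0) * 1000:.0f} ms")
```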
What could explain such a big difference?
- a big perf difference between HF 4.7.1 and 4.8?
- some serving overhead in the SageMaker HF container (MMS) that needs to be tuned?
- PyTorch CPU parallelism (OMP_NUM_THREADS? anything else?) in the SageMaker HF container that needs to be tuned? (see the sketch after this list)
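In case it helps, this is how we'd plan to test the thread/worker tuning hypothesis: redeploy with explicit environment variables via the SageMaker Python SDK's HuggingFaceModel (a sketch; `model_data`, `role`, and the version strings are placeholders and would need to match an available HF DLC combination):

```python
from sagemaker.huggingface import HuggingFaceModel

model = HuggingFaceModel(
    model_data="s3://my-bucket/model.tar.gz",              # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder
    transformers_version="4.7.1",  # as on our current endpoint; adjust to
    pytorch_version="1.10",        # an available DLC combination
    py_version="py38",
    env={
        # m5.2xlarge has 8 vCPUs (4 physical cores); try e.g. 4 or 8
        "OMP_NUM_THREADS": "4",
        "MKL_NUM_THREADS": "4",
        # "SAGEMAKER_MODEL_SERVER_WORKERS": "1",  # MMS worker count,
        # if the inference toolkit version supports it
    },
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.2xlarge",
)
```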