Hi,
I just watched the video of the Workshop: Going Production: Deploying, Scaling & Monitoring Hugging Face Transformer models (11/02/2021) from Hugging Face.
Following the deployment instructions (timestamp: 28:14), I created a notebook instance (type: ml.m5.xlarge) on AWS SageMaker and uploaded the notebook lab3_autoscaling.ipynb from huggingface-sagemaker-workshop-series >> workshop_2_going_production on GitHub.
I ran it and got an inference time of about 70 ms for the QA model (distilbert-base-uncased-distilled-squad). Great!
Then I changed the model loaded from the HF model hub to t5-base with the following code:
```python
hub = {
    'HF_MODEL_ID': 't5-base',  # model_id from hf.co/models
    'HF_TASK': 'translation'   # NLP task you want to use for predictions
}
```
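For context, in the notebook this hub config is passed to a HuggingFaceModel, which is what `huggingface_model.deploy()` below refers to. A minimal sketch (the container versions and the role are assumptions tied to my setup; match them to your own notebook):

```python
hub = {
    'HF_MODEL_ID': 't5-base',  # model_id from hf.co/models
    'HF_TASK': 'translation'   # NLP task you want to use for predictions
}

def build_model(role):
    # role: an IAM role ARN with SageMaker permissions (assumption: obtained
    # via sagemaker.get_execution_role() inside a SageMaker notebook).
    from sagemaker.huggingface import HuggingFaceModel
    return HuggingFaceModel(
        env=hub,                     # passed to the container as environment variables
        role=role,
        transformers_version='4.6',  # assumed versions; use your notebook's values
        pytorch_version='1.7',
        py_version='py36',
    )
```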
I then deployed it with the following code:
```python
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge"
)
```
And then I ran an inference… but the inference time jumped to more than 700 ms!
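For anyone reproducing this, here is a sketch of how I think such client-side latencies can be measured fairly (averaged over several calls, with warm-up excluded); the `predictor` and payload in the usage comment are assumptions taken from the notebook above:

```python
import time

def mean_latency_ms(invoke, payload, n_warmup=2, n_runs=10):
    """Return the mean wall-clock latency of invoke(payload) in milliseconds."""
    for _ in range(n_warmup):   # warm-up calls, excluded from timing
        invoke(payload)
    start = time.perf_counter()
    for _ in range(n_runs):
        invoke(payload)
    return (time.perf_counter() - start) * 1000.0 / n_runs

# Usage against the deployed endpoint (assumption: `predictor` from the notebook):
# print(mean_latency_ms(predictor.predict,
#                       {"inputs": "translate English to German: Hello!"}))
```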
Since in the video (timestamp: 57:05) @philschmid said that some models still cannot be deployed this way, I would like to check whether T5 models (up to ByT5) are optimized for inference on AWS SageMaker (for example, through ONNX quantization) or not.
If they are not yet optimized (as it appears), when will they be?
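For what it's worth, a sketch of the kind of optimization I have in mind, exporting T5 to ONNX Runtime via Hugging Face Optimum (the library choice, model id, and output path are my assumptions, not something shown in the workshop):

```python
def export_t5_to_onnx(model_id='t5-base', out_dir='t5-base-onnx'):
    # Assumption: `optimum[onnxruntime]` is installed in the environment.
    from optimum.onnxruntime import ORTModelForSeq2SeqLM
    from transformers import AutoTokenizer
    # Export the PyTorch seq2seq model to ONNX graphs usable by ONNX Runtime.
    model = ORTModelForSeq2SeqLM.from_pretrained(model_id, export=True)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model.save_pretrained(out_dir)      # writes encoder/decoder ONNX files + config
    tokenizer.save_pretrained(out_dir)
    return out_dir
```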
Note: I noticed the same problem with T5 inference through the Inference API (see this thread: How to get Accelerated Inference API for T5 models?).