How to deploy a T5 model to AWS SageMaker for fast inference?

Thanks @philschmid for this information about T5 in SageMaker Inference (no compression until today).

I used the translation script (I ran it locally in an AWS SageMaker notebook instance, as I made some changes to it). It comes with a `requirements.txt` (see my modified content below), but this file does not install `transformers==4.15`:

# content of my modified requirements.txt file
accelerate
datasets >= 1.16.0
sentencepiece != 0.1.92
protobuf
sacrebleu >= 1.4.12
py7zr
torch >= 1.3
jiwer
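If a specific `transformers` version is needed inside the container, one option (a sketch I have not verified on every DLC) is to pin it with an extra `transformers==4.15.0` line in `requirements.txt`, and to fail fast in the training script if the pin was not picked up:

# hypothetical sanity check at the top of run_translation.py;
# assumes requirements.txt contains the extra line: transformers==4.15.0
import transformers

assert transformers.__version__.startswith("4.15"), (
    f"expected transformers 4.15.x, got {transformers.__version__}"
)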

Then, I trained my T5 model on the AWS SageMaker Training DLC with the library versions from Reference >> Training DLC Overview. As shown in the following code from my notebook, I used `transformers==4.12.3` and PyTorch 1.9.1:

import sagemaker
from sagemaker.huggingface import HuggingFace

print(sagemaker.__version__)
# 2.72.1

huggingface_estimator = HuggingFace(
    base_job_name=base_job_name,
    checkpoint_s3_uri=checkpoint_s3_bucket,
    checkpoint_local_path=checkpoint_local_path,
    entry_point='run_translation.py',
    source_dir='./translation',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    transformers_version='4.12.3',
    pytorch_version='1.9.1',
    py_version='py38',
    hyperparameters=hyperparameters,
    # (...) remaining arguments elided
)
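For completeness, this is roughly how the job is then launched (the channel names and S3 prefixes below are placeholders; the real ones depend on how `run_translation.py` reads its data):

# hypothetical launch of the training job; channel names and
# S3 paths are assumptions, not the ones from my actual notebook
huggingface_estimator.fit({
    "train": "s3://my-bucket/data/train/",
    "validation": "s3://my-bucket/data/validation/",
})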

Then, I uploaded my T5 model to the HF model hub as a private model.
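For reference, a minimal sketch of that upload step (the local output directory is a placeholder; the repo id and token are the same placeholders as below):

# push the fine-tuned checkpoint to the Hub as a private repo (sketch);
# './model_output', the repo id, and the token are placeholders
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("./model_output")
tokenizer = AutoTokenizer.from_pretrained("./model_output")

model.push_to_hub("xxxxxxx", private=True, use_auth_token="xxxxxxxx")
tokenizer.push_to_hub("xxxxxxx", private=True, use_auth_token="xxxxxxxx")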

Finally, I used AWS SageMaker Inference with the same library versions in the following code:

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()

hub = {
    'HF_MODEL_ID': 'xxxxxxx',    # model_id from hf.co/models
    'HF_TASK': 'text2text-generation',
    'HF_API_TOKEN': 'xxxxxxxx',  # my API token (the model is private)
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,                      # IAM role with permissions to create an endpoint
    transformers_version='4.12.3',  # transformers version used
    pytorch_version='1.9.1',        # pytorch version used
    py_version='py38',              # python version of the DLC
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',
)

input_text = "xxxx"

data = {
    "inputs": input_text,
    "parameters": {
        "max_length": 32,        # same value as the one used for training
        "num_beams": 1,          # same value as the one used for training
        "early_stopping": True,  # same value as the one used for training
    }
}

# request
predictor.predict(data)
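As far as I know, the endpoint wraps a `text2text-generation` pipeline, which returns a list of dicts, so the output can be read like this:

# the pipeline response has the form [{"generated_text": "..."}]
result = predictor.predict(data)
print(result[0]["generated_text"])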

However, as said in my post, the predictions from `predictor.predict(data)` are different from the ones I get in a Colab notebook with the same PyTorch model and the same generation arguments (`num_beams`, …).
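For reference, this is roughly the Colab-side check I am comparing against (a sketch; the model id, token, and input text are the same placeholders as above):

# Colab-side reproduction (sketch): load the same private checkpoint from
# the Hub and pass the same generation arguments explicitly
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "xxxxxxx"  # same model id as in the endpoint's HF_MODEL_ID
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token="xxxxxxxx")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, use_auth_token="xxxxxxxx")

inputs = tokenizer("xxxx", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=32,
    num_beams=1,
    early_stopping=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))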

What do you think? Thank you for your help.