Thanks @philschmid for this information about T5 in SageMaker Inference (no compression until today).
I used the translation script (I ran it locally in an AWS SageMaker notebook instance, as I made some changes to the script). It has a requirements.txt (see my modified content below), but this file did not install `transformers==4.15`:
```text
# content of my modified requirements.txt file
accelerate
datasets >= 1.16.0
sentencepiece != 0.1.92
protobuf
sacrebleu >= 1.4.12
py7zr
torch >= 1.3
jiwer
```
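As a side note: if a specific transformers version were needed inside the training container, it could be pinned explicitly in the same requirements.txt (illustrative line; here pinning the DLC's own version for consistency):

```text
transformers==4.12.3
```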
Then I trained my T5 model on the AWS SageMaker Training DLC with the library versions from Reference >> Training DLC Overview. As shown in the following screenshot and code from my notebook, I used `transformers==4.12.3` and PyTorch 1.9.1:
```python
print(sagemaker.__version__)
# 2.72.1

huggingface_estimator = HuggingFace(
    base_job_name=base_job_name,
    checkpoint_s3_uri=checkpoint_s3_bucket,
    checkpoint_local_path=checkpoint_local_path,
    entry_point='run_translation.py',
    source_dir='./translation',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    transformers_version='4.12.3',
    pytorch_version='1.9.1',
    py_version='py38',
    hyperparameters=hyperparameters,
    (...)
)
```
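For context, the `hyperparameters` dict passed to the estimator maps to the CLI arguments of run_translation.py. The values below are purely illustrative (not my actual settings), just to show the shape:

```python
# Illustrative sketch of the `hyperparameters` dict passed to the estimator;
# every key becomes a --flag of run_translation.py. All values are hypothetical.
hyperparameters = {
    'model_name_or_path': 't5-base',                      # hypothetical base model
    'source_lang': 'en',                                  # hypothetical language pair
    'target_lang': 'fr',
    'source_prefix': 'translate English to French: ',
    'output_dir': '/opt/ml/checkpoints',
    'do_train': True,
    'num_train_epochs': 3,
    'per_device_train_batch_size': 8,
    'max_target_length': 32,                              # should match max_length at inference
    'num_beams': 1,                                       # should match num_beams at inference
}
```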
Then I uploaded my T5 model to the HF model hub in private mode.
Finally, I used AWS SageMaker Inference with the same library versions, as in the following code:
```python
from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()

hub = {
    'HF_MODEL_ID': 'xxxxxxx',  # model_id from hf.co/models
    'HF_TASK': 'text2text-generation',
    'HF_API_TOKEN': "xxxxxxxx"  # my API token
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,  # iam role with permissions to create an Endpoint
    transformers_version="4.12.3",  # transformers version used
    pytorch_version="1.9.1",  # pytorch version used
    py_version="py38",  # python version of the DLC
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge"
)

input_text = "xxxx"
data = {
    "inputs": input_text,
    "parameters": {
        "max_length": 32,       # same value as used for training
        "num_beams": 1,         # same value as used for training
        "early_stopping": True  # same value as used for training
    }
}

# request
predictor.predict(data)
```
However, as said in my post, the predictions from `predictor.predict(data)` are different from the ones I get in a Colab notebook with the same PyTorch model and the same generation arguments (`num_beams`, …).
What do you think? Thank you for your help.
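For reference, here is a small helper I can run in both environments to compare installed versions (a minimal sketch; with `num_beams=1` generation is deterministic, so a difference in transformers or torch versions between the endpoint and Colab would be my first suspect):

```python
import importlib.metadata as md

def get_versions(packages=("transformers", "torch", "sentencepiece")):
    """Return the installed version of each package, or None if absent."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = md.version(pkg)
        except md.PackageNotFoundError:
            versions[pkg] = None
    return versions

# Run this in both the Colab notebook and a notebook using the same DLC image,
# then compare the two outputs line by line.
print(get_versions())
```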