Help for code

I’m using SageMaker with the Hugging Face inference container. In the model.tar.gz required by the endpoint, I’m using this inference code:

import os
import torch
from transformers import T5Tokenizer, pipeline


def model_fn(model_dir):
    # T5_WEIGHTS_NAME must be set to the file name of the saved weights; it is not defined in this snippet.
    model = torch.load(os.path.join(model_dir, T5_WEIGHTS_NAME))
    tokenizer = T5Tokenizer.from_pretrained(model_dir)

    # Use the first GPU if one is available, otherwise fall back to CPU (-1).
    device = 0 if torch.cuda.is_available() else -1

    generation = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device=device, max_length=1024)

    return generation

This gives worse performance than my local code, which looks like this:

tokenized_text = self.tokenizer_nl(
    input_text, truncation=True, padding="max_length", return_tensors="pt"
)

source_ids = tokenized_text["input_ids"].to(self.device, dtype=torch.long)
source_mask = tokenized_text["attention_mask"].to(self.device, dtype=torch.long)

generated_ids = self.model.generate(
    input_ids=source_ids, attention_mask=source_mask, max_length=1024
)

pred = self.tokenizer_sql.decode(

So I want to use this code in my inference script instead of the pipeline, but I don’t know how to write the inference code for that. Can someone help me? Thanks!

Hello @Gennaro,

Which versions of transformers and PyTorch are you using on SageMaker, and which are you using locally?
Do you use a GPU on your local machine as well?

@philschmid On SageMaker I’m using torch 1.9.1, transformers 4.12.3, and sagemaker 2.77.1.

Locally I’m using torch 1.10.2 and transformers 4.12.5.

Both on GPU.

And what latency difference do you see? Could you test with the same versions locally as well?

@philschmid There is a latency difference because I have different GPUs locally and remotely, but it is not significant (it is very small). Even with the same versions, I see differences when using the pipeline() method. I also see a difference between using a .pt and a .bin model. I have now switched to the .bin model because it gives fewer errors than the .pt one.

@Gennaro I might have misunderstood your question, sorry.
Since you have now switched to pytorch_model.bin, you should be able to deploy without creating a custom inference script, by just providing the env variables when creating the endpoint, similar to the snippet below.

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()
# Hub Model configuration.
hub = {

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
	initial_instance_count=1, # number of instances
	instance_type='ml.g4dn.xlarge' # ec2 instance type
)

# the Russian input means "My name is Wolfgang and I live in Berlin"
predictor.predict({
	'inputs': "Меня зовут Вольфганг и я живу в Берлине"
})

You can find more information in the documentation: Deploy models to Amazon SageMaker

@philschmid thank you for the answer. I’m using this code, but with my own model. So my code is:

role = sagemaker.get_execution_role()

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=model_data_url,  # path to your trained sagemaker model
    role=role,  # iam role with permissions to create an Endpoint
    transformers_version=transformers_version,  # "4.12.3"
    pytorch_version=pytorch_version,  # "1.9.1"
    py_version=py_version,  # "py38"
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(

My question is how to create my own inference script (that is, how to implement the model_fn and transform_fn methods), because I don’t want to use the pipeline() method but my own implementation.

You can take a look here: Deploy models to Amazon SageMaker
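As a rough illustration of the layout that documentation describes: a custom script is picked up from code/inference.py inside model.tar.gz, next to the model files at the archive root. The helper below is only a sketch of one way to build such an archive with the standard library; the function name build_model_archive and its parameters are made up for this example and are not part of any SageMaker API.

```python
import tarfile
from pathlib import Path


def build_model_archive(model_dir, inference_script, output_path="model.tar.gz"):
    """Pack the model files at the archive root and the script as code/inference.py.

    Hypothetical helper: the name and arguments are illustrative only.
    output_path should live outside model_dir so the archive does not include itself.
    """
    model_dir = Path(model_dir)
    with tarfile.open(output_path, "w:gz") as tar:
        # Model weights, config and tokenizer files go to the root of the archive.
        for path in sorted(model_dir.iterdir()):
            tar.add(path, arcname=path.name)
        # The custom inference code goes into the code/ directory.
        tar.add(inference_script, arcname="code/inference.py")
    return output_path
```

The resulting file would then be uploaded to S3 and its URL passed as model_data when creating the HuggingFaceModel.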

Thank you. I don’t see any example implementations there. Are there practical examples I could use as a starting point?

@Gennaro I created an example of how to do it: notebooks/sagemaker-notebook.ipynb at master · huggingface/notebooks · GitHub
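For readers who want the general shape without opening the notebook, here is a minimal sketch of a custom inference.py that replaces pipeline() with the tokenize / generate / decode steps from the local code above. It is not the notebook's code. It assumes the model was saved with save_pretrained (pytorch_model.bin plus config), that it is a T5ForConditionalGeneration, that one tokenizer is used for both input and output, and that decoding uses skip_special_tokens=True (the decode arguments are cut off in the original post). The request is assumed to look like {"inputs": "..."}.

```python
def model_fn(model_dir):
    """Load model and tokenizer once, when the endpoint starts."""
    # Heavy imports are kept local so the module can be imported without the GPU stack.
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = T5Tokenizer.from_pretrained(model_dir)
    # Assumption: the weights were saved with save_pretrained (pytorch_model.bin).
    model = T5ForConditionalGeneration.from_pretrained(model_dir).to(device)
    return model, tokenizer, device


def predict_fn(data, model_bundle):
    """Run generation for one request; model_bundle is whatever model_fn returned."""
    model, tokenizer, device = model_bundle
    input_text = data["inputs"]  # assumed request format: {"inputs": "<text>"}

    # Same tokenization settings as the local code.
    tokenized = tokenizer(
        input_text, truncation=True, padding="max_length", return_tensors="pt"
    )
    generated_ids = model.generate(
        input_ids=tokenized["input_ids"].to(device),
        attention_mask=tokenized["attention_mask"].to(device),
        max_length=1024,
    )
    # Assumption: special tokens are skipped when decoding.
    pred = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    # Mirror the shape of the pipeline output so existing clients keep working.
    return [{"generated_text": pred}]
```

Returning the tuple from model_fn keeps the tokenizer and device available in predict_fn without globals; implementing transform_fn instead would be an alternative if the request parsing and response encoding also need to be customized.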


@philschmid Thank you! It is very useful.
