How to use a fine-tuned Hugging Face model saved at S3 at inference time?

Hi All,

I am new to transformers and I am trying to solve a text classification problem. I am using the transformers library to import a pre-trained transformer into SageMaker and fine-tune it on my dataset, following the steps in this link (notebooks/sagemaker-notebook.ipynb at main · huggingface/notebooks · GitHub).
Now that my model data is saved at an S3 location, I want to use it at inference time. I am using the code below to create a HuggingFaceModel object that reads in my model data and runs prediction by deploying it to an endpoint.

from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data="s3://models/my-bert-model/model.tar.gz",  # path to your trained SageMaker model
   role=role,                                            # IAM role with permissions to create an endpoint
   transformers_version="4.6",                           # Transformers version used
   pytorch_version="1.7",                                # PyTorch version used
   py_version='py36',                                    # Python version used
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.m5.xlarge",
)

# example request: you always need to define "inputs"
data = {
   "inputs": "Camera - You are awarded a SiPix Digital Camera! call 09061221066 fromm landline. Delivery within 28 days."
}

# request
result = predictor.predict(data)
However, I am not sure how I can use the "predictor" or the HuggingFaceModel object to get the following things at inference time:

  1. Class probabilities - predictor.predict() gives me the final class label and a score, whereas I want to see the class probabilities/logits. How can I get them?

  2. The hidden states and layers of the fine-tuned model - If I load a model directly from the Hub, I can use the code below to get the logits, hidden states, etc.:

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model = AutoModelForSequenceClassification.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")
    tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")
    inputs = tokenizer("test string", return_tensors="pt")
    labels = torch.tensor([1]).unsqueeze(0)
    outputs = model(**inputs, labels=labels, output_hidden_states=True, output_attentions=True)
    # outputs[0] gives the loss, outputs[1] gives the logits, and so on...

    but this is not applicable when I read my fine-tuned model into the HuggingFaceModel class object.

  3. At inference time, how do I truncate my text to ensure that only 512 tokens are passed to the model, since I cannot use the same tokenizer at inference time that was used at training? I tried setting the truncation parameter to true as below (suggested here: How are the inputs tokenized when model deployment? - #12 by philschmid), but it does not work for me:

long_sentence = "…" # longer than 512 tokens
sentiment_input = {
   "inputs": long_sentence,
   "parameters": {"truncation": True}
}
predictor.predict(sentiment_input)

Looking forward to your replies. Thank you.

Hi Sonali,

I believe the SageMaker Hugging Face Inference Toolkit should address all of your questions - in particular, the ability to override the default methods of the HuggingFaceHandlerService.

With this ability you can provide your own custom logic at inference time. For example, you could override the input_fn() function to truncate the model input to 512 tokens and the output_fn() function to return logits, hidden states, etc.
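To make that concrete, here is a minimal sketch of what such a custom inference.py could look like. It assumes the toolkit's model_fn()/predict_fn() override hooks and a model.tar.gz that contains both the fine-tuned model and its tokenizer; the exact response payload (logits, probabilities, hidden-state count) is my own illustrative choice, not a toolkit default.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def logits_to_probs(logits):
    # Softmax over the class dimension turns raw logits into class probabilities.
    return torch.softmax(logits, dim=-1)


def model_fn(model_dir):
    # model_dir is the directory where SageMaker has unpacked model.tar.gz.
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()
    return model, tokenizer


def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    # Truncate to at most 512 tokens (question 3).
    inputs = tokenizer(
        data["inputs"], truncation=True, max_length=512, return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # Return logits, probabilities, and hidden-state info (questions 1 and 2).
    return {
        "logits": outputs.logits.tolist(),
        "probabilities": logits_to_probs(outputs.logits).tolist(),
        "num_hidden_states": len(outputs.hidden_states),
    }
```

By the toolkit's convention, a script like this would go in a code/ directory inside model.tar.gz (or be passed via entry_point/source_dir when creating the HuggingFaceModel), so the endpoint uses it instead of the default handlers.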

See also this notebook for an end-to-end example: notebooks/sagemaker-notebook.ipynb at main · huggingface/notebooks · GitHub

Hope that helps.
