Create a batch transform job with a custom-trained BioBERT model

Hi Team,

We have trained a BioBERT model on custom data using the PyTorch framework outside of SageMaker. We want to bring this model to SageMaker to run a batch transform job on it.

Is there an approach we can try, or do you have any suggestions?


Hello @akash97715,

Can the model be loaded with .from_pretrained? If so, I see no problem for batch transform. It then just depends on whether your model is stored on S3 or on the Hub (Models - Hugging Face).

You can check out this example: notebooks/sagemaker-notebook.ipynb at master · huggingface/notebooks · GitHub

Yes, we are loading the model with from_pretrained. Should I upload my custom-trained model to S3 so that I can use it?

Also, the extension of the saved model is “”. Do I need to convert this extension, or can I use it directly for my batch transform jobs?


Also, please find attached the code that we are using to save the model in our local environment. We are planning to take this saved model and run a batch transform job on it.

Hi Akash - in order to use your model on SageMaker you will have to create a model package called model.tar.gz that includes all the required model files. You can find all the info here: Deploy models to Amazon SageMaker

Hope that helps!


Hi Heiko,

Thank you for your response. I understand your point about creating a tar.gz file.

As I mentioned, I'm saving the model with a .pt extension, as we are using the PyTorch framework.

Can you help me create a tar.gz file, given that I have saved the model with a .pt extension (refer to my code snippet)? Has anybody done this before, or is there a sample notebook that explains how to package a .pt file into a .tar.gz file?

Or do I need to follow a different approach when saving the model, i.e. save my PyTorch model in another way?


Also, one thing to keep in mind: I have a use case where I need batch transform rather than an endpoint, because we have multiple models for multiple hospital products and a large data volume, so creating multiple endpoints is not feasible.

Hi Akash, there is no need for a different approach when you save your model. Once you have saved your model, create the model.tar.gz file with the following command: tar zcvf model.tar.gz * , just as described in the documentation I linked to earlier.

I looked for example notebooks and found this one: midas_depth_estimation.ipynb · GitHub

The markdowns are in Japanese (I think) but the code should be helpful for you to get an idea how to go about it.

Hope that helps.


To answer your question: we are not loading our custom-trained BioBERT model with .from_pretrained. We are using torch.load() to load the model; as I mentioned, I am saving this model with a .pt extension. The process we follow is below:

We download the pretrained model “microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract” from the model hub, apply transfer learning to it with our custom hospital data, and then save the model. Now we want to use this saved model to generate classification labels on SageMaker.

Can you please guide me on how to save the model so that it can be loaded with “.from_pretrained”?


@akash97715 the easiest way to use the Hugging Face DLCs for a batch transform job with zero-code configuration is to save your model using transformers.
Meaning you would need to replace the in your training script with.
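
One way to read the advice above, as a minimal sketch rather than the exact code from the reply: call save_pretrained on both the model and the tokenizer, so that the output directory can later be reloaded with .from_pretrained. The helper name export_for_from_pretrained and its arguments are hypothetical; the only assumption about the objects passed in is that they expose save_pretrained, as transformers models and tokenizers do.

```python
from pathlib import Path


def export_for_from_pretrained(model, tokenizer, output_dir):
    """Save a fine-tuned model and its tokenizer in the transformers layout.

    Hypothetical helper: `model` and `tokenizer` are assumed to be
    transformers objects (anything exposing `save_pretrained` works).
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Writes the config and weights that `.from_pretrained(output_dir)` reads.
    model.save_pretrained(str(out))
    # Writes the vocabulary and tokenizer config next to the weights.
    tokenizer.save_pretrained(str(out))
    # Return the file names so the caller can check what was written.
    return sorted(p.name for p in out.iterdir())
```

In a training script this would stand in for the call that currently writes the .pt file, for example export_for_from_pretrained(model, tokenizer, "my_dir"). This assumes the fine-tuned model is a transformers model class; a plain torch.nn.Module wrapper would not have save_pretrained.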


After that, you can create a compatible model.tar.gz:

  1. Create a tar file:
cd {my_dir}
tar zcvf model.tar.gz *
  2. Upload model.tar.gz to S3:
aws s3 cp model.tar.gz s3://{my-s3-path}
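
If you would rather build the archive from Python than from the shell, the packaging step can be sketched with the standard-library tarfile module. The function name make_model_archive is hypothetical; the point it illustrates is that the model files sit at the root of the archive, not inside a parent folder. Unlike the shell glob, this sketch also picks up hidden files.

```python
import tarfile
from pathlib import Path


def make_model_archive(model_dir, archive_path="model.tar.gz"):
    """Pack the contents of `model_dir` into a gzip-compressed tar archive.

    Files are stored at the archive root (no parent folder), which roughly
    mirrors running `tar zcvf model.tar.gz *` from inside the directory.
    """
    model_dir = Path(model_dir)
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        for item in sorted(model_dir.iterdir()):
            # Skip the archive itself if it is written into the same folder.
            if item.resolve() == archive_path.resolve():
                continue
            # arcname drops the directory prefix so files sit at the top level.
            tar.add(item, arcname=item.name)
    return archive_path
```

Calling make_model_archive("my_dir") would write model.tar.gz to the current working directory, ready for the aws s3 cp upload.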

Now you can provide the S3 URI to the model_data argument when creating your batch transform job.
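
As a rough sketch of what that looks like with the SageMaker Python SDK: HuggingFaceModel takes the S3 URI as model_data, and its transformer(...) method returns the object that runs the job. The function name run_batch_transform, the default instance type, and the HF_TASK value are assumptions for illustration; the three version arguments have to match a container combination that your SDK version supports, so they are left as parameters here.

```python
def run_batch_transform(model_data, role, input_s3_uri, output_s3_uri,
                        transformers_version, pytorch_version, py_version,
                        task="text-classification",
                        instance_type="ml.m5.xlarge"):
    """Start a batch transform job from a model.tar.gz stored on S3."""
    # Imported here so the function can be defined without the SDK installed.
    from sagemaker.huggingface import HuggingFaceModel

    model = HuggingFaceModel(
        model_data=model_data,      # e.g. s3://{my-s3-path}/model.tar.gz
        role=role,                  # IAM role with SageMaker permissions
        transformers_version=transformers_version,
        pytorch_version=pytorch_version,
        py_version=py_version,
        env={"HF_TASK": task},      # assumed task; selects the inference pipeline
    )
    transformer = model.transformer(
        instance_count=1,
        instance_type=instance_type,  # assumed default; pick what fits your data
        output_path=output_s3_uri,
        strategy="SingleRecord",
    )
    # Each line of the input file is one JSON request, e.g. {"inputs": "some text"}.
    transformer.transform(
        data=input_s3_uri,
        content_type="application/json",
        split_type="Line",
    )
    return transformer
```

Because nothing here creates an endpoint, the same function can be called once per model, which fits the multi-model use case described earlier in the thread.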

Thanks a lot for guiding us to the right approach. I was able to implement it in my project.


Thank you for your valuable suggestions. I was able to implement the inference pipeline in my project.

Moving on to the next question:
My task is to build a retraining pipeline using SageMaker. That means I want to bring my model training code to SageMaker and train the model on training data stored in an S3 bucket. Once training is complete, I want to save my transfer-learning model back to the S3 bucket; the plan is to use that model for production predictions. We are using Step Functions on AWS to build the orchestration pipeline.

My question is how we can build a retraining pipeline using the PyTorch estimator on SageMaker. If you could point me to a GitHub page with an implementation, that would be great.

Thank you,

Hello @akash97715,

Last October-November we ran a whole workshop series on “Enterprise-Scale NLP with Hugging Face & Amazon SageMaker”: Part 1 covers training models using Amazon SageMaker, Part 2 scaling inference, and Part 3 MLOps using SageMaker Pipelines.

All videos and resources are available on-demand at: GitHub - philschmid/huggingface-sagemaker-workshop-series: Enterprise Scale NLP with Hugging Face & SageMaker Workshop series

Thank you for providing the link.
I want to understand how to convert an ordinary dataset to the “datasets.arrow_dataset.Dataset” format. I see you are loading the imdb dataset directly with the load_dataset function from the datasets[s3] module.

I have a CSV file with 3 columns: 1) unique ID, 2) description (the ML input), 3) labels.
I want to convert this dataframe to the “datasets.arrow_dataset.Dataset” format.

@akash97715 we have excellent documentation for the datasets library, which also includes how to load csv datasets. Please take a look there. Load — datasets 1.18.3 documentation

I am always happy to help when you run into issues, but reading documentation has never hurt anyone.

>>> from datasets import load_dataset
>>> data_files = {"train": "train.csv", "test": "test.csv"}
>>> # "csv" selects the generic CSV loader for local files
>>> dataset = load_dataset("csv", data_files=data_files)

Thank you for your response, and really sorry for this silly question. I will definitely look into the documentation from now on. I read it when you provided the link. It is really good documentation, and it has also cleared up my other doubts, such as how I can load data directly from an S3 bucket.

Thank you once again
