Create a batch transform job with a custom trained BioBERT model

Hi Team,

We have trained a BioBERT model on custom data using the PyTorch framework outside of SageMaker. We want to bring this model to SageMaker to run a batch transform job on it.

Is there any way we can try this, or do you have any suggestions?

Thanks,
Akash

Hello @akash97715,

can the model be loaded with .from_pretrained? If so, I see no problem for batch transform. It then just depends on whether your model is stored on S3 or on the Hugging Face Hub (Models - Hugging Face).

You can check-out this example: notebooks/sagemaker-notebook.ipynb at master · huggingface/notebooks · GitHub
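For example, a quick local check could look roughly like this (the directory path and model class are placeholders for your setup):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# placeholder: directory where your fine-tuned BioBERT was saved
model_dir = "./my_biobert_model"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)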

Yes, we are loading the model with from_pretrained. Shall I upload my custom trained model to S3, and can I then use this?

Also, the extension of the saved model is “model.pt”. Do I need to convert this, or can I use it directly for my batch transform jobs?

Thanks,
Akash

Also, PFA the code that we are using to save the model in our local environment. We are planning to take this saved model and run a batch transform job on it.

Hi Akash - in order to use your model on SageMaker you will have to create a model package called model.tar.gz that includes all the required model files. You can find all the info here: Deploy models to Amazon SageMaker
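For reference, a minimal packaging sketch in Python could look like this (the directory name and the listed files are just the typical contents of a Transformers save directory, so treat them as assumptions about your setup):

import tarfile

# placeholder: directory that contains the model files
# (typically config.json, pytorch_model.bin and the tokenizer files)
model_dir = "my_model_dir"

# package everything in the directory at the root of model.tar.gz
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add(model_dir, arcname=".")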

Hope that helps!

Cheers
Heiko

Hi Heiko,

Thank you for your response. I understand your point about creating the tar.gz file.

As I mentioned, I’m saving the model with the .pt extension since we are using the PyTorch framework.

Can you help me with how to create a tar.gz file, given that I have saved the model with the .pt extension (refer to my code snippet)? Has anybody done this earlier, or do we have a sample notebook that explains how to package a .pt file into a .tar.gz file?

Or do I need to follow a different approach while saving the model, i.e. save my PyTorch model in another way?

Thanks,
Akash

Also, one thing to keep in mind: I have a use case where I need to use batch transform rather than any endpoint approach, because we have multiple models for multiple hospital products and a large data volume, and creating multiple endpoints is not a feasible solution.

Hi Akash, there is no need for a different approach when you save your model. Once you have saved your model, create the model.tar.gz file with the following command: tar zcvf model.tar.gz *, just as described in the documentation I linked to earlier.

I looked for example notebooks and found this one: midas_depth_estimation.ipynb · GitHub

The markdowns are in Japanese (I think), but the code should give you an idea of how to go about it.

Hope that helps.

Cheers
Heiko

Hi,
To answer your question: we are not loading our custom trained BioBERT model with .from_pretrained; we are using torch.load() to load the model. As I mentioned, I am saving this model with the .pt extension. The process we are following is below:

We are downloading the pretrained model called “microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract” from the model hub, applying transfer learning to it with our custom hospital data, and then saving the model. Now we want to use this saved model to generate the classification labels on SageMaker.

Can you please guide me on how to save the model so that it can be loaded with .from_pretrained?

Thanks,
Akash

@akash97715 the easiest way to use the Hugging Face DLCs for a batch transform job with zero-code configuration is by saving your model using transformers.
That means you would need to replace the torch.save in your training script with:

model.save_pretrained("my_dir")
tokenizer.save_pretrained("my_dir")

After that, you can create a compatible model.tar.gz:

  1. Create a tar file:
cd {my_dir}
tar zcvf model.tar.gz *
  2. Upload model.tar.gz to S3:
aws s3 cp model.tar.gz s3://{my-s3-path}

Now you can provide the S3 URI to the model_data argument when creating your batch transform job.
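If it helps, a minimal sketch with the SageMaker Python SDK could look like this (the S3 paths, role, instance type and container versions are assumptions you would adjust to your setup):

from sagemaker.huggingface.model import HuggingFaceModel

# placeholders: adjust model_data, role, versions and paths to your account
huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/model/model.tar.gz",  # the archive uploaded above
    role="my-sagemaker-execution-role",
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
)

# create a batch transform job instead of a real-time endpoint
batch_job = huggingface_model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    strategy="SingleRecord",
)

batch_job.transform(
    data="s3://my-bucket/input/data.jsonl",  # one JSON record per line
    content_type="application/json",
    split_type="Line",
)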

Thanks a lot for guiding us to the right approach. I was able to implement it in my project :slight_smile:


Thank you for your valuable suggestions. I was able to implement the inference pipeline in my project.

Moving to the next question:
My task is to build a retraining pipeline using SageMaker, which means I want to bring my model training code to SageMaker and train the model on the training data stored in an S3 bucket. Also, once the training is completed, I want to save my transfer-learning-trained model back to the S3 bucket, and the plan is to use that model for production predictions. We are using Step Functions on AWS to build the orchestrator pipeline.

My question is how we can build a retraining pipeline using the PyTorch estimator on SageMaker. If you could point me to an implementation on a GitHub page, that would be great.

Thank you,
Akash

Hello @akash97715,

Last Oct-Nov we did a whole workshop series on “Enterprise-Scale NLP with Hugging Face & Amazon SageMaker”, with Part 1 on training models using Amazon SageMaker, Part 2 on scaling inference, and Part 3 on MLOps using SageMaker Pipelines.

All videos and resources are available on-demand at: GitHub - philschmid/huggingface-sagemaker-workshop-series: Enterprise Scale NLP with Hugging Face & SageMaker Workshop series
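If it helps as a starting point, a training job with the Hugging Face estimator roughly follows this pattern (entry point, role, instance type, versions, hyperparameters and S3 paths are all assumptions about your setup):

from sagemaker.huggingface import HuggingFace

# placeholders: adjust to your account, training script and data locations
huggingface_estimator = HuggingFace(
    entry_point="train.py",        # your transfer-learning script
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role="my-sagemaker-execution-role",
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters={
        "epochs": 3,
        "model_name": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",
    },
)

# SageMaker writes the resulting model.tar.gz back to S3 when the job finishes,
# which you can then reuse for batch transform
huggingface_estimator.fit({
    "train": "s3://my-bucket/train",
    "test": "s3://my-bucket/test",
})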

Thank you for providing the link.
I want to understand how to convert a normal dataset to the “datasets.arrow_dataset.Dataset” format. I see you are loading the imdb dataset directly with the load_dataset function from the datasets[s3] library.

I have my CSV file with 3 columns: 1) unique id, 2) description column (ML input), 3) labels.
I want to convert this dataframe to the “datasets.arrow_dataset.Dataset” format.



@akash97715 we have excellent documentation for the datasets library, which also includes how to load CSV datasets. Please take a look there: Load — datasets 1.18.3 documentation

I am always happy to help when you run into issues, but reading documentation has never hurt anyone.

>>> data_files = {"train": "train.csv", "test": "test.csv"}
>>> dataset = load_dataset("namespace/your_dataset_name", data_files=data_files)
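For a local CSV like yours, a minimal sketch could look like this (the file names and column handling are placeholders for your data):

from datasets import load_dataset, Dataset

# placeholder CSV files with your columns: unique id, description, labels
data_files = {"train": "train.csv", "test": "test.csv"}
dataset = load_dataset("csv", data_files=data_files)

# alternatively, if the data is already in a pandas DataFrame:
# dataset = Dataset.from_pandas(df)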

Thank you for your response, and I’m really sorry for this silly question. I will definitely have a look at the documentation from now on. I looked when you provided the link; it is really awesome documentation, and it has clarified my further doubts as well, like how I can load directly from the S3 bucket. :slight_smile:

Thank you once again
