Train a masked language model with custom data on SageMaker

enpassant · February 10, 2022, 5:47pm

I have a small text corpus that I managed to train on with Colab here. It’s basically adapted from the EsperBERTo example.

I looked at the HF SageMaker training example and this example. I can’t figure out how to adapt or set the hyperparameters and estimator parameters, or how to upload the correct dataloader and tokenizer files to S3, so that I can do MLM training on SageMaker.

Does anyone have any advice? Is there good boilerplate code that can help me?

Thanks, Eric.

marshmellow77 · February 11, 2022, 3:08pm

Hi Eric - this repo has lots of boilerplate code and should be very useful to get started: notebooks/sagemaker at master · huggingface/notebooks · GitHub

Cheers
Heiko
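
As a rough illustration of how the pieces in the question fit together, here is a minimal sketch that is not taken from the thread or the linked repo. It assumes the stock `run_mlm.py` script from the `transformers` examples is used as the entry point, that the corpus is a single text file with one example per line, and that the tokenizer was saved with `save_pretrained` so it can be loaded from a directory. The file name `train.txt`, the channel names `train` and `tokenizer`, the instance type, the `roberta-base` config, and the version strings are all placeholder assumptions to adjust for your setup.

```python
from typing import Any, Dict

# SageMaker mounts each input channel passed to fit() at
# /opt/ml/input/data/<channel_name> inside the training container.
TRAIN_CHANNEL_DIR = "/opt/ml/input/data/train"
TOKENIZER_CHANNEL_DIR = "/opt/ml/input/data/tokenizer"


def build_mlm_hyperparameters(
    train_filename: str = "train.txt",  # hypothetical file name
    epochs: int = 1,
    batch_size: int = 16,
) -> Dict[str, Any]:
    """Command-line arguments for run_mlm.py, expressed as a dict."""
    return {
        "model_type": "roberta",            # train from scratch (assumption)
        "config_name": "roberta-base",      # architecture only (assumption)
        "tokenizer_name": TOKENIZER_CHANNEL_DIR,
        "train_file": f"{TRAIN_CHANNEL_DIR}/{train_filename}",
        "do_train": True,
        "line_by_line": True,
        "max_seq_length": 128,
        "mlm_probability": 0.15,
        "num_train_epochs": epochs,
        "per_device_train_batch_size": batch_size,
        "output_dir": "/opt/ml/model",      # SageMaker uploads this dir to S3
    }


def launch_training(role: str, train_s3_uri: str, tokenizer_s3_uri: str):
    """Start the training job; requires the sagemaker SDK and AWS credentials."""
    from sagemaker.huggingface import HuggingFace  # imported lazily

    estimator = HuggingFace(
        entry_point="run_mlm.py",
        source_dir="./examples/pytorch/language-modeling",
        # Pull the example script straight from the transformers repo.
        git_config={
            "repo": "https://github.com/huggingface/transformers.git",
            "branch": "v4.17.0",            # assumed version, match the container
        },
        instance_type="ml.p3.2xlarge",      # assumed instance type
        instance_count=1,
        role=role,
        transformers_version="4.17.0",      # assumed container versions
        pytorch_version="1.10.2",
        py_version="py38",
        hyperparameters=build_mlm_hyperparameters(),
    )
    # Each key becomes a channel directory under /opt/ml/input/data/.
    estimator.fit({"train": train_s3_uri, "tokenizer": tokenizer_s3_uri})
    return estimator
```

To use it, upload `train.txt` and the saved tokenizer directory to two S3 prefixes, then call `launch_training(role, "s3://<bucket>/<train-prefix>", "s3://<bucket>/<tokenizer-prefix>")` with your own execution role and S3 locations. A custom training script can replace `run_mlm.py` by pointing `entry_point` and `source_dir` at local files and dropping `git_config`.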
