Train a masked language model with custom data on SageMaker

I have a small text corpus that I managed to train on with Colab here. It's basically adapted from the EsperBERTo example.

I looked at the HF SageMaker training example and this example. I can't figure out how to adapt/set the hyperparameters and estimator params, or how to upload the correct dataloader and tokenizer files to S3, to do MLM training on SageMaker.

Does anyone have any advice? Is there good boilerplate code that can help me?

Thanks, Eric.

Hi Eric - this repo has lots of boilerplate code and should be very useful to get started: notebooks/sagemaker at master · huggingface/notebooks · GitHub
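As a rough sketch of how the pieces asked about might fit together (not taken from the notebooks themselves): the hyperparameters below are arguments of the `run_mlm.py` example script in the transformers repo, and the estimator is the `HuggingFace` estimator from the SageMaker Python SDK. The function names `build_mlm_job` and `launch`, the channel names `train` and `tokenizer`, the file name `train.txt`, the instance type, and the version numbers are all assumptions for illustration; check them against the container versions your account supports. The tokenizer folder is assumed to be loadable by `AutoTokenizer`, for example one written with `save_pretrained`.

```python
def build_mlm_job(train_s3_uri, tokenizer_s3_uri, role_arn):
    """Return (estimator_kwargs, fit_inputs) for a hypothetical MLM training job."""
    # SageMaker mounts each input channel under /opt/ml/input/data/<channel_name>.
    channel_root = "/opt/ml/input/data"

    # These are passed to the entry point as command-line arguments.
    hyperparameters = {
        "model_type": "roberta",                           # train from scratch
        "tokenizer_name": f"{channel_root}/tokenizer",     # folder from the "tokenizer" channel
        "train_file": f"{channel_root}/train/train.txt",   # hypothetical file name
        "line_by_line": True,
        "max_seq_length": 128,
        "mlm_probability": 0.15,
        "do_train": True,
        "num_train_epochs": 1,
        "per_device_train_batch_size": 16,
        "output_dir": "/opt/ml/model",                     # uploaded to S3 when the job ends
    }

    estimator_kwargs = {
        "entry_point": "run_mlm.py",
        "source_dir": "./examples/pytorch/language-modeling",
        # Pull the training script from the transformers repo (branch is an assumption).
        "git_config": {
            "repo": "https://github.com/huggingface/transformers.git",
            "branch": "v4.26.0",
        },
        "instance_type": "ml.p3.2xlarge",   # assumption: any supported GPU instance
        "instance_count": 1,
        "role": role_arn,
        "transformers_version": "4.26",     # assumption: must match an available container
        "pytorch_version": "1.13",
        "py_version": "py39",
        "hyperparameters": hyperparameters,
    }

    # Channel name -> S3 prefix; the names determine the mount folders above.
    fit_inputs = {"train": train_s3_uri, "tokenizer": tokenizer_s3_uri}
    return estimator_kwargs, fit_inputs


def launch(estimator_kwargs, fit_inputs):
    """Start the training job (needs the sagemaker SDK and AWS credentials)."""
    from sagemaker.huggingface import HuggingFace

    estimator = HuggingFace(**estimator_kwargs)
    estimator.fit(fit_inputs)
    return estimator
```

With the corpus and tokenizer files uploaded to two S3 prefixes, something like `launch(*build_mlm_job("s3://my-bucket/train", "s3://my-bucket/tokenizer", role))` would start the job; the bucket names here are placeholders.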

Cheers
Heiko
