Fine-tune a Masked Language Model on a custom dataset

Hi @smalltoken, what is the issue with https://huggingface.co/blog/how-to-train ?
This colab should help you. It walks you through:

  1. How to train a tokenizer from scratch
  2. Create a RobertaModel using the config
  3. Use the DataCollatorForLanguageModeling, which handles the masking
  4. Train using Trainer.
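
In case it helps to see what "handles the masking" means in step 3, here is a rough plain-Python sketch of the usual masked-LM rule. This is not the library's code, just an illustration. As far as I know the collator selects about 15% of tokens by default, replaces 80% of those with the mask token, swaps 10% for a random token, leaves 10% unchanged, and sets the label to -100 everywhere else so the loss ignores those positions. The function name `mask_tokens`, its parameters, and the token ids below are all made up for the example.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_tokens(input_ids, mask_token_id, vocab_size,
                special_ids=(), mlm_probability=0.15, rng=None):
    """Hypothetical helper: return (masked_inputs, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, tok in enumerate(input_ids):
        # Special tokens are never selected; ordinary tokens are selected
        # with probability mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model has to predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id              # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return masked, labels


# Made-up ids; 0 and 2 stand in for the <s> and </s> special tokens, 4 for <mask>.
ids = [0, 713, 16, 10, 7728, 3645, 2]
masked, labels = mask_tokens(ids, mask_token_id=4, vocab_size=52000,
                             special_ids={0, 2}, rng=random.Random(0))
print(masked)
print(labels)
```

In the real setup you don't write this yourself: you pass the collator to Trainer and it applies the masking to each batch on the fly, so the same sentence gets masked differently on every epoch.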