Hi @smalltoken, what is the issue with https://huggingface.co/blog/how-to-train ?
This colab should help you. It walks you through how to:
- train a tokenizer from scratch
- create a `RobertaModel` using the config
- use the `DataCollatorForLanguageModeling`, which handles the masking
- train using `Trainer`
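
In case the masking step is the unclear part, here is a rough plain-Python sketch of the kind of dynamic masking `DataCollatorForLanguageModeling` applies with its default settings (15% of tokens selected; of those, 80% become the mask token, 10% a random token, 10% stay unchanged; every other position gets the label `-100` so the loss skips it). This is only an illustration, not the library's code: the function name `mask_tokens`, its parameters, and the token ids in the usage example are all made up for this post.

```python
import random

# Label value that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100


def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Hypothetical helper: return (masked_ids, labels) for one token sequence."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, tok in enumerate(input_ids):
        # Never select special tokens such as <s>, </s>, or <pad>,
        # and skip most ordinary tokens as well.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        # The model is asked to predict the original token at this position.
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id              # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


if __name__ == "__main__":
    # Made-up ids; 0 and 2 stand in for <s> and </s>, 4 for <mask>.
    ids = [0, 713, 16, 10, 7728, 3645, 2]
    masked, labels = mask_tokens(ids, mask_token_id=4, vocab_size=1000,
                                 special_ids={0, 2}, rng=random.Random(0))
    print(masked)
    print(labels)
```

Because the masking is drawn fresh each time a batch is built, the same sentence gets masked differently across epochs, which is why you pass the collator to `Trainer` instead of masking the dataset once up front.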