Using a dataset with already masked tokens

I am trying to fine-tune BERT for Masked Language Modeling, and I would like to use a dataset that already contains masked tokens (I want to mask particular words rather than randomly chosen ones).
How can I do this?
I am following the instructions in this notebook: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling.ipynb#scrollTo=KDBi0reX3l_g
but I am not sure which parts of the code I need to change to make it work with a dataset that already contains [MASK] tokens.
Thanks!

The masking is done by the data collator DataCollatorForLanguageModeling. Just pass mlm=False to that data collator to deactivate the random masking there.
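For reference, here is a minimal sketch of what that could look like, assuming you keep the Trainer setup from the notebook and only swap in the collator. The model checkpoint, output directory, and the dataset variables (tokenized_train_dataset, tokenized_eval_dataset) are placeholders for your own pre-masked, tokenized splits:

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# mlm=False turns off the collator's random masking, so the [MASK] tokens
# already present in your tokenized dataset are passed through unchanged.
# (With mlm=False the collator builds labels by copying input_ids, with
# padding positions set to -100.)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-premasked-mlm"),  # placeholder output dir
    train_dataset=tokenized_train_dataset,   # placeholder: your pre-masked train split
    eval_dataset=tokenized_eval_dataset,     # placeholder: your pre-masked eval split
    data_collator=data_collator,
)
trainer.train()
```

Everything else in the notebook (dataset loading, tokenization, training arguments) can stay as it is; only the data_collator argument changes.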


Thank you so much!