Using a dataset with already masked tokens

I am trying to fine-tune BERT for masked language modeling, and I would like to use a dataset that already contains masked tokens (I want to mask particular words rather than randomly chosen ones).
How can I do this?
I am following these instructions, but I am not sure which parts of the code I need to change for it to be compatible with a dataset that already has [MASK] tokens in it.

The masking is done by the data collator, DataCollatorForLanguageModeling. Pass mlm=False to that data collator to deactivate its random masking.
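One caveat worth checking: as far as I can tell, with mlm=False the collator builds the labels as a copy of input_ids (with padding ignored), so the label at a [MASK] position would be the [MASK] id itself rather than the original word. If you still have the unmasked text, a rough sketch of building the labels yourself could look like the following. Everything here is an assumption for illustration: the helper name build_labels is hypothetical, the token ids are toy values, and it assumes the masked and original sequences line up token for token (each masked word is a single token).

```python
# Assumed for illustration: 103 is the [MASK] id in the bert-base-uncased
# vocabulary; use tokenizer.mask_token_id for your own model.
MASK_ID = 103
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss


def build_labels(masked_ids, original_ids, mask_id=MASK_ID):
    """Keep the original token id at masked positions, ignore all others."""
    if len(masked_ids) != len(original_ids):
        raise ValueError("masked and original sequences must align token by token")
    return [
        orig if masked == mask_id else IGNORE_INDEX
        for masked, orig in zip(masked_ids, original_ids)
    ]


# Toy example: the third token was replaced by [MASK] in the dataset.
original_ids = [101, 1996, 4937, 2938, 102]
masked_ids = [101, 1996, 103, 2938, 102]
labels = build_labels(masked_ids, original_ids)
print(labels)  # [-100, -100, 4937, -100, -100]
```

The resulting labels would be stored in the dataset next to input_ids, so that the loss is computed only on the words you chose to mask. How those labels are padded and batched depends on the collator you end up using, which this sketch does not cover.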

Thank you so much!