Using a dataset with already masked tokens

I am trying to fine-tune BERT for Masked Language Modeling, and I would like to use a dataset that already contains masked tokens (I want to mask particular words rather than randomly chosen ones).
How can I do this?
I am following the instructions in this notebook: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling.ipynb#scrollTo=KDBi0reX3l_g
but I am not sure which parts of the code I need to change to make it work with a dataset that already contains [MASK] tokens.
Thanks!

The masking is done by the data collator DataCollatorForLanguageModeling. Just pass mlm=False to that data collator to deactivate the random masking there.
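For reference, here is a minimal sketch of what that could look like, assuming you keep the Trainer setup from the notebook and only swap in the collator. The model checkpoint, output directory, and the dataset variables (tokenized_train_dataset, tokenized_eval_dataset) are placeholders for your own pre-masked, tokenized splits:

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# mlm=False turns off the collator's random masking, so the [MASK] tokens
# already present in your tokenized dataset are passed through unchanged.
# (With mlm=False the collator builds labels by copying input_ids, with
# padding positions set to -100.)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-premasked-mlm"),  # placeholder output dir
    train_dataset=tokenized_train_dataset,   # placeholder: your pre-masked train split
    eval_dataset=tokenized_eval_dataset,     # placeholder: your pre-masked eval split
    data_collator=data_collator,
)
trainer.train()
```

Everything else in the notebook (dataset loading, tokenization, training arguments) can stay as it is; only the data_collator argument changes.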


Thank you so much!