I want to fine-tune BERT on a specific dataset. My problem is that I do not want to mask tokens of my training dataset randomly: I have already chosen which tokens I want to mask (for certain reasons).
To do so, I created a dataset with two columns: `text`, in which some tokens have been replaced with `[MASK]` (I am aware that some words are tokenised into more than one token, and I took care of that), and `label`, which contains the whole original text.
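For concreteness, here is a toy version of the dataset with made-up sentences (the second row shows a word that tokenises into more than one token, so it gets more than one `[MASK]`):

```python
# Toy version of my two-column dataset (sentences are made up):
rows = [
    {"text": "The capital of France is [MASK].",
     "label": "The capital of France is Paris."},
    # A word split into two wordpieces gets two [MASK] tokens:
    {"text": "She plays the [MASK] [MASK] every day.",
     "label": "She plays the accordion every day."},
]
```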
Now I want to fine-tune a BERT model (say, `bert-base-uncased`) using Hugging Face's `transformers` library, but I do not want to use `DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.2)`, where the masking is done randomly and I can only control the probability. What can I do?
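To clarify what I am after, this is roughly the labelling logic I have in mind, as a pure-Python sketch (no `transformers`): given the token ids of my masked `text` and of the full `label`, keep the target id only at `[MASK]` positions and use `-100` everywhere else, since `-100` is the index that PyTorch's cross-entropy loss ignores. Here `103` is `[MASK]`'s id in the `bert-base-uncased` vocabulary, and the other ids in the example are made up:

```python
MASK_ID = 103  # id of [MASK] in bert-base-uncased's vocabulary

def build_labels(input_ids, target_ids, mask_id=MASK_ID):
    """Compute loss only on my chosen masked positions.

    input_ids:  token ids of the pre-masked text.
    target_ids: token ids of the full original text (same length).
    Returns a label list with the target id at [MASK] positions
    and -100 (ignored by cross-entropy) everywhere else.
    """
    assert len(input_ids) == len(target_ids)
    return [t if i == mask_id else -100
            for i, t in zip(input_ids, target_ids)]

# Example with made-up token ids: only position 2 is masked.
inputs  = [101, 1996, 103, 1012, 102]
targets = [101, 1996, 3000, 1012, 102]
print(build_labels(inputs, targets))  # [-100, -100, 3000, -100, -100]
```

Is precomputing labels like this (and then skipping the MLM collator entirely) the right way to go, or is there a built-in mechanism for deterministic masking?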