Masking a specific token in each input sentence during masked language modelling

I have a dataset with 2 columns: token, sentence. For example:

{'token':'shrouded', 'sentence':'A mist shrouded the sun'}

I want to fine-tune one of the Hugging Face Transformers models on a masked language modelling task. (For now I am using distilroberta-base, as per this tutorial.)

Instead of random masking, I want to mask that specific token in each sentence during training. For example: A mist [MASK] the sun, and then have the model predict the token shrouded.
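To make the intended input concrete, here is a minimal sketch that builds the masked sentence at the text level. The helper name `mask_word` is hypothetical, and the default mask string `<mask>` is an assumption based on RoBERTa-family tokenizers (BERT-style tokenizers use `[MASK]`); in practice it is safer to read it from `tokenizer.mask_token`.

```python
import re

def mask_word(sentence: str, word: str, mask_token: str = "<mask>") -> str:
    """Replace the first whole-word occurrence of `word` with `mask_token`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    # A function is used as the replacement so that mask_token is inserted
    # literally, without any backslash-escape processing.
    return re.sub(pattern, lambda _match: mask_token, sentence, count=1)

example = {"token": "shrouded", "sentence": "A mist shrouded the sun"}
masked_sentence = mask_word(example["sentence"], example["token"])
print(masked_sentence)  # A mist <mask> the sun
```

Note that this inserts a single mask string. If the tokenizer splits the target word into several subword pieces, one mask will not line up with the original token ids, so masking at the token-id level may be more robust.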

I understand that for random masking we can simply pass DataCollatorForLanguageModeling to the Trainer. In this use case, however, the masking has to be done at the pre-processing stage, and I can’t figure out how to do that.
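One possible way to do the masking during pre-processing is sketched below. It works on token ids only, so it runs without downloading a model. The function name `mask_target_span`, the toy ids, and `MASK_ID` are all hypothetical. With a real tokenizer, the ids would presumably come from `tokenizer(sentence)["input_ids"]`, the mask id from `tokenizer.mask_token_id`, and the target ids from tokenizing the word without special tokens (for a byte-level BPE tokenizer such as RoBERTa's, probably with a leading space, e.g. `" shrouded"`, since a mid-sentence word is encoded together with its preceding space).

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default

def mask_target_span(
    input_ids: Sequence[int],
    target_ids: Sequence[int],
    mask_token_id: int,
) -> Tuple[List[int], List[int]]:
    """Mask the first occurrence of `target_ids` inside `input_ids`.

    Returns (masked_input_ids, labels). Labels hold the original ids at the
    masked positions and IGNORE_INDEX everywhere else, so the loss is only
    computed on the target token. If the target is not found, nothing is masked.
    """
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(masked)
    n = len(target_ids)
    if n == 0:
        return masked, labels
    for start in range(len(masked) - n + 1):
        if list(input_ids[start:start + n]) == list(target_ids):
            for i in range(start, start + n):
                labels[i] = input_ids[i]
                masked[i] = mask_token_id
            break
    return masked, labels

# Toy ids standing in for a tokenized sentence; 33 and 34 play the role of
# the subword pieces of the target word.
sentence_ids = [0, 11, 22, 33, 34, 44, 55, 2]
target_ids = [33, 34]
MASK_ID = 99  # hypothetical; use tokenizer.mask_token_id with a real tokenizer

masked_ids, labels = mask_target_span(sentence_ids, target_ids, MASK_ID)
print(masked_ids)  # [0, 11, 22, 99, 99, 44, 55, 2]
print(labels)      # [-100, -100, -100, 33, 34, -100, -100, -100]
```

A function like this could be called from the function passed to `datasets.map`, storing the results as `input_ids` and `labels`. Since the labels are already fixed at that point, the collator given to the Trainer would then only need to pad, rather than apply random masking again.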

Here is the code so far:

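from datasets import load_dataset
from transformers import AutoModelForMaskedLM, AutoTokenizer, Trainer, TrainingArguments

datasets = load_dataset('csv', data_files=['word_sentence_1.csv'])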

model_checkpoint = "distilroberta-base"

def tokenize_function(examples):
    return tokenizer(examples["sentence"])

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4)

model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    evaluation_strategy = "epoch",

##### Need to remove this and add logic of static masking ####
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
