Masking a specific token in each input sentence during masked language modelling

I have a dataset with 2 columns: token, sentence. For example:

{'token':'shrouded', 'sentence':'A mist shrouded the sun'}

I want to fine-tune one of the Hugging Face Transformers models on a masked language modelling task. (For now I am using distilroberta-base, as per this tutorial.)

Now, instead of random masking, I want to specifically mask the given token in each sentence during training, e.g. turn the sentence into A mist [MASK] the sun and have the model predict the token shrouded.

I understand that for random masking we can simply use DataCollatorForLanguageModeling and pass it to the Trainer. In this use case, however, the masking has to happen at the pre-processing stage, and I can't figure out how to do that.
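Conceptually, what I think the pre-processing has to do is something like the sketch below (untested, and the function name and the fixed max_length=64 are just placeholders I made up): use the fast tokenizer's offset mapping to find the sub-word tokens that cover the target word, replace them with tokenizer.mask_token_id, and build a labels list that holds the original ids at those positions and -100 everywhere else.

def mask_target_token(example, tokenizer, max_length=64):
    # Tokenize the original sentence; the offset mapping (fast tokenizers only)
    # tells us which characters each sub-word token covers.
    enc = tokenizer(
        example["sentence"],
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_offsets_mapping=True,
    )

    # Character span of the target word (assumes it appears verbatim in the sentence).
    start = example["sentence"].find(example["token"])
    end = start + len(example["token"])

    input_ids = enc["input_ids"]
    # Only the masked positions should contribute to the loss, so every other
    # label is set to -100 (ignored by the loss).
    labels = [-100] * len(input_ids)

    for i, (tok_start, tok_end) in enumerate(enc["offset_mapping"]):
        if tok_start == tok_end:
            continue  # special tokens and padding have (0, 0) offsets
        if tok_start < end and tok_end > start:
            # This sub-word token overlaps the target word: mask it and keep
            # the original id as its label.
            labels[i] = input_ids[i]
            input_ids[i] = tokenizer.mask_token_id

    enc["labels"] = labels
    enc.pop("offset_mapping")  # the model does not expect this key
    return enc

I am not sure this is the right approach, or how it should fit into the Trainer setup.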

Here is the code so far:

from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

datasets = load_dataset('csv', data_files=['word_sentence_1.csv'])

model_checkpoint = "distilroberta-base"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

def tokenize_function(examples):
    return tokenizer(examples["sentence"])

tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4)

model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-word_sentence_1_1",
    evaluation_strategy="no",  # no eval_dataset is passed to the Trainer below, so evaluation is disabled for now
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
)

##### Need to remove this and add logic of static masking ####
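# DataCollatorForLanguageModeling selects 15% of the tokens at random for the
# MLM objective at batch time, which is not what I want here.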
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    data_collator=data_collator,
)

trainer.train()
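
And this is roughly how I imagine plugging it in, replacing DataCollatorForLanguageModeling (again untested; because the sequences are already masked, labelled and padded to a fixed length during pre-processing, I am assuming the plain default_data_collator is enough to batch them):

from transformers import default_data_collator

# Static masking happens once, during pre-processing, instead of at batch time.
masked_datasets = datasets.map(
    lambda example: mask_target_token(example, tokenizer),
    remove_columns=["token", "sentence"],
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=masked_datasets["train"],
    data_collator=default_data_collator,  # just batching, no extra masking
)

trainer.train()

Is this the right way to do static masking, or is there a cleaner built-in way?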