I have a dataset with two columns: token and sentence. For example:
{'token': 'shrouded', 'sentence': 'A mist shrouded the sun'}
I want to fine-tune one of the Huggingface Transformers models on a Masked Language Modelling task (for now I am using distilroberta-base, as per this tutorial).
Now, instead of random masking, I am trying to specifically mask the token in the sentence during training, e.g. A mist [MASK] the sun, and then get the model to predict the token shrouded.
I understand that for random masking we can simply use DataCollatorForLanguageModeling and feed it into the Trainer. However, in this use case the masking has to be done at the pre-processing stage, and I can't figure out how to do that (I've put a rough sketch of what I have in mind after the code below).
Here is the code so far:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForMaskedLM, Trainer, TrainingArguments
datasets = load_dataset('csv', data_files=['word_sentence_1.csv'])
model_checkpoint = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

def tokenize_function(examples):
    return tokenizer(examples["sentence"])

tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-word_sentence_1_1",
    evaluation_strategy="epoch",  # note: this requires an eval_dataset to be passed to Trainer
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
)
##### Need to remove this and add logic of static masking ####
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],  # load_dataset('csv', ...) returns a DatasetDict, so select the split
    data_collator=data_collator,
)
trainer.train()
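For reference, this is the rough direction I was thinking of for the preprocessing step: a minimal, untested sketch that encodes each sentence, finds the ids of the target word, swaps them for the mask token, and builds the labels column by hand. The fixed max_length of 64, the leading-space encoding of the target word, and the switch to default_data_collator are just my assumptions; I am not sure this is the right way to do it.

# Rough, untested sketch: mask the target token at preprocessing time
# and precompute the labels myself, so no random masking is needed.
from transformers import default_data_collator

def mask_target_token(examples):
    # Pad to a fixed length so default_data_collator can stack the precomputed labels
    # (max_length=64 is an arbitrary choice for short sentences).
    enc = tokenizer(examples["sentence"], truncation=True, padding="max_length", max_length=64)
    all_labels = []
    for i, target in enumerate(examples["token"]):
        ids = list(enc["input_ids"][i])
        # Encode the target word with a leading space, the way it appears inside
        # a sentence for a RoBERTa-style BPE tokenizer.
        target_ids = tokenizer(" " + target, add_special_tokens=False)["input_ids"]
        labels = [-100] * len(ids)  # -100 = position ignored by the loss
        # Find the target span in the input ids and replace it with mask tokens.
        for start in range(len(ids) - len(target_ids) + 1):
            if ids[start:start + len(target_ids)] == target_ids:
                for j in range(len(target_ids)):
                    labels[start + j] = ids[start + j]        # model should predict the original id
                    ids[start + j] = tokenizer.mask_token_id  # put <mask> in the input
                break
        enc["input_ids"][i] = ids
        all_labels.append(labels)
    enc["labels"] = all_labels
    return enc

masked_datasets = datasets.map(mask_target_token, batched=True, remove_columns=["token", "sentence"])

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=masked_datasets["train"],
    data_collator=default_data_collator,  # labels are already built, so no collator-side masking
)

Is something along these lines the correct approach, or is there a cleaner way to do the static masking?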