Hi,
I was trying to figure out how whole word masking affects BERT, so I used TextDatasetForNextSentencePrediction together with DataCollatorForWholeWordMask.
But it gave me the following error during the training phase:
/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py in __getitem__(self, k)
   1614         if isinstance(k, str):
   1615             inner_dict = {k: v for (k, v) in self.items()}
-> 1616             return inner_dict[k]
   1617         else:
   1618             return self.to_tuple()[k]

KeyError: 'loss'
I looked it up, and this error happens when you don't pass the appropriate labels to the model.
Then I dug deeper, and it turns out DataCollatorForWholeWordMask doesn't handle the special tokens the same way DataCollatorForLanguageModeling does.
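Here is a quick way to see the difference I mean, using the tokenizer and the dataset defined further down. This is only a sketch, and the exact keys each collator returns may differ between transformers versions.

from transformers import DataCollatorForLanguageModeling, DataCollatorForWholeWordMask

# Collate the same few examples from TextDatasetForNextSentencePrediction
# with both collators and compare what they return.
examples = [dataset[i] for i in range(4)]

lm_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_cased_tokenizer, mlm=True, mlm_probability=0.15
)
wwm_collator = DataCollatorForWholeWordMask(
    tokenizer=bert_cased_tokenizer, mlm=True, mlm_probability=0.15
)

# Print the keys so any dropped or renamed fields stand out.
print(sorted(lm_collator(examples).keys()))
print(sorted(wwm_collator(examples).keys()))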
Is there any other way to combine whole word masking and next sentence prediction?
Also, since DataCollatorForWholeWordMask inherits from DataCollatorForLanguageModeling, isn't it supposed to be a drop-in replacement for it?
Dataset Creation
from transformers import TextDatasetForNextSentencePrediction

dataset = TextDatasetForNextSentencePrediction(
    tokenizer=bert_cased_tokenizer,
    file_path="./tmp.txt",
    block_size=256,
)
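For reference, in my setup each item the dataset yields looks like a dict of tensors (input_ids, token_type_ids, next_sentence_label), though this may differ between versions. This is just a quick peek, not part of the training code:

# Inspect the fields of one example produced by the dataset above.
example = dataset[0]
print({k: tuple(v.shape) for k, v in example.items()})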
Data Collator
from transformers import DataCollatorForWholeWordMask

data_collator = DataCollatorForWholeWordMask(
    tokenizer=bert_cased_tokenizer,
    mlm=True,
    mlm_probability=0.15,
)
Training
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir=PATHS["model"]["cased"]["mlm-nsp"]["training"]["local"],
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_gpu_train_batch_size=16,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
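In case it makes the question clearer, this is the rough workaround I had in mind: a small subclass that lets DataCollatorForWholeWordMask do the masking and then re-attaches the token_type_ids and next_sentence_label it drops. It is an untested sketch that assumes the dataset items are dicts of tensors as shown above, so I would rather use a supported approach if one exists.

import torch
from transformers import DataCollatorForWholeWordMask


class WholeWordMaskWithNSPCollator(DataCollatorForWholeWordMask):
    """Untested sketch: keep the fields the whole-word-mask collator drops
    so the pre-training model can still compute its loss."""

    def __call__(self, examples):
        # Pull out the fields before the parent collator sees the examples,
        # assuming each example is a dict of tensors like the ones produced
        # by TextDatasetForNextSentencePrediction above.
        token_type_ids = [e["token_type_ids"] for e in examples]
        next_sentence_label = torch.stack(
            [e["next_sentence_label"] for e in examples]
        )

        # The parent applies whole word masking to input_ids and builds labels.
        batch = super().__call__(examples)

        # Pad token_type_ids to the collated sequence length and re-attach
        # everything the parent dropped.
        seq_len = batch["input_ids"].size(1)
        padded = torch.zeros(len(examples), seq_len, dtype=torch.long)
        for i, tt in enumerate(token_type_ids):
            padded[i, : tt.size(0)] = tt

        batch["token_type_ids"] = padded
        batch["next_sentence_label"] = next_sentence_label
        return batch


# Hypothetical drop-in for the collator above:
# data_collator = WholeWordMaskWithNSPCollator(
#     tokenizer=bert_cased_tokenizer, mlm=True, mlm_probability=0.15
# )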