I would like to know which words get masked by masked language modeling during pre-training.
How can I see the masked words while pre-training runs?
For example, below is my sample code.
from transformers import (
    BertTokenizer,
    BertForMaskedLM,
    LineByLineTextDataset,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# corpus, max_length, outputdir, epochs, and batch_size are set earlier in my script
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=corpus,
    block_size=max_length,
)

# The collator randomly masks 15% of the tokens in each batch
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

training_args = TrainingArguments(
    output_dir=outputdir,
    overwrite_output_dir=False,
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    save_steps=2000,
    save_total_limit=2,
    prediction_loss_only=True,
    logging_steps=2000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
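What I have in mind is something like the sketch below: calling the collator directly on a few dataset examples and decoding the result to see where [MASK] ended up. This is only my guess at an approach, not something from the docs; I am assuming dataset[i] is valid input for the collator, and since the masking is random on each call, I suspect this would not show the exact tokens masked during an actual training step.

# Rough sketch (my own guess): run a few examples through the collator
# and decode the output to see which positions were masked.
examples = [dataset[i] for i in range(3)]  # assuming dataset items are valid collator input
batch = data_collator(examples)
for input_ids, labels in zip(batch["input_ids"], batch["labels"]):
    print(tokenizer.decode(input_ids))
    # masked positions keep their original token id in `labels`;
    # everything else is -100
    masked_ids = [int(l) for l in labels if int(l) != -100]
    print("masked words:", tokenizer.convert_ids_to_tokens(masked_ids))

Is something like this the right way to see the masked words, or is there a built-in way to log them during training?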
Thank you.