I am planning to train a language model from scratch (rather than fine-tuning a pretrained one). I have a couple of basic questions related to this:
I want to use whole-word masking when training the LM from scratch, but I could not find how to apply this option with the Trainer.
Here is my dataset and code:
import transformers as tr

text = ['I am huggingface fan', 'I love huggingface', ....]

# Randomly masks 15% of tokens, but at the individual sub-token level
data_collator = tr.DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
trainer = tr.Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data,
)
trainer.train()
But this doesn't take whole-word masking into account.

- How can I apply whole-word masking when training the LM with the PyTorch Trainer? (see the first sketch below)
- How can I train on sequences longer than the model's max length using the PyTorch Trainer? (see the second sketch below)
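
For the first question, I found DataCollatorForWholeWordMask in transformers, which looks like a drop-in replacement for DataCollatorForLanguageModeling. Here is a minimal sketch of what I am assuming would work, with everything except the collator unchanged. As far as I can tell it infers word boundaries from WordPiece-style '##' continuation tokens, so I assume it only behaves as intended with BERT-like tokenizers:

import transformers as tr

# Assumption: DataCollatorForWholeWordMask masks all sub-tokens of a chosen
# word together, instead of masking each sub-token independently.
data_collator = tr.DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
trainer = tr.Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data,
)
trainer.train()

Is this the intended way to enable whole-word masking with the Trainer?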
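
For the second question, the only approach I can think of is to split long documents into chunks of at most max_length before training, so the model still sees every part of the text. A sketch assuming train_data is a datasets.Dataset with a "text" column and that tokenizer is a fast tokenizer (return_overflowing_tokens needs one):

max_length = tokenizer.model_max_length  # e.g. 512

def tokenize_and_chunk(examples):
    # Split each document into overlapping max_length chunks
    # instead of truncating it down to a single chunk.
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_length,
        stride=64,                       # overlap between consecutive chunks
        return_overflowing_tokens=True,  # keep every chunk, not just the first
    )

train_data = train_data.map(
    tokenize_and_chunk,
    batched=True,
    remove_columns=train_data.column_names,
)

Is chunking like this the recommended way, or does the Trainer have some built-in support for sequences longer than the model's max length?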