How to train a language model with whole-word masking using the PyTorch Trainer API

I am thinking of fine-tuning a model by first training a language model from scratch. I have a couple of basic questions related to this:

I want to use whole-word masking when training the LM from scratch, but I could not find how to apply this option using the Trainer.

Here are my dataset and code:

import transformers as tr
from transformers import DataCollatorForLanguageModeling

text = ['I am huggingface fan', 'I love huggingface', ....]

# Token-level masking: 15% of individual (sub-word) tokens are masked
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = tr.Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data,
)

trainer.train()

But this collator masks individual tokens, so it does not take whole-word masking into account.
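From what I can tell, transformers also ships a DataCollatorForWholeWordMask that masks all sub-word pieces of a word together. Below is a minimal sketch of the swap I have in mind; I am not sure it is the intended way, and it assumes a BERT-style WordPiece tokenizer (the collator detects word boundaries via the '##' sub-word prefix):

from transformers import DataCollatorForWholeWordMask

# Same constructor arguments as DataCollatorForLanguageModeling, but
# when any sub-word piece of a word is selected for masking, all of
# that word's pieces are masked together
data_collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = tr.Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data,
)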

  • How can I train the LM with whole-word masking using the PyTorch Trainer? Is the collator swap sketched above the right approach?

  • How can I train on sequences that are longer than the model's max length using the PyTorch Trainer? (A chunking sketch of what I mean follows below.)
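For the second question, the only approach I have found is the one used in the Hugging Face language-modeling examples: tokenize everything, concatenate, and split into fixed-size blocks no longer than the model's max length, so long texts become several training examples instead of being truncated. A rough sketch, assuming a datasets library Dataset built from my text list and a hypothetical block_size of 512:

from datasets import Dataset

block_size = 512  # assumed; should not exceed the model's max length

raw_dataset = Dataset.from_dict({'text': text})

def tokenize(examples):
    return tokenizer(examples['text'])

def group_texts(examples):
    # Concatenate all tokenized sequences, then cut the result into
    # fixed-size blocks so no example exceeds block_size tokens
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated['input_ids']) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

tokenized = raw_dataset.map(tokenize, batched=True, remove_columns=['text'])
train_data = tokenized.map(group_texts, batched=True)

Is this the recommended way to handle long sequences with the Trainer, or is there a built-in option I am missing?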