How to train TFBertForMaskedLM with TFTrainer

I want to fine-tune a pretrained BERT on a custom corpus with the masked language modeling objective. I want to use TPUs on Google Cloud, so I’d like to work with TFTrainer to avoid writing my own training loop and worrying about its performance. I can’t find any info on how the masking is actually supposed to be performed here. When using Trainer in PyTorch, it seems that DataCollatorForLanguageModeling takes care of this.

This is what my setup looks like:
    tokenizer = BertTokenizer.from_pretrained(args.tokenizer_name)

    ds = load_dataset('text', data_files=[args.train_data_file])
    dataset = ds['train'].map(lambda examples: tokenizer(examples['text']), batched=True)
    dataset.set_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask'])
    features = {x: dataset[x].to_tensor(default_value=0, shape=[None, tokenizer.max_len]) for x in ['input_ids', 'token_type_ids', 'attention_mask']}
    tfdataset =

    trainer = TFTrainer(model, training_args, tfdataset, optimizer = (tfa.optimizers.LAMB, None))


I’m using the datasets library to load a line-by-line txt file and convert it to a TensorFlow dataset, which currently goes straight to the trainer. Where and how in this process is the masking supposed to be added?
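
For reference, this is my understanding of what DataCollatorForLanguageModeling does on the PyTorch side: it selects about 15% of the non-special tokens, replaces 80% of those with [MASK], 10% with a random token, leaves 10% unchanged, and sets the label to -100 everywhere else so the loss ignores those positions. Below is a minimal, framework-agnostic sketch of that logic in plain Python. It is not part of TFTrainer; the function name mask_tokens and its parameters are made up for illustration, and the token ids in the usage lines are the bert-base-uncased values ([PAD]=0, [CLS]=101, [SEP]=102, [MASK]=103, vocabulary size 30522), which may differ for another tokenizer.

```python
import random

IGNORE_INDEX = -100  # label value for positions the loss should skip


def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids,
                mlm_probability=0.15, rng=None):
    """Hypothetical helper: return (masked_input_ids, labels) for one sequence."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, tok in enumerate(input_ids):
        # Special tokens ([CLS], [SEP], [PAD]) are never selected for masking.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model has to predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id              # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return masked, labels


# Assumed bert-base-uncased ids, for illustration only.
ids = [101, 2023, 2003, 1037, 7099, 6251, 102, 0, 0]
masked, labels = mask_tokens(ids, mask_token_id=103, vocab_size=30522,
                             special_ids={0, 101, 102}, rng=random.Random(0))
```

What I can’t tell is where the equivalent step is supposed to live with TFTrainer: applied once to the features before building tfdataset (which would give static masks), or re-applied every epoch inside the input pipeline.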

Hi, just wanted to know whether you were able to train the model using this code snippet?