I want to fine-tune a pretrained BERT for masked language modeling on a custom corpus. I want to use TPUs on Google Cloud, so I'd like to work with TFTrainer to avoid writing my own training loop and worrying about its performance. I can't find any info on how the masking is actually supposed to be performed here; when using Trainer in PyTorch, it seems that DataCollatorForLanguageModeling takes care of this.
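For reference, my understanding of what DataCollatorForLanguageModeling does, following the BERT paper's 80/10/10 scheme (the constants below are my assumptions for bert-base-uncased, and `mask_tokens` is my own illustrative helper, not library code):

```python
import random

MASK_TOKEN_ID = 103    # [MASK] id in bert-base-uncased (assumption)
VOCAB_SIZE = 30522     # bert-base-uncased vocab size (assumption)
IGNORE_INDEX = -100    # label value the loss function ignores

def mask_tokens(input_ids, mlm_probability=0.15, rng=None):
    """BERT-style MLM masking for one example (a list of token ids)."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, tok in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue                               # not selected: loss ignores it
        labels[i] = tok                            # model must predict the original
        r = rng.random()
        if r < 0.8:
            masked[i] = MASK_TOKEN_ID              # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token
    return masked, labels
```

(The real collator also avoids selecting special tokens like [CLS] and [SEP], which this sketch skips.)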
This is what my setup looks like:
from datasets import load_dataset
from transformers import BertTokenizer, TFTrainer
import tensorflow as tf
import tensorflow_addons as tfa

tokenizer = BertTokenizer.from_pretrained(args.tokenizer_name)
ds = load_dataset('text', data_files=[args.train_data_file])
dataset = ds['train'].map(lambda examples: tokenizer(examples['text'], truncation=True), batched=True)
dataset.set_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask'])
features = {x: dataset[x].to_tensor(default_value=0, shape=[None, tokenizer.model_max_length]) for x in ['input_ids', 'token_type_ids', 'attention_mask']}
tfdataset = tf.data.Dataset.from_tensor_slices(features).batch(32)
trainer = TFTrainer(model, training_args, tfdataset, optimizers=(tfa.optimizers.LAMB(), None))
trainer.train()
trainer.save_model(args.save_path)
I'm using the datasets library to load a line-by-line txt file and convert it to a tensorflow dataset; this then goes straight to the trainer at the moment. Where and how in this process is the masking supposed to be added?
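In case it helps frame the question, this is roughly where I'd imagine the masking would slot in: a function mapped over the tf.data pipeline that produces (features, labels) pairs (I'm assuming TFTrainer expects the dataset to yield such tuples; `mask_tokens` below is my own sketch of the 80/10/10 scheme, not a transformers API, and the constants are assumptions for bert-base-uncased):

```python
import tensorflow as tf

MASK_TOKEN_ID = 103    # [MASK] id in bert-base-uncased (assumption)
VOCAB_SIZE = 30522     # bert-base-uncased vocab size (assumption)

def mask_tokens(input_ids, mlm_probability=0.15):
    """BERT-style masking on a batch of int32 token ids."""
    shape = tf.shape(input_ids)
    selected = tf.random.uniform(shape) < mlm_probability
    # labels: original ids where selected, -100 (ignored by the loss) elsewhere
    labels = tf.where(selected, input_ids, tf.fill(shape, -100))
    # 80% of selected positions -> [MASK]
    to_mask = selected & (tf.random.uniform(shape) < 0.8)
    input_ids = tf.where(to_mask, tf.fill(shape, MASK_TOKEN_ID), input_ids)
    # half of the rest (10% overall) -> random token; final 10% unchanged
    to_random = selected & ~to_mask & (tf.random.uniform(shape) < 0.5)
    random_ids = tf.random.uniform(shape, maxval=VOCAB_SIZE, dtype=input_ids.dtype)
    input_ids = tf.where(to_random, random_ids, input_ids)
    return input_ids, labels

def add_mlm_labels(features):
    """Map step: replace input_ids with their masked version, emit labels."""
    masked_ids, labels = mask_tokens(features['input_ids'])
    return dict(features, input_ids=masked_ids), labels

# tfdataset = tfdataset.map(add_mlm_labels)  # before handing it to the trainer
```

Is something along these lines what's expected, or does TFTrainer have a built-in hook for this?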