I am training an MLM (masked language modeling) model using the Hugging Face Trainer API with PyTorch. Here is my initial code.
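import torch
import transformers as tr  # alias assumed from the tr.* calls below
from transformers import DataCollatorForWholeWordMask

# Collator that masks whole words with probability 0.15 for MLM.
data_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

class SEDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # Build one example as a dict of tensors, one entry per encoding field.
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        return item

    def __len__(self):
        return len(self.encodings["attention_mask"])

train_data = SEDataset(train_encodings)
print("train_data prepared")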
training_args = tr.TrainingArguments(
    output_dir='results_mlm_mmt2',
    logging_dir='logs_mlm_mmt2',  # directory for storing logs
    save_strategy="epoch",
    learning_rate=2e-5,
    logging_steps=40000,
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=32,
    prediction_loss_only=True,
    gradient_accumulation_steps=2,
    fp16=True,
)
trainer = tr.Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data,
)
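As a rough aid for choosing step-based settings such as logging_steps, the arithmetic below estimates how many optimizer updates one epoch produces with the batch size and gradient accumulation used above. This is only a sketch: the helper name is hypothetical, the dataset size of 1,000,000 examples is a made-up number, a single device is assumed, and the exact step count computed by the Trainer may differ slightly between library versions.

```python
import math

def optimizer_steps_per_epoch(num_examples, per_device_batch_size,
                              grad_accum_steps, num_devices=1):
    """Estimate optimizer updates per epoch (hypothetical helper, not a Trainer API)."""
    # Number of batches the dataloader yields per epoch (the last one may be smaller).
    batches = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One optimizer update happens every `grad_accum_steps` batches.
    return max(batches // grad_accum_steps, 1)

# Batch size and accumulation come from the TrainingArguments above;
# the dataset size is an assumed example value.
steps = optimizer_steps_per_epoch(num_examples=1_000_000,
                                  per_device_batch_size=32,
                                  grad_accum_steps=2)
total_steps = steps * 10  # num_train_epochs=10
print(steps, total_steps)
```

Under these assumed numbers the whole run has 156250 optimizer steps, so logging_steps=40000 would log only about three times; a smaller value may be needed to see metrics regularly.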
The training setup above works fine, but I would like to add a few things:
- How can I include validation and test text data, and in which format? Do I also need to pass labels for the validation set?
- How can I get some MLM-related metrics printed every N steps?
- How can I test my trained model?
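To make the first two questions more concrete, here is a minimal sketch in plain Python of what is meant. Both split_encodings and perplexity are hypothetical helpers, not part of any library. The first carves one tokenizer-output dict (same layout as train_encodings, with values assumed to be plain lists) into train/validation/test dicts, each of which could be wrapped in SEDataset. The second turns a mean cross-entropy loss into perplexity, a common MLM metric. The sketch also assumes that the masking collator builds the labels itself when it masks a batch, so no separate labels would be stored for the validation set.

```python
import math

def split_encodings(encodings, val_fraction=0.1, test_fraction=0.1):
    """Split a dict of equal-length lists into train/val/test dicts (hypothetical helper)."""
    n = len(encodings["attention_mask"])
    n_val = int(n * val_fraction)
    n_test = int(n * test_fraction)
    n_train = n - n_val - n_test
    # Contiguous index ranges for each split; shuffle beforehand if order matters.
    bounds = {
        "train": (0, n_train),
        "val": (n_train, n_train + n_val),
        "test": (n_train + n_val, n),
    }
    return {
        name: {key: values[start:stop] for key, values in encodings.items()}
        for name, (start, stop) in bounds.items()
    }

def perplexity(mean_loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(mean_loss)

# Toy stand-in for tokenizer output: 10 already-tokenized examples.
toy = {
    "input_ids": [[101, i, 102] for i in range(10)],
    "attention_mask": [[1, 1, 1] for _ in range(10)],
}
splits = split_encodings(toy)
print({name: len(part["input_ids"]) for name, part in splits.items()})
print(perplexity(0.0))
```

With real tokenizer output, the validation split wrapped in SEDataset could be passed to the trainer through its eval_dataset argument, and the eval_loss entry of the dict returned by trainer.evaluate() could be fed to perplexity.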