I am training a masked language model (MLM) using XLM-RoBERTa Large.
Here is my code:

import torch
import pandas as pd
import transformers as tr

# load tokenizer and model from a local copy of xlm-roberta-large
tokenizer = tr.XLMRobertaTokenizer.from_pretrained("xlm-roberta-large", local_files_only=True)
model = tr.XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-large", return_dict=True, local_files_only=True)
df = pd.read_csv("training_data_multilingual.csv")

# de-duplicate the messages and drop NaN entries before tokenizing
train_df = df.message_text.tolist()
train_df = list(set(train_df))
train_df = [x for x in train_df if str(x) != 'nan']

train_encodings = tokenizer(train_df, truncation=True, padding=True, max_length=512, return_tensors="pt")
class SEDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # the encodings already hold tensors (return_tensors="pt"), so just index them
        return {key: val[idx] for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings["attention_mask"])
train_data = SEDataset(train_encodings)
# print("train data created")
training_args = tr.TrainingArguments(
    output_dir='results_mlm_vocab_exp',
    logging_dir='logs_mlm_vocab_exp',   # directory for storing logs
    save_strategy="epoch",
    learning_rate=2e-5,
    logging_steps=6000,
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=2,
    prediction_loss_only=True,
    gradient_accumulation_steps=4,
    bf16=True,                          # Ampere GPU
    optim="adamw_hf",
)
# data_collator was missing from the snippet above; for MLM I use the standard
# collator that randomly masks 15% of the tokens and builds the labels
data_collator = tr.DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = tr.Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data,
)
trainer.train()
I have a few questions related to this:

- How is the loss calculated in MLM training? During training I see logs like {'loss': 1.6117, 'learning_rate': 1.751861042183623e-05, 'epoch': 2.48}. I guess this is the training loss? If so, how exactly is it calculated? (My current understanding is sketched below.)
- How do I pass validation data? Does it go inside TrainingArguments, and should it be prepared the same way as the training data? (See my attempt after this list.)
- Is it even meaningful to compute precision, recall and F1 score for training and validation data in MLM training? If so, how can I do it using the Trainer? (My rough attempt is at the end.)
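My current understanding of the loss, which I'd like confirmed: DataCollatorForLanguageModeling selects roughly 15% of the tokens for masking and sets the labels of all other positions to -100, so the model computes token-level cross-entropy only over the masked positions. A minimal sketch of what I think happens (dummy shapes and token ids, not the library's actual code):

import torch
import torch.nn.functional as F

vocab_size = 250002                                   # xlm-roberta-large vocabulary
logits = torch.randn(2, 6, vocab_size)                # (batch, seq_len, vocab) - dummy model output
labels = torch.full((2, 6), -100, dtype=torch.long)   # -100 = position ignored by the loss
labels[0, 2] = 17                                     # pretend token id 17 was masked here
labels[1, 4] = 42                                     # and token id 42 here

loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
print(loss)   # a number like the 'loss' value in the training logs, if I understand correctly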
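For the validation question, this is what I have in mind, but I'm not sure it is the right way. The split, the val_texts/val_data names and the 10% hold-out are just my own choices; as far as I can tell the evaluation schedule goes into TrainingArguments while the data itself is passed to the Trainer:

from sklearn.model_selection import train_test_split

# hold out 10% of the deduplicated messages for validation
train_texts, val_texts = train_test_split(train_df, test_size=0.1, random_state=42)

val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=512, return_tensors="pt")
val_data = SEDataset(val_encodings)
# (train_data would then also be rebuilt from train_texts instead of the full list)

training_args = tr.TrainingArguments(
    output_dir='results_mlm_vocab_exp',
    evaluation_strategy="epoch",   # run evaluation at the end of every epoch
    # ... same remaining arguments as above ...
)

trainer = tr.Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data,
    eval_dataset=val_data,         # validation data goes here, not into TrainingArguments?
)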
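For precision/recall/F1, my rough attempt is below, using the compute_metrics hook of the Trainer and scoring only the masked positions. I'm not sure this is meaningful for MLM (and the eval logits for the whole validation set can get very large), which is exactly what I'm asking:

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    preds = np.argmax(predictions, axis=-1)        # predicted token id at every position
    mask = labels != -100                          # only score the masked positions
    p, r, f1, _ = precision_recall_fscore_support(
        labels[mask], preds[mask], average="micro", zero_division=0
    )
    return {"precision": p, "recall": r, "f1": f1}

# trainer = tr.Trainer(..., compute_metrics=compute_metrics, eval_dataset=val_data)
# I think prediction_loss_only=True would have to be removed for this to work.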