I’ve been working on a finetuning class for mT5 to enable easy creation of finetuned versions for various language pairs (similarly to how the MarianMT models were delivered).
Initially my script was based on my successful MarianMT setup, but it didn’t work too well with mT5. I documented my doubts here.
I then applied the advice from that thread to bring my configuration more in line with best practices. However, I’m not sure that’s all I need, and I still have some concerns. Perhaps someone can offer some guidance?
I let the collator handle padding and added add_special_tokens=True so that EOS is included automatically. I’m still unsure about passing the target input_ids as model_inputs['labels'], so some clarification there would be good.
Since I’m finetuning for one specific task only, I removed the task prefix so that the model performs the task by default. I’m not sure if that’s optimal?
def _preprocess_function(self, examples):
    """
    Tokenize and preprocess a batch of examples for model training.
    """
    # Task-specific prefix (left empty since the class finetunes for a single task)
    prefix = ""
    inputs = [prefix + text for text in examples[self.src_key]]
    targets = examples[self.mt_key]
    model_inputs = self.tokenizer(
        inputs,
        max_length=self.samples_filter_max_tokens,
        # padding='longest',  # padding is left to the data collator
        add_special_tokens=True,
        truncation=True,
    )
    labels = self.tokenizer(
        targets,
        max_length=self.samples_filter_max_tokens,
        # padding='longest',  # padding is left to the data collator
        add_special_tokens=True,
        truncation=True,
    )
    # Replacing pad token IDs with -100 in the labels is now handled by the collator (label_pad_token_id)
    # labels["input_ids"] = [
    #     [(label if label != self.tokenizer.pad_token_id else -100) for label in labels_seq]
    #     for labels_seq in labels["input_ids"]
    # ]
    model_inputs['labels'] = labels['input_ids']
    return model_inputs
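To double-check the special-tokens question, a quick sanity check along these lines (google/mt5-small here is just an example stand-in for my checkpoint) shows that add_special_tokens=True appends </s>, so both inputs and labels end with EOS. My current understanding is that passing the target input_ids as labels is fine because decoder_input_ids are derived by shifting the labels right, but I’d appreciate confirmation.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")  # example checkpoint
enc = tokenizer("Guten Morgen", add_special_tokens=True)
print(enc["input_ids"][-1] == tokenizer.eos_token_id)      # True: EOS is appended
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))   # last token is '</s>'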
Then in the data collator I set the label pad token to -100:
data_collator = DataCollatorForSeq2Seq(
    tokenizer=self.tokenizer,
    model=self.model,
    label_pad_token_id=-100,
)
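To see what the collator actually does with such features, a toy batch like this (again with google/mt5-small as a stand-in) should show the shorter label sequence padded with -100 while input_ids get the regular pad token, plus decoder_input_ids created from the labels because model= is passed:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")      # example checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, label_pad_token_id=-100)

features = [
    {"input_ids": tokenizer("Good morning", add_special_tokens=True)["input_ids"],
     "labels": tokenizer("Guten Morgen", add_special_tokens=True)["input_ids"]},
    {"input_ids": tokenizer("How are you today?", add_special_tokens=True)["input_ids"],
     "labels": tokenizer("Wie geht es dir heute?", add_special_tokens=True)["input_ids"]},
]
batch = collator(features)
print(batch["labels"])             # shorter label sequence padded with -100
print(batch["decoder_input_ids"])  # shifted labels, added because model= was passed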
Then in the training args I set adafactor=True; I’m expecting that this makes the Trainer use sensible default arguments for the Adafactor optimizer?
Since fp16 is now supposedly supported, I left it on.
training_args = Seq2SeqTrainingArguments(
    adafactor=True,  # Comment out to use the Adafactor defined below
    output_dir=self.output_dir,
    logging_steps=self.logging_steps,
    save_steps=self.save_steps,
    eval_steps=self.eval_steps,
    eval_strategy="steps",
    predict_with_generate=True,
    report_to="wandb" if not self.is_test_run else [],
    metric_for_best_model=self.metric_for_best_model,
    greater_is_better=True,
    load_best_model_at_end=True,
    save_total_limit=self.save_total_limit,
    learning_rate=self.learning_rate,
    weight_decay=self.weight_decay,
    per_device_train_batch_size=self.per_device_train_batch_size,
    per_device_eval_batch_size=self.per_device_eval_batch_size,
    gradient_accumulation_steps=self.gradient_accumulation_steps,
    auto_find_batch_size=self.auto_find_batch_size,
    num_train_epochs=self.num_train_epochs,
    run_name=f"{self.run_name}",
    seed=self.seed,
    fp16=True,  # Enable mixed precision if supported
)
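For reference, my reading of the Trainer source is that adafactor=True amounts to roughly the following, i.e. Adafactor with a fixed learning rate instead of its internal relative-step schedule (the weight-decay parameter grouping that the Trainer adds is omitted here). Treat this as a sketch of what I think the flag expands to, not something I pass myself:

from transformers.optimization import Adafactor

optimizer = Adafactor(
    self.model.parameters(),
    lr=self.learning_rate,   # taken from learning_rate in the training args
    scale_parameter=False,
    relative_step=False,
)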
If this is a suboptimal use of Adafactor, I thought of defining the optimizer more explicitly and passing it to the Trainer. Would that have exactly the same effect, or is there any advantage to doing it this way:
# Initialize the optimizer explicitly to have more control over the learning rate
# optimizer = Adafactor(
#     self.model.parameters(),
#     lr=self.learning_rate,
#     scale_parameter=False,
#     relative_step=False,
#     warmup_init=False
# )
trainer = Seq2SeqTrainer(
    model=self.model,
    args=training_args,
    compute_metrics=self._compute_mt_metrics,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    tokenizer=self.tokenizer,
    data_collator=data_collator,
    # optimizers=(optimizer, None),  # Uncomment to use the optimizer defined above
)
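The other explicit variant I’m considering, based on the Adafactor documentation, is letting Adafactor manage the step size itself and pairing it with AdafactorSchedule so the Trainer can still log a learning rate. This is just a sketch of that alternative, not something I have verified on mT5 yet:

from transformers.optimization import Adafactor, AdafactorSchedule

optimizer = Adafactor(
    self.model.parameters(),
    lr=None,                 # let Adafactor derive the step size
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
)
lr_scheduler = AdafactorSchedule(optimizer)

trainer = Seq2SeqTrainer(
    model=self.model,
    args=training_args,
    # ... same datasets, collator and compute_metrics as above ...
    optimizers=(optimizer, lr_scheduler),
)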
So these are all the changes I think were needed.
I’m running a test run now. I see that the grad_norm values are still very large compared to what I had at the start of my MarianMT training runs (with a much lower LR there, about 3 orders of magnitude lower, while the grad_norms here are well over 6-7 orders of magnitude larger). They are better than in my previous runs, though.
Is that normal and expected? What grad_norm values did you see when finetuning on a domain-specific dataset with the suggested LRs?