Finetuning mT5 for a specific language pair

I’ve been working on a finetuning class for mT5 to enable easy creation of finetuned versions for various language pairs (similar to how MarianMT was delivered).
Initially I based my script on my successful MarianMT setup, but it didn’t work too well with mT5. I documented my doubts here

Then I applied the advice from that thread to bring my configuration more in line with best practices. However, I’m not sure that’s all I need, and I still have some concerns. Perhaps someone can offer some guidance?

I let the collator handle padding and added add_special_tokens=True so that EOS is included automatically. I’m still unsure whether passing labels['input_ids'] as model_inputs['labels'] is correct; it would be good to get some clarification on that.
Since I’m finetuning for one specific task only, I removed the task prefix so that the model performs the task by default. I’m not sure if this is optimal?

    def _preprocess_function(self, examples):
        """
        Tokenize and preprocess a batch of examples for model training.

        """
        # No task prefix: the model is finetuned for a single task only
        prefix = ""
        inputs = [prefix + text for text in examples[self.src_key]]
        targets = examples[self.mt_key]
        model_inputs = self.tokenizer(
            inputs,
            max_length=self.samples_filter_max_tokens,
            add_special_tokens=True,  # appends EOS automatically
            truncation=True,
            # padding is left to the data collator
        )

        labels = self.tokenizer(
            targets,
            max_length=self.samples_filter_max_tokens,
            add_special_tokens=True,
            truncation=True,
            # padding is left to the data collator
        )

        # Replace pad token ID with -100 to ignore in loss computation
        # labels["input_ids"] = [
        #     [(label if label != self.tokenizer.pad_token_id else -100) for label in labels_seq]
        #     for labels_seq in labels["input_ids"]
        # ]
        model_inputs['labels'] = labels['input_ids']
        return model_inputs
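
For what it’s worth, here is a sketch of the same preprocessing written with the tokenizer’s text_target argument (assuming the same self.src_key, self.mt_key and self.samples_filter_max_tokens attributes as above). As far as I understand, this produces the labels key directly, so the manual model_inputs['labels'] assignment would not be needed:

    def _preprocess_function(self, examples):
        """Tokenize a batch, letting the tokenizer build labels via text_target."""
        model_inputs = self.tokenizer(
            examples[self.src_key],
            text_target=examples[self.mt_key],
            max_length=self.samples_filter_max_tokens,
            truncation=True,
            # add_special_tokens defaults to True, so EOS is appended to source and target
        )
        # model_inputs now contains input_ids, attention_mask and labels
        return model_inputs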

Then, in the collator, I set the label pad token to -100:

        data_collator = DataCollatorForSeq2Seq(
            tokenizer=self.tokenizer,
            model=self.model,
            label_pad_token_id=-100, 
        )
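
To convince myself that the commented-out -100 replacement in the preprocessing really is redundant, here is a standalone sketch (using google/mt5-small as a stand-in checkpoint): the shorter label sequence should come back padded with -100 rather than with the pad token id.

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq

    tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
    collator = DataCollatorForSeq2Seq(
        tokenizer=tokenizer,
        model=model,
        label_pad_token_id=-100,
    )

    # Two examples of different lengths; labels come straight from the tokenizer
    features = [
        {
            "input_ids": tokenizer("short source").input_ids,
            "labels": tokenizer("short target").input_ids,
        },
        {
            "input_ids": tokenizer("a noticeably longer source sentence").input_ids,
            "labels": tokenizer("a noticeably longer target sentence").input_ids,
        },
    ]

    batch = collator(features)
    # Expectation: padded label positions are -100, not tokenizer.pad_token_id
    print(batch["labels"])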

Then in the training args I set adafactor=True. I’m expecting this to use sensible default arguments for the Adafactor optimizer, is that right?
Since fp16 is now supposedly supported, I left it on.

        training_args = Seq2SeqTrainingArguments(
            adafactor=True,  # Comment this out to use the Adafactor defined further below
            output_dir=self.output_dir,
            logging_steps=self.logging_steps,
            save_steps=self.save_steps,
            eval_steps=self.eval_steps,
            eval_strategy="steps",
            predict_with_generate=True,
            report_to="wandb" if not self.is_test_run else [],
            metric_for_best_model=self.metric_for_best_model,
            greater_is_better=True,
            load_best_model_at_end=True,
            save_total_limit=self.save_total_limit,
            learning_rate=self.learning_rate,
            weight_decay=self.weight_decay,
            per_device_train_batch_size=self.per_device_train_batch_size,
            per_device_eval_batch_size=self.per_device_eval_batch_size,
            gradient_accumulation_steps=self.gradient_accumulation_steps,
            auto_find_batch_size=self.auto_find_batch_size,
            num_train_epochs=self.num_train_epochs,
            run_name=self.run_name,
            seed=self.seed,
            fp16=True,  # Enable mixed precision if supported
        )
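
As a side note, if I read the Trainer source correctly, adafactor=True is deprecated in recent transformers versions in favour of optim="adafactor", and both end up constructing Adafactor with scale_parameter=False and relative_step=False at the learning_rate from the args. So the equivalent (hopefully non-deprecated) form would be something like:

        training_args = Seq2SeqTrainingArguments(
            optim="adafactor",  # replaces the deprecated adafactor=True
            learning_rate=self.learning_rate,
            output_dir=self.output_dir,
            # ... the remaining arguments exactly as above ...
        )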

If this is a suboptimal use of Adafactor, I thought of defining the optimizer more explicitly and passing it to the trainer. Would that have exactly the same effect, or is there any advantage to doing this:

        # Initialize optimizer to have more control over the learning rate
        # optimizer = Adafactor(
        #     self.model.parameters(),
        #     lr=self.learning_rate,
        #     scale_parameter=False,
        #     relative_step=False,
        #     warmup_init=False
        # )
        
        trainer = Seq2SeqTrainer(
            model=self.model,
            args=training_args,
            compute_metrics=self._compute_mt_metrics,
            train_dataset=tokenized_train_dataset,
            eval_dataset=tokenized_eval_dataset,
            tokenizer=self.tokenizer,
            data_collator=data_collator,
            # optimizers=(optimizer, None),  # Uncomment to use the optimizer defined above
        )
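
For reference, the explicit variant I have in mind is sketched below. As far as I can tell, passing optimizers=(optimizer, None) means the Trainer still builds its default LR scheduler on top of my optimizer, so I would expect it to behave the same as adafactor=True, but please correct me if that is wrong.

        from transformers.optimization import Adafactor

        optimizer = Adafactor(
            self.model.parameters(),
            lr=self.learning_rate,   # same value as in the training args
            scale_parameter=False,   # use the supplied lr, not Adafactor's internal scaling
            relative_step=False,     # fixed lr rather than a time-dependent relative step
            warmup_init=False,
        )

        # Passed to the trainer as optimizers=(optimizer, None); the None means the
        # Trainer still creates its default LR scheduler on top of this optimizer.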

So these are all the changes I think were needed.
I’m running a test run. I see that the grad_norm values are still very large compared to what I had with MarianMT at the start of its training runs (with a much lower LR there, about 3 orders of magnitude lower, while here the grad_norms are well over 6-7 orders of magnitude larger). They are better than what I had in my previous runs, though.
(grad_norm plot from the test run)
Is that normal and expected? What grad_norm values did you see when finetuning on a domain-specific dataset with the suggested LRs?
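
One thing I have not touched is gradient clipping, so if I read the defaults correctly the Trainer is already clipping gradients to a total norm of 1.0 before each step:

    print(training_args.max_grad_norm)  # 1.0 unless overridden (Trainer default)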
