Did you solve your issue? I think passing the optimizer is enough; you don't need to pass it again as optim="adafactor" in the TrainingArguments.
Sharing my results from transfer learning with flan-t5-small for translation.
Experiment 1:
optimizer = Adafactor(model.parameters(), scale_parameter=False, relative_step=False, warmup_init=False, lr=1e-3)
# no scheduler
Experiment 2:
optimizer = Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)
lr_scheduler = AdafactorSchedule(optimizer)
Got faster convergence with experiment 1, but better end performance with experiment 2.
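For reference, this is roughly how the optimizer and scheduler are wired into the Trainer (a minimal sketch assuming the flan-t5-small setup above; train_dataset / eval_dataset are placeholders for whatever tokenized splits you already have, and the TrainingArguments are illustrative only):
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments
from transformers.optimization import Adafactor, AdafactorSchedule

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

# Experiment 2: Adafactor driving its own relative-step schedule
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,
)
lr_scheduler = AdafactorSchedule(optimizer)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="adafactor-test"),  # illustrative args only
    train_dataset=train_dataset,  # placeholder: your tokenized train split
    eval_dataset=eval_dataset,    # placeholder: your tokenized eval split
    optimizers=(optimizer, lr_scheduler),  # passing the pair here is enough; no optim="adafactor" needed
)
trainer.train()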
Hi! @pierreguillou
I tried the method you mentioned, i.e. using Adafactor in the Hugging Face transformers Trainer to fine-tune the original version of T5.
The version of transformers I am using is 4.28.1.
I used the run_translation.py script like you did. The script defaults to AdamW. Following the latest transformers documentation, I used --optim adafactor to select Adafactor and --learning_rate 1e-3 to set the learning rate to 1e-3.
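For context, the full command was along these lines (only --optim adafactor and --learning_rate 1e-3 are the point here; the model, dataset and language arguments below are placeholders standing in for my own setup):
python run_translation.py \
    --model_name_or_path t5-small \
    --do_train --do_eval \
    --source_lang en --target_lang de \
    --dataset_name wmt16 --dataset_config_name de-en \
    --source_prefix "translate English to German: " \
    --output_dir ./t5-adafactor \
    --per_device_train_batch_size 8 \
    --predict_with_generate \
    --optim adafactor \
    --learning_rate 1e-3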
Basically, this is the same way you are using it. Ultimately, I did not observe the issue you mentioned regarding eval_loss and learning_rate. However, the results I got show that using Adafactor is not as good as AdamW, all other parameters being equal (on my own dataset, of course).
I would like to ask: do you have any new findings or tips about using Adafactor in the Trainer?
Some people have mentioned multi-task fine-tuning of T5, and I am wondering if anyone has been successful in fine-tuning T5 on different task types, for example Q&A as well as summarization in the same model. I can train them separately using their corresponding models (e.g. AutoModelForQuestionAnswering or AutoModelForSeq2SeqLM). However, when I try to merge the datasets by interleaving or concatenating them and use T5ForConditionalGeneration to cover both tasks, only the summarizer works. Does anyone have examples that show multiple task types (e.g. Q&A + summarization) trained in the same model?
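For concreteness, this is roughly what I mean by merging the two tasks into one text-to-text dataset (a sketch only; the prefixes, dataset names and column mappings are my own illustrative choices, and this is the kind of setup where only the summarizer seems to learn):
from datasets import load_dataset, interleave_datasets

# Map both tasks to plain (source, target) text pairs with a task prefix,
# so a single T5ForConditionalGeneration can be trained on the mixture.
def to_qa(example):
    return {
        "source": "question: " + example["question"] + " context: " + example["context"],
        "target": example["answers"]["text"][0],
    }

def to_summarization(example):
    return {
        "source": "summarize: " + example["article"],
        "target": example["highlights"],
    }

qa = load_dataset("squad", split="train")
qa = qa.map(to_qa, remove_columns=qa.column_names)

summ = load_dataset("cnn_dailymail", "3.0.0", split="train")
summ = summ.map(to_summarization, remove_columns=summ.column_names)

# Interleave so every batch mixes both tasks instead of one task dominating
mixed = interleave_datasets([qa, summ], probabilities=[0.5, 0.5], seed=42)
# "mixed" is then tokenized and passed to Seq2SeqTrainer as usual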
would like to know as well
I'm facing something similar: passing optim="adafactor", with or without setting a learning_rate (or letting the default be set), each training phase shows a learning rate of "0.0", so my model is never updating.
Hi everyone, I'm giving mt5-small a go locally with 32k samples of varying length, up to 512 tokens. Not super familiar with training this model.
Got 2x 12GB RTX cards.
Previously I managed to successfully fine-tune MarianMT models on this hardware; the training was quite stable with 32 gradient accumulation steps, AdamW and a 5e-7 learning rate for 24 epochs, and the result performs well on the task it was fine-tuned on.
I've tried the same settings with mt5-small, except adjusting the learning rate to the advised 3e-4, and I have some concerns that perhaps some of you could help address:
- training appears to be going much slower
- why can't I fit the base-size model on 2x 12GB with samples of up to 512 tokens?
- the gradient norm has ridiculously high values
- the eval metrics have really low values
I’m using the same dataset, and the same training procedure as for MarianMT (except for learning rate). Why are the differences so large?
Here are some code snippets that might be relevant from my tuner class
@valhalla @sshleifer should I instead append the eos token to the end of each label and return the str labels instead of tokens? Where do you get the task prefix?
Is there a list of task prefixes used by Google for mT5? I guess it would be easier if I plugged into an existing prefix.
def _preprocess_function(self, examples):
    """
    Tokenize and preprocess a batch of examples for model training.
    """
    # Add task-specific prefix
    prefix = f"<{self.src_key}2{self.mt_key}>"
    inputs = [prefix + text for text in examples[self.src_key]]
    targets = examples[self.mt_key]
    model_inputs = self.tokenizer(inputs, max_length=self.samples_filter_max_tokens, padding='max_length', truncation=True)
    with self.tokenizer.as_target_tokenizer():
        labels = self.tokenizer(targets, max_length=self.samples_filter_max_tokens, padding='max_length', truncation=True)
    # Replace pad token ID with -100 to ignore in loss computation
    labels["input_ids"] = [
        [(label if label != self.tokenizer.pad_token_id else -100) for label in labels_seq]
        for labels_seq in labels["input_ids"]
    ]
    model_inputs['labels'] = labels['input_ids']
    return model_inputs
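Side note on the snippet above: as_target_tokenizer() is deprecated in newer transformers releases. If I understand correctly, the same thing can be done with the text_target argument, and the manual -100 replacement is only needed because of padding='max_length'; with dynamic padding the collator handles it. A sketch, using the same assumed class attributes as above:
def _preprocess_function(self, examples):
    """Tokenize inputs and targets in a single call via text_target."""
    prefix = f"<{self.src_key}2{self.mt_key}>"
    inputs = [prefix + text for text in examples[self.src_key]]
    model_inputs = self.tokenizer(
        inputs,
        text_target=examples[self.mt_key],
        max_length=self.samples_filter_max_tokens,
        truncation=True,  # no padding here; DataCollatorForSeq2Seq pads per batch
    )
    # DataCollatorForSeq2Seq(label_pad_token_id=-100) masks label padding,
    # so no manual replacement of pad_token_id with -100 is needed.
    return model_inputs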
@moscow25 @valhalla @sshleifer @mrm8488 should I just set adafactor=True in my Seq2SeqTrainingArguments?
What about fp16, has anyone had success?
What about multiply_by_parameter_scale=True? Is this something I have to additionally configure, or is it a default value?
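For what it's worth, my understanding (happy to be corrected) is that adafactor=True still works but is deprecated in favour of the optim argument, roughly:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="out",
    # adafactor=True,     # older flag, still accepted but deprecated
    optim="adafactor",    # current way to select Adafactor in the Trainer
    learning_rate=1e-3,
    predict_with_generate=True,
)
As for multiply_by_parameter_scale, I believe that name comes from the original Mesh TensorFlow implementation and corresponds to scale_parameter in transformers' Adafactor class, so it only comes into play if you construct the optimizer yourself rather than via optim="adafactor".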
def run(self):
    """
    Execute the main training loop for fine-tuning the model.
    """
    self.setup()  # Ensure setup is complete, loads model, datasets

    # Preprocess datasets
    self.log.info("Tokenizing training dataset...")
    tokenized_train_dataset = self.train_dataset.map(self._preprocess_function, batched=True)
    self.log.info("Tokenizing validation dataset...")
    tokenized_eval_dataset = self.eval_dataset.map(self._preprocess_function, batched=True)

    data_collator = DataCollatorForSeq2Seq(
        tokenizer=self.tokenizer,
        model=self.model,
        label_pad_token_id=-100,
    )

    training_args = Seq2SeqTrainingArguments(
        output_dir=self.output_dir,
        logging_steps=self.logging_steps,
        save_steps=self.save_steps,
        eval_steps=self.eval_steps,
        eval_strategy="steps",
        predict_with_generate=True,
        report_to="wandb" if not self.is_test_run else [],
        metric_for_best_model=self.metric_for_best_model,
        greater_is_better=True,
        load_best_model_at_end=True,
        save_total_limit=self.save_total_limit,
        learning_rate=self.learning_rate,
        weight_decay=self.weight_decay,
        per_device_train_batch_size=self.per_device_train_batch_size,
        per_device_eval_batch_size=self.per_device_eval_batch_size,
        gradient_accumulation_steps=self.gradient_accumulation_steps,
        auto_find_batch_size=self.auto_find_batch_size,
        num_train_epochs=self.num_train_epochs,
        run_name=f"{self.run_name}",
        seed=self.seed,
        fp16=True,  # Enable mixed precision if supported
    )

    trainer = Seq2SeqTrainer(
        model=self.model,
        args=training_args,
        compute_metrics=self._compute_mt_metrics,
        train_dataset=tokenized_train_dataset,
        eval_dataset=tokenized_eval_dataset,
        tokenizer=self.tokenizer,
        data_collator=data_collator,
    )
    trainer.train()
Kind thanks!
Since this is a super old thread, I opened a follow-up here: Finetuning mT5 for specific language pair
Hello, I am trying to fine-tune T5 for translating a certain dialect. Is there a notebook built with all these tips in mind that I can use? And what is the state of training with fp16 today? Any help would be greatly appreciated, especially regarding which prefix to use and what to end the labels with. Basically, if there is starter code, that would be wonderful.