T5 Finetuning Tips

Did you solve your issue? I think passing the optimizer is enough; you don't need to pass it again as optim="adafactor" in the TrainingArguments.
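For reference, here is a minimal sketch of that setup: an Adafactor instance handed to the Trainer through its optimizers argument. model, training_args, train_dataset and eval_dataset are placeholders for objects you would already have, and training_args should not also set optim="adafactor".

from transformers import Trainer
from transformers.optimization import Adafactor

# Constant-LR Adafactor, configured explicitly instead of via optim="adafactor"
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
    lr=1e-3,
)

trainer = Trainer(
    model=model,
    args=training_args,            # plain TrainingArguments, without optim="adafactor"
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    optimizers=(optimizer, None),  # None lets the Trainer create its usual scheduler
)
trainer.train()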

Sharing my results from transfer learning with flan-t5-small for translation.

Experiment 1:

from transformers.optimization import Adafactor

optimizer = Adafactor(model.parameters(), scale_parameter=False, relative_step=False, warmup_init=False, lr=1e-3)
# no scheduler (constant learning rate of 1e-3)

Experiment 2:

from transformers.optimization import Adafactor, AdafactorSchedule

optimizer = Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)
lr_scheduler = AdafactorSchedule(optimizer)

Got faster convergence with experiment 1, but better final performance with experiment 2.


Hi @pierreguillou!
I tried the method you mentioned, i.e. using Adafactor in the Hugging Face transformers Trainer to fine-tune the original version of T5.
The version of transformers I am using is 4.28.1.

I used the run_translation.py script like you did. The script defaults to AdamW. Following the latest transformers documentation, I pass "--optim adafactor" to select Adafactor and "--learning_rate 1e-3" to set the learning rate to 1e-3.
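For anyone setting this up in code rather than on the command line, those two flags map onto the training arguments roughly as below; this is only a sketch, and output_dir (plus everything else about the run) is a placeholder.

from transformers import Seq2SeqTrainingArguments

# Rough equivalent of passing "--optim adafactor --learning_rate 1e-3" to run_translation.py
training_args = Seq2SeqTrainingArguments(
    output_dir="./t5-translation",  # placeholder path
    optim="adafactor",
    learning_rate=1e-3,
    do_train=True,
)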

Basically, this is the same way you are using it. Ultimately, I did not observe the issue you mentioned regarding eval_loss and learning_rate. However, my results show that Adafactor is not as good as AdamW, all other parameters being equal (on my own dataset, of course).

I would like to ask: do you have any new findings or tips about using Adafactor in the Trainer?

Some people have mentioned multi-task fine-tuning of T5, and I am wondering if anyone has been successful in fine-tuning T5 on different task types, for example fine-tuning Q&A as well as summarization in the same model. I can train them separately using their corresponding models (e.g. AutoModelForQuestionAnswering or AutoModelForSeq2SeqLM). However, when I try to merge the datasets by interleaving or concatenating them and use T5ForConditionalGeneration to cover both tasks, only the summarizer works. Does anyone have examples that show multiple task types (e.g. Q&A + summarization) in the same model training?
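Not an answer from this thread, but the usual text-to-text recipe is to cast both tasks into the same input/target string format with a task prefix, interleave the two datasets, and train a single T5ForConditionalGeneration with Seq2SeqTrainer. A rough sketch, where the dataset names ("squad", "xsum"), split sizes and hyperparameters are purely illustrative:

from datasets import load_dataset, interleave_datasets
from transformers import (
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Cast extractive QA into text-to-text form with a "question: ... context: ..." prefix
def qa_to_text(example):
    return {
        "input_text": f"question: {example['question']} context: {example['context']}",
        "target_text": example["answers"]["text"][0],
    }

# Cast summarization into the same form with a "summarize:" prefix
def summ_to_text(example):
    return {
        "input_text": "summarize: " + example["document"],
        "target_text": example["summary"],
    }

qa_raw = load_dataset("squad", split="train[:2000]")
qa = qa_raw.map(qa_to_text, remove_columns=qa_raw.column_names)

summ_raw = load_dataset("xsum", split="train[:2000]")
summ = summ_raw.map(summ_to_text, remove_columns=summ_raw.column_names)

# Mix the two tasks so every batch can contain both
mixed = interleave_datasets([qa, summ])

def tokenize(batch):
    model_inputs = tokenizer(batch["input_text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["target_text"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

mixed = mixed.map(tokenize, batched=True, remove_columns=["input_text", "target_text"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="./t5-multitask", per_device_train_batch_size=8),
    train_dataset=mixed,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

At inference time the same prefixes select the task, e.g. feeding "summarize: ..." versus "question: ... context: ..." to model.generate().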


Would like to know as well.

I'm facing something similar: when I pass optim="adafactor", whether or not I set a learning_rate (or leave the default), every training phase shows a learning rate of "0.0", so my model never updates.
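One quick sanity check (just a suggestion, not something confirmed in this thread) is to look at the learning rate actually stored on the optimizer the Trainer builds, to see whether the "0.0" is only a logging artifact; trainer here is assumed to be an already-constructed Trainer.

# Build the optimizer the Trainer will use and inspect its parameter groups
trainer.create_optimizer()
print(type(trainer.optimizer).__name__)         # e.g. "Adafactor" when optim="adafactor"
print(trainer.optimizer.param_groups[0]["lr"])  # should match the learning_rate you passed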