T5 Finetuning Tips

Hi @moscow25,

I’m training a T5 base (the original version, not T5 v1.1) on AWS SageMaker with a HF Training DLC. I read your post about Adafactor and the HF doc about it (Adafactor (PyTorch)).

The following code comes from the HF doc and seems to match your post:

from transformers.optimization import Adafactor

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    eps=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.0,
    relative_step=False,
    scale_parameter=False,
    warmup_init=False,
)

Then, I looked at how to use it with the existing HF scripts (run_translation.py and run_summarization.py) without changing their code.

I discovered that Seq2SeqTrainingArguments has an argument for that: adafactor.
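
To make that concrete, here is a minimal sketch of the idea (output_dir and predict_with_generate are just placeholders for this example; only adafactor matters here, and it is what the --adafactor flag of run_translation.py maps to):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./t5-base-finetuned",  # placeholder
    adafactor=True,                    # switch the Trainer from AdamW to Adafactor
    predict_with_generate=True,        # placeholder, as in the example scripts
)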

Passing adafactor = True switches the optimizer from AdamW to Adafactor in the following lines of the Trainer:

if self.args.adafactor:
    optimizer_cls = Adafactor
    optimizer_kwargs = {"scale_parameter": False, "relative_step": False}
else:
    optimizer_cls = AdamW
    optimizer_kwargs = {
        "betas": (self.args.adam_beta1, self.args.adam_beta2),
        "eps": self.args.adam_epsilon,
    }
optimizer_kwargs["lr"] = self.args.learning_rate
if self.sharded_ddp == ShardedDDPOption.SIMPLE:
    self.optimizer = OSS(
        params=optimizer_grouped_parameters,
        optim=optimizer_cls,
        **optimizer_kwargs,
    )
else:
    self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)

The consequence is that 2 arguments of Adafactor are changed from their defaults ("scale_parameter": False, "relative_step": False) (check the default parameters here).

And if you also pass learning_rate = 1e-3 to Seq2SeqTrainingArguments, you get exactly the optimizer = Adafactor(...) configuration printed at the top of this post.
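
To spell out why it is exactly the same: the Trainer only sets scale_parameter, relative_step and lr, and the remaining arguments (eps, clip_threshold, decay_rate, beta1, weight_decay, warmup_init) keep the Adafactor defaults, which are the same values written explicitly at the top. So with adafactor = True and learning_rate = 1e-3, the Trainer effectively builds something like this (model being the loaded T5; I use model.parameters() instead of the grouped parameters since weight_decay is 0.0 by default):

optimizer = Adafactor(   # same import as at the top
    model.parameters(),
    scale_parameter=False,
    relative_step=False,
    lr=1e-3,
)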

Note: by passing learning_rate = 1e-3, you do not need to change the lr_scheduler with the following code, right?

from transformers.optimization import AdafactorSchedule

lr_scheduler = AdafactorSchedule(optimizer)
trainer = Trainer(..., optimizers=(optimizer, lr_scheduler))
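
If I read the HF doc correctly, AdafactorSchedule is only meant as a proxy scheduler for the relative-step setup, where Adafactor computes its own learning rate (lr=None), i.e. something like:

optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,                     # Adafactor computes the lr itself
)
lr_scheduler = AdafactorSchedule(optimizer)   # proxy so the Trainer can log the lr
trainer = Trainer(..., optimizers=(optimizer, lr_scheduler))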

I did use this (i.e., adafactor = True) on AWS SageMaker with a HF Training DLC (cc @philschmid), but in the CloudWatch logs the printed learning rate was always 0 and the eval_loss was always exactly the same (a high number) at each evaluation. What was wrong?

Note: I found this blog post ([Paper] Adafactor: Adaptive Learning Rates with Sublinear Memory Cost) that says:

Notes: For the original T5 pre-trained models, which were pre-trained with a mixture of unsupervised and supervised objectives, Adam or AdamW optimizers are enough to get good results.

Then, I trained my original T5 base (with the script run_translation.py) on AWS SageMaker with a HF Training DLC, using the argument adafactor = False (i.e., the AdamW optimizer) and a learning_rate of 1e-4 (even 5e-5), and that did work.
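
For comparison, that working run corresponds to something like this (values other than adafactor and learning_rate are placeholders again):

training_args = Seq2SeqTrainingArguments(
    output_dir="./t5-base-translation",  # placeholder
    adafactor=False,                     # default: keep AdamW
    learning_rate=1e-4,                  # 5e-5 also worked
    predict_with_generate=True,          # placeholder
)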

What do you think of that? Does the HF Adafactor implementation work only with T5 v1.1, mT5 and ByT5, and not with the original version of T5?
