T5 Finetuning Tips

Hi @moscow25,

I’m training a T5 base (the original version, not T5 v1.1) on AWS SageMaker with a HF Training DLC. I read your post about Adafactor and the HF doc about it (Adafactor (PyTorch)).

The following code comes from the HF doc and seems to match your post:

from transformers.optimization import Adafactor

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    eps=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.0,
    relative_step=False,
    scale_parameter=False,
    warmup_init=False,
)

Then, I looked at how to use it with the existing HF scripts (run_translation.py and run_summarization.py) without changing their code.

I discovered that Seq2SeqTrainingArguments has an argument for that: adafactor.
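
To make that concrete, here is a minimal sketch of the idea (output_dir and predict_with_generate are just placeholders for this example; only adafactor matters here, and it is what the --adafactor flag of run_translation.py maps to):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./t5-base-finetuned",  # placeholder
    adafactor=True,                    # switch the Trainer from AdamW to Adafactor
    predict_with_generate=True,        # placeholder, as in the example scripts
)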

Passing adafactor = True switches the optimizer from AdamW to Adafactor in the following lines of the Trainer:

if self.args.adafactor:
    optimizer_cls = Adafactor
    optimizer_kwargs = {"scale_parameter": False, "relative_step": False}
else:
    optimizer_cls = AdamW
    optimizer_kwargs = {
        "betas": (self.args.adam_beta1, self.args.adam_beta2),
        "eps": self.args.adam_epsilon,
    }
optimizer_kwargs["lr"] = self.args.learning_rate
if self.sharded_ddp == ShardedDDPOption.SIMPLE:
    self.optimizer = OSS(
        params=optimizer_grouped_parameters,
        optim=optimizer_cls,
        **optimizer_kwargs,
    )
else:
    self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)

The consequence is that 2 arguments of Adafactor are changed from their defaults ("scale_parameter": False, "relative_step": False) (check the default parameters here).

And if you also pass learning_rate = 1e-3 to Seq2SeqTrainingArguments, you get exactly the optimizer = Adafactor(...) configuration printed at the top of this post.
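
To spell out why it is exactly the same: the Trainer only sets scale_parameter, relative_step and lr, and the remaining arguments (eps, clip_threshold, decay_rate, beta1, weight_decay, warmup_init) keep the Adafactor defaults, which are the same values written explicitly at the top. So with adafactor = True and learning_rate = 1e-3, the Trainer effectively builds something like this (model being the loaded T5; I use model.parameters() instead of the grouped parameters since weight_decay is 0.0 by default):

optimizer = Adafactor(   # same import as at the top
    model.parameters(),
    scale_parameter=False,
    relative_step=False,
    lr=1e-3,
)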

Note: by passing learning_rate = 1e-3, you do not need to change the lr_scheduler with the following code, right?

from transformers.optimization import AdafactorSchedule

lr_scheduler = AdafactorSchedule(optimizer)
trainer = Trainer(..., optimizers=(optimizer, lr_scheduler))
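
If I read the HF doc correctly, AdafactorSchedule is only meant as a proxy scheduler for the relative-step setup, where Adafactor computes its own learning rate (lr=None), i.e. something like:

optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,                     # Adafactor computes the lr itself
)
lr_scheduler = AdafactorSchedule(optimizer)   # proxy so the Trainer can log the lr
trainer = Trainer(..., optimizers=(optimizer, lr_scheduler))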

I did use this (i.e., adafactor = True) on AWS SageMaker with a HF Training DLC (cc @philschmid), but in the CloudWatch logs the printed learning rate was always 0 and the eval_loss was always exactly the same (a high number) at each evaluation. What was wrong?

Note: I found this blog post ([Paper] Adafactor: Adaptive Learning Rates with Sublinear Memory Cost) that says:

Notes: For the original T5 pre-trained models, which were pre-trained with a mixture of unsupervised and supervised objectives, Adam or AdamW optimizers are enough to get good results.

Then, I trained my original T5 base (with the script run_translation.py) on AWS SageMaker with a HF Training DLC, using the argument adafactor = False (i.e., the AdamW optimizer) and a learning_rate of 1e-4 (even 5e-5), and that did work.
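
For comparison, that working run corresponds to something like this (values other than adafactor and learning_rate are placeholders again):

training_args = Seq2SeqTrainingArguments(
    output_dir="./t5-base-translation",  # placeholder
    adafactor=False,                     # default: keep AdamW
    learning_rate=1e-4,                  # 5e-5 also worked
    predict_with_generate=True,          # placeholder
)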

What do you think of that? Does the HF Adafactor implementation work only with T5 v1.1, mT5 and ByT5, and not with the original version of T5?
