Hi @moscow25,
I’m training a T5 base (the original version, not T5 v1.1) on AWS SageMaker with a HF Training DLC. I read your post about Adafactor and the HF doc about it (Adafactor (PyTorch)).
The following code comes from the HF doc and seems to match your post:
from transformers.optimization import Adafactor

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    eps=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.0,
    relative_step=False,
    scale_parameter=False,
    warmup_init=False,
)
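If one modified the training code directly, I understand this optimizer could be passed to the Trainer like this (a minimal sketch: training_args and train_dataset are placeholders, and using get_constant_schedule is my assumption, consistent with the fixed lr and relative_step=False):

from transformers import Trainer, get_constant_schedule

# Sketch: pass the explicitly built Adafactor to the Trainer.
# get_constant_schedule is an assumption here, consistent with a fixed lr=1e-3.
lr_scheduler = get_constant_schedule(optimizer)

trainer = Trainer(
    model=model,                  # the T5 base model loaded earlier
    args=training_args,           # placeholder TrainingArguments
    train_dataset=train_dataset,  # placeholder dataset
    optimizers=(optimizer, lr_scheduler),
)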
Then, I looked at how to use it with the existing HF scripts (run_translation.py and run_summarization.py) without changing their code.
I discovered that Seq2SeqTrainingArguments has an argument for exactly that: adafactor. Passing adafactor = True switches the optimizer from AdamW to Adafactor in the following lines of the Trainer:
optimizer_cls = Adafactor if self.args.adafactor else AdamW
if self.args.adafactor:
    optimizer_cls = Adafactor
    optimizer_kwargs = {"scale_parameter": False, "relative_step": False}
else:
    optimizer_cls = AdamW
    optimizer_kwargs = {
        "betas": (self.args.adam_beta1, self.args.adam_beta2),
        "eps": self.args.adam_epsilon,
    }
optimizer_kwargs["lr"] = self.args.learning_rate
if self.sharded_ddp == ShardedDDPOption.SIMPLE:
    self.optimizer = OSS(
        params=optimizer_grouped_parameters,
        optim=optimizer_cls,
        **optimizer_kwargs,
    )
else:
    self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
The consequence is that two of Adafactor's default arguments are overridden ("scale_parameter": False, "relative_step": False; check the default parameters here). And if you also pass learning_rate = 1e-3 to the Seq2SeqTrainingArguments, you get exactly the optimizer = Adafactor(...) code printed at the top of this post.
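In other words, my understanding is that the Seq2SeqTrainingArguments below reproduce that setup (a sketch: output_dir and any omitted arguments are placeholders):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./outputs",  # placeholder
    adafactor=True,          # Trainer then builds Adafactor(scale_parameter=False, relative_step=False)
    learning_rate=1e-3,      # forwarded as optimizer_kwargs["lr"]
)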
Note: by passing learning_rate = 1e-3, you do not need to replace the lr_scheduler with the following code, right?

from transformers.optimization import AdafactorSchedule

lr_scheduler = AdafactorSchedule(optimizer)
trainer = Trainer(..., optimizers=(optimizer, lr_scheduler))
I did use this (i.e., adafactor = True) on AWS SageMaker with the HF Training DLC (cc @philschmid), but in the CloudWatch logs the printed learning rate was always 0 and the eval_loss was always exactly the same (high) value at each evaluation. What was wrong?
Note: I found this blog post ([Paper] Adafactor: Adaptive Learning Rates with Sublinear Memory Cost) that says:
Notes: For the original T5 pre-trained models, which were pre-trained with a mixture of unsupervised and supervised objectives, Adam or AdamW optimizers are enough to get good results.
Then, I ran a training of my original T5 base (with the script run_translation.py) on AWS SageMaker with the HF Training DLC, with the argument adafactor = False (i.e., the AdamW optimizer) and learning_rate = 1e-4 (even 5e-5), and that did work.
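For reference, that AdamW run corresponds roughly to the hyperparameters below, which the SageMaker HuggingFace estimator forwards to run_translation.py (a sketch: all values except learning_rate are placeholders for my actual setup):

# Sketch of the hyperparameters dict passed to the HuggingFace estimator.
# Leaving out "adafactor" keeps the Trainer's default AdamW optimizer.
hyperparameters = {
    "model_name_or_path": "t5-base",
    "source_lang": "en",        # placeholder
    "target_lang": "fr",        # placeholder
    "do_train": True,
    "do_eval": True,
    "learning_rate": 1e-4,      # also tried 5e-5
    "output_dir": "/opt/ml/model",
}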
What do you think of that? Does the HF Adafactor implementation work only with T5 v1.1, mT5 and ByT5, and not with the original T5?