T5 Finetuning Tips

Re Adafactor, I want to confirm that, based on the discussion above, when using HF we would just have

from transformers.optimization import Adafactor

optimizer = Adafactor(model.parameters(), relative_step=True, warmup_init=True)  # lr stays None
scheduler = None

Since, based on the HF implementation of Adafactor, in order to use warmup_init, relative_step must be True, which in turn means that lr must be None.

(I did get very fast convergence using these settings compared to Adam.)

Other question on “SEP” tokens:
The T5 model doesn’t have a SEP token; instead they do things like

<task prefix> hypothesis: <text> premise: <text>

In this case the model should learn that “premise:” functions as a “SEP”, right?
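
E.g. I’d format an MNLI-style example roughly like this (just a sketch; the exact task prefix is whatever the task uses):

from transformers import T5Tokenizer

# Sketch with an MNLI-style prefix; there is no explicit SEP token, the field
# names "hypothesis:" and "premise:" act as the separators.
tokenizer = T5Tokenizer.from_pretrained("t5-base")

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

text = f"mnli hypothesis: {hypothesis} premise: {premise}"
inputs = tokenizer(text)
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"]))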


Hello,

I’m sorry for asking such a stupid question. I’m having trouble with fine-tuning on T5/mT5, and I’m hoping for your help.

I’m trying to do fine-tuning using the pre-trained t5-base, t5-large, mt5-base, etc., but the models seem to generate target sentences with many extra tokens, such as <extra_id_0>, <extra_id_1>, <extra_id_2>, and so on. This is especially noticeable when I use t5-large.

I’m using the --fp16 option, and the dataset size is 10K<n<100K.

The training parameters are almost the same as those of Seq2SeqTrainer in transformers v3.4.0 and v4.0.0-rc-1.
I have tried both with and without a task prefix and have not had good results with either.

I’m not sure if it’s a matter of adjusting the parameters or of preprocessing the datasets, and I’m wondering where to start debugging my code.

I would be grateful for your advice.

Hi @yusukemori,

There were some issues with --fp16 for T5, and I don’t think they’re fixed yet; that could be one of the reasons for this problem.


Hi @valhalla,

Thank you for your advice!
I’ll try fine-tuning T5 without --fp16 and check how the output turns out.

Hi,

Sorry for the frequent posts.

I tried fine-tuning T5 without the --fp16 option, and the results seem to be better than when I used it.
However, it still tends to generate longer sentences than other Seq2SeqLMs (e.g. BART-large), and extra tokens are still generated. In particular, <extra_id_0> is generated at the beginning of the sentence.
Is this something that can be avoided by properly choosing model.config.task_specific_params or something similar?

Thank you.

I’m not sure why it’s generating the extra_id tokens.

Yes, you could try different values for the generate arguments to control the length. Specifically, you could use the length_penalty argument: set it to a value < 1.0 to encourage the model to generate shorter sequences, or to a value > 1.0 to encourage it to produce longer sequences.

By default, generate will use the arguments from config or config.task_specific_params, but you can also pass these args directly to generate to override them.
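
For example (the checkpoint and argument values here are purely illustrative):

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Illustrative only: a fine-tuned checkpoint would normally go here.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

input_ids = tokenizer("summarize: The quick brown fox jumped over the lazy dog.",
                      return_tensors="pt").input_ids

# Arguments passed to generate override config / config.task_specific_params.
output_ids = model.generate(
    input_ids,
    max_length=64,
    num_beams=4,
    length_penalty=0.6,   # < 1.0 encourages shorter outputs, > 1.0 longer ones (per the note above)
    early_stopping=True,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))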


Thank you, I’ll try changing length_penalty to solve the problem.
Thanks for letting me know that it is possible to override config or config.task_specific_params by directly passing the args.

Hey there, you might find this interesting: we have just fixed the fp16 issue for some of the T5 models.

More details below


I was working on an interesting problem of generating inferences from Excel data. I wrote a Python program to generate rules from the data in the form of RDF triples and am now training a T5-Base model. With about 10k training pairs of RDF rules and inferences, I was able to get roughly 80% to 85% test accuracy. I’m using the AdamW optimizer with an lr of 1e-5.
One issue I have seen is that the model is not able to generalize well to new numbers. E.g. if I pass a rule of “Critical” | priority_ticketshare | “23.09%” to the model, it returns the inference Critical priority tickets accounted for 22.09% of total tickets. While the statement is correct, the number it has taken is wrong. Any idea how to solve this?
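
For context, one training pair looks roughly like this (the prefix string is only illustrative; the real pipeline builds the strings from the Excel rows):

# Illustrative serialization of one rule/inference pair from the pipeline above.
subject, predicate, obj = "Critical", "priority_ticketshare", "23.09%"

source_text = f"rdf to text: {subject} | {predicate} | {obj}"   # hypothetical prefix
target_text = "Critical priority tickets accounted for 23.09% of total tickets."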

Hi,
I have some confusion regarding how these models are trained on very large documents. I am trying to train a text generation task whose output is a whole book. How much GPU memory will I need, and if I’m working on a 16GB GPU, how many tokens can I fit? My input length is at most 20 tokens.
Any help will be appreciated.

Hi,
Can you please share some experience or tips on how to train/test long sequences correctly with T5/mT5?
I need to generate 100-200 word texts based on 3-7 input keywords in different languages.
Should I look at length_penalty while training, or is it used only at inference time? If you have any more tips, could you please share them? Thank you.

I have a question about sample_weights. Typically, you can pass the sample_weights as the third element of a tuple when constructing the TensorFlow Dataset (Training and evaluation with the built-in methods  |  TensorFlow Core). However, for the class T5ForConditionalGeneration, the call method (which I assume is what is invoked when the model is called) only takes the following parameters:

def call(
    self,
    input_ids=None,
    attention_mask=None,
    decoder_input_ids=None,
    decoder_attention_mask=None,
    head_mask=None,
    decoder_head_mask=None,
    encoder_outputs=None,
    past_key_values=None,
    inputs_embeds=None,
    decoder_inputs_embeds=None,
    labels=None,
    use_cache=None,
    output_attentions=None,
    output_hidden_states=None,
    return_dict=None,
    training=False,
    **kwargs,
):

Source: transformers.models.t5.modeling_tf_t5 — transformers 4.7.0 documentation
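
For reference, the tuple pattern I mean is the standard Keras (x, y, sample_weight) one, roughly like this (dummy arrays just to show the shape of the idea; whether the TF T5 loss actually uses the weights is exactly my question):

import numpy as np
import tensorflow as tf

# A third tuple element is treated by Keras' model.fit() as per-sample weights.
input_ids = np.zeros((4, 16), dtype=np.int32)
attention_mask = np.ones((4, 16), dtype=np.int32)
labels = np.zeros((4, 16), dtype=np.int32)
sample_weights = np.array([1.0, 0.5, 2.0, 1.0], dtype=np.float32)

dataset = tf.data.Dataset.from_tensor_slices(
    ({"input_ids": input_ids, "attention_mask": attention_mask}, labels, sample_weights)
).batch(2)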

I don’t see a way for T5 to consider the sample weights. How do I pass in the sample weights to a T5ForConditionalGeneration model? Thanks!


How is the AdafactorSchedule supposed to be used? In particular, how often do I call the scheduler, e.g. every epoch, or maybe every 1000 steps?

Let’s say I fine-tune model1 with some data and params for paraphrasing, and I also train model2 with the same data and params for paraphrasing.

When I generate using model1 and model2, should the results be the same?

Thanks

Hi @moscow25,

I’m training a T5 base (the original version, not T5 v1.1) on AWS SageMaker & an HF Training DLC. I read your post about AdaFactor and the HF doc about it (AdaFactor (PyTorch)).

The following code comes from the HF doc and seems to match your post:

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    eps=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.0,
    relative_step=False,
    scale_parameter=False,
    warmup_init=False
)

Then, I looked for a way to use it with the existing HF scripts (run_translation.py and run_summarization.py) without changing the code of these scripts.

I discovered that the Seq2SeqTrainingArguments had an argument for that: adafactor.

By passing adafactor = True, it changes the optimizer from AdamW to AdaFactor in the following lines of the Trainer:

if self.args.adafactor:
    optimizer_cls = Adafactor
    optimizer_kwargs = {"scale_parameter": False, "relative_step": False}
else:
    optimizer_cls = AdamW
    optimizer_kwargs = {
        "betas": (self.args.adam_beta1, self.args.adam_beta2),
        "eps": self.args.adam_epsilon,
    }
optimizer_kwargs["lr"] = self.args.learning_rate
if self.sharded_ddp == ShardedDDPOption.SIMPLE:
    self.optimizer = OSS(
        params=optimizer_grouped_parameters,
        optim=optimizer_cls,
        **optimizer_kwargs,
    )
else:
    self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)

The consequence is a modification of 2 arguments of AdaFactor ("scale_parameter": False, "relative_step": False) compared to its defaults (check the default parameters here).

And if you pass the argument learning_rate = 1e-3 to the Seq2SeqTrainingArguments, you get exactly the optimizer = Adafactor(...) configuration printed at the top of this post.
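
In other words, something like this (a sketch; output_dir and the other usual arguments are placeholders):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="output",
    adafactor=True,       # Trainer then builds Adafactor with scale_parameter=False, relative_step=False
    learning_rate=1e-3,   # becomes the Adafactor lr
)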

Note: by passing learning_rate = 1e-3, you do not need to change the lr_scheduler with the following code, right?

lr_scheduler = AdafactorSchedule(optimizer)
trainer = Trainer(..., optimizers=(optimizer, lr_scheduler))

I did use this (i.e., adafactor = True) on AWS SageMaker & the HF Training DLC (cc @philschmid), but in the CloudWatch logs the printed learning rate was always 0 and the eval_loss was always exactly the same (a high number) at each evaluation. What went wrong?

Note: I found this blog post ([Paper] Adafactor: Adaptive Learning Rates with Sublinear Memory Cost) that says:

Notes: For the original T5 pre-trained models, which were pre-trained with a mixture of unsupervised and supervised objectives, Adam or AdamW optimizers are enough to get good results.

Then, I ran a training of my original T5 base (with the script run_translation.py) on AWS SageMaker & the HF Training DLC with the argument adafactor = False (i.e., the AdamW optimizer) and a learning_rate of 1e-4 (even 5e-5), and that did work.

What do you think of that? Does the HF implementation of AdaFactor work only with T5 v1.1, mT5 and ByT5, and not with the original version of T5?


Hey all!

Just to share some results. I finetuned the mT5-small (google/mt5-small) model on XNLI using PyTorch + PyTorch Lightning with the following parameters:

  • Hugging Face Adafactor, lr = 5e-4, no schedulers, with both scale_parameter and relative_step set to False (see the sketch after this list).
  • Sequence Length = 256 (trimmed by batch), Batch Size = 32, with gradient accumulation of 4.
  • GPU = Tesla P100
  • Validations every 20% of epoch
  • Training on XNLI English Set (datasets lib), validating on all_languages and averaging results. Results reported on validation set.
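
The optimizer setup in code (a rough sketch; the Lightning module, tokenization and the XNLI data pipeline are omitted):

from transformers import MT5ForConditionalGeneration
from transformers.optimization import Adafactor

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
optimizer = Adafactor(
    model.parameters(),
    lr=5e-4,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)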

I got 65.17% average accuracy across all languages. In the mT5 paper they report 67.5%. Could anyone reproduce those results?

Thanks guys!



I used Adafactor with the hyperparameters you suggested, but it seems that T5 is overfitting.
I use 100k sentence pairs (WMT14 de-en) to train the model.

I’m finetuning t5-large for text2sql using a batch size of 2 and gradient accumulation steps of 600. I’m training it on an RTX A6000.
Currently, it is showing ~1700/it. Is this normal? If not, how should I proceed?
I’m using the finetuning code from here and made changes to the data pre-processing steps only.


I have a problem running Adafactor using the Trainer.
When writing my own training loop, everything works well with Adafactor.
When using the Trainer with a constant learning rate, everything works well.
When I try to use the Trainer with Adafactor, it prints that the learning rate at each step is 0, and naturally the training error does not decrease. Here’s what I do:


optimizer = Adafactor(model.parameters(), lr=0.001, eps=(1e-30, 1e-3), clip_threshold=1.0,
                      decay_rate=-0.8, beta1=None, weight_decay=0.0, scale_parameter=False,
                      relative_step=False, warmup_init=False)

lr_scheduler = AdafactorSchedule(optimizer)

training_args = TrainingArguments(
    optim='adafactor',
    ...
)

trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=training_set,
                  eval_dataset=val_dataset,
                  tokenizer=tokenizer,
                  optimizers=(optimizer, lr_scheduler),
                  )

What am I missing? Should the optimizer be passed to the optimizers param of the Trainer or to the TrainingArguments as optim? Or to both? This is a bit confusing.


I have problems finetuning T5 on a text classification task.
I have 10 labels, but only 3 labels have training data. I finetune the T5 model on the training data (only 3 labels), then I validate it on the remaining 7 labels. The performance is worse than the zero-shot result of the unfinetuned T5. So I am confused: if I finetune T5, will the model lose its zero-shot ability?