T5 Finetuning Tips

Re Adafactor, I want to confirm that, based on the discussion above, when using HF we would just have

from transformers.optimization import Adafactor

optimizer = Adafactor(model.parameters(), relative_step=True, warmup_init=True)  # lr stays None
scheduler = None

Since, based on the HF implementation of Adafactor, in order to use warmup_init, relative_step must be True, which in turn means that lr must be None.

(I did get very fast convergence using these settings compared to Adam.)

Other question on “SEP” tokens:
The T5 model doesn’t have a SEP token; instead they do things like

<task prefix> hypothesis: <text> premise: <text>

In this case the model should learn that “premise:” functions as a “SEP”, right?
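
E.g. I’d format an MNLI-style example roughly like this (just a sketch; the exact task prefix is whatever the task uses):

from transformers import T5Tokenizer

# Sketch with an MNLI-style prefix; there is no explicit SEP token, the field
# names "hypothesis:" and "premise:" act as the separators.
tokenizer = T5Tokenizer.from_pretrained("t5-base")

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

text = f"mnli hypothesis: {hypothesis} premise: {premise}"
inputs = tokenizer(text)
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"]))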


Hello,

I’m sorry for asking such a stupid question. I’m having trouble with fine-tuning on T5/mT5, and I’m hoping for your help.

I’m trying to do fine-tuning using the pre-trained t5-base, t5-large, mt5-base, etc., but the models seem to generate target sentences with many extra tokens, such as <extra_id_0>, <extra_id_1>, <extra_id_2>, and so on. This is especially noticeable when I use t5-large.

I’m using the --fp16 option, and the dataset size is 10K<n<100K.

The training parameters are almost the same as those of Seq2SeqTrainer in transformers v3.4.0 and v4.0.0-rc-1.
I have tried both with and without a task prefix and have not had good results with either.

I’m not sure if it’s a matter of adjusting the parameters or of preprocessing the datasets, and I’m wondering where to start debugging my code.

I would be grateful for your advice.

Hi @yusukemori,

There were some issues with --fp16 for T5, and I don’t think they’re fixed yet; that could be one of the reasons for this problem.


Hi @valhalla,

Thank you for your advice!
I’ll try fine-tuning T5 without --fp16 and check how the output turns out.

Hi,

Sorry for the frequent posts.

I tried fine-tuning T5 without the --fp16 option, and the results seem to be better than when I used it.
However, it still tends to generate longer sentences than other Seq2SeqLMs (e.g. BART-large), and extra tokens are still generated. In particular, <extra_id_0> is generated at the beginning of the sentence.
Is this something that can be avoided by properly choosing model.config.task_specific_params or something similar?

Thank you.

I’m not sure why it’s generating the extra_id tokens.

Yes, you could try different values for the generate arguments to control the length. Specifically, you could use the length_penalty argument: set it to a value < 1.0 to encourage the model to generate shorter sequences, or to a value > 1.0 to encourage it to produce longer sequences.

By default, generate will use the arguments from config or config.task_specific_params, but you can also pass these args directly to generate to override them.
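
For example (the checkpoint and argument values here are purely illustrative):

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Illustrative only: a fine-tuned checkpoint would normally go here.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

input_ids = tokenizer("summarize: The quick brown fox jumped over the lazy dog.",
                      return_tensors="pt").input_ids

# Arguments passed to generate override config / config.task_specific_params.
output_ids = model.generate(
    input_ids,
    max_length=64,
    num_beams=4,
    length_penalty=0.6,   # < 1.0 encourages shorter outputs, > 1.0 longer ones (per the note above)
    early_stopping=True,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))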


Thank you, I’ll try changing length_penalty to solve the problem.
Thanks for letting me know that it is possible to override config or config.task_specific_params by directly passing the args.

Hey there, you might find this interesting: we have just fixed the fp16 issue for some of the T5 models.

More details below


I was working on an interesting problem of generating inferences from Excel data. I wrote a Python program to generate rules from the data in the form of RDF triples and am now training a T5-Base model. With about 10k training pairs of RDF rules and inferences, I was able to get roughly 80% to 85% test accuracy. I’m using the AdamW optimizer with an lr of 1e-5.
One issue I have seen is that the model is not able to generalize well to new numbers. E.g. if I pass a rule of “Critical” | priority_ticketshare | “23.09%” to the model, it returns the inference Critical priority tickets accounted for 22.09% of total tickets. While the statement is correct, the number it has taken is wrong. Any idea how to solve this?
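
For context, one training pair looks roughly like this (the prefix string is only illustrative; the real pipeline builds the strings from the Excel rows):

# Illustrative serialization of one rule/inference pair from the pipeline above.
subject, predicate, obj = "Critical", "priority_ticketshare", "23.09%"

source_text = f"rdf to text: {subject} | {predicate} | {obj}"   # hypothetical prefix
target_text = "Critical priority tickets accounted for 23.09% of total tickets."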

Hi,
I have some confusion regarding how these models are trained on very large documents. I am trying to train a text generation task whose output is a whole book. How much GPU memory will I need, and if I’m working on a 16GB GPU, how many tokens can I fit? My input length is at most 20 tokens.
Any help will be appreciated.

Hi,
Can you please share some experience or tips on how to train/test long sequences correctly with T5/mT5?
I need to generate 100-200 word texts based on 3-7 input keywords in different languages.
Should I look at length_penalty while training, or is it used only at inference time? If you have any more tips, could you please share them? Thank you.

I have a question about sample_weights. Typically, you can pass the sample_weights as the third element of a tuple when constructing the TensorFlow Dataset (Training and evaluation with the built-in methods  |  TensorFlow Core). However, for the class T5ForConditionalGeneration, the call method (which I assume is what is invoked when the model is called) only takes the following parameters:

def call(
    self,
    input_ids=None,
    attention_mask=None,
    decoder_input_ids=None,
    decoder_attention_mask=None,
    head_mask=None,
    decoder_head_mask=None,
    encoder_outputs=None,
    past_key_values=None,
    inputs_embeds=None,
    decoder_inputs_embeds=None,
    labels=None,
    use_cache=None,
    output_attentions=None,
    output_hidden_states=None,
    return_dict=None,
    training=False,
    **kwargs,
):

Source: transformers.models.t5.modeling_tf_t5 — transformers 4.7.0 documentation
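
For reference, the tuple pattern I mean is the standard Keras (x, y, sample_weight) one, roughly like this (dummy arrays just to show the shape of the idea; whether the TF T5 loss actually uses the weights is exactly my question):

import numpy as np
import tensorflow as tf

# A third tuple element is treated by Keras' model.fit() as per-sample weights.
input_ids = np.zeros((4, 16), dtype=np.int32)
attention_mask = np.ones((4, 16), dtype=np.int32)
labels = np.zeros((4, 16), dtype=np.int32)
sample_weights = np.array([1.0, 0.5, 2.0, 1.0], dtype=np.float32)

dataset = tf.data.Dataset.from_tensor_slices(
    ({"input_ids": input_ids, "attention_mask": attention_mask}, labels, sample_weights)
).batch(2)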

I don’t see a way for T5 to consider the sample weights. How do I pass in the sample weights to a T5ForConditionalGeneration model? Thanks!


How is the AdafactorSchedule supposed to be used? In particular, how often do I call the scheduler, e.g. every epoch, or maybe every 1000 steps?

Let’s say I fine-tune model1 with some data and params for paraphrasing, and I also train model2 with the same data and params for paraphrasing.

When I generate using model1 and model2, should the results be the same?

Thanks

Hi @moscow25,

I’m training a T5 base (the original version, not T5 v1.1) on AWS SageMaker & an HF Training DLC. I read your post about AdaFactor and the HF doc about it (AdaFactor (PyTorch)).

The following code comes from the HF doc and seems to match your post:

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    eps=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.0,
    relative_step=False,
    scale_parameter=False,
    warmup_init=False
)

Then, I looked for a way to use it with the existing HF scripts (run_translation.py and run_summarization.py) without changing the code of these scripts.

I discovered that the Seq2SeqTrainingArguments had an argument for that: adafactor.

By passing adafactor = True, it changes the optimizer from AdamW to AdaFactor in the following lines of the Trainer:

if self.args.adafactor:
    optimizer_cls = Adafactor
    optimizer_kwargs = {"scale_parameter": False, "relative_step": False}
else:
    optimizer_cls = AdamW
    optimizer_kwargs = {
        "betas": (self.args.adam_beta1, self.args.adam_beta2),
        "eps": self.args.adam_epsilon,
    }
optimizer_kwargs["lr"] = self.args.learning_rate
if self.sharded_ddp == ShardedDDPOption.SIMPLE:
    self.optimizer = OSS(
        params=optimizer_grouped_parameters,
        optim=optimizer_cls,
        **optimizer_kwargs,
    )
else:
    self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)

The consequence is a modification of 2 arguments of AdaFactor ("scale_parameter": False, "relative_step": False) compared to its defaults (check the default parameters here).

And if you pass the argument learning_rate = 1e-3 to the Seq2SeqTrainingArguments, you get exactly the optimizer = Adafactor(...) configuration printed at the top of this post.
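
In other words, something like this (a sketch; output_dir and the other usual arguments are placeholders):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="output",
    adafactor=True,       # Trainer then builds Adafactor with scale_parameter=False, relative_step=False
    learning_rate=1e-3,   # becomes the Adafactor lr
)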

Note: by passing learning_rate = 1e-3, you do not need to change the lr_scheduler with the following code, right?

lr_scheduler = AdafactorSchedule(optimizer)
trainer = Trainer(..., optimizers=(optimizer, lr_scheduler))

I did use this (i.e., adafactor = True) on AWS SageMaker & the HF Training DLC (cc @philschmid), but in the CloudWatch logs the printed learning rate was always 0 and the eval_loss was always exactly the same (a high number) at each evaluation. What went wrong?

Note: I found this blog post ([Paper] Adafactor: Adaptive Learning Rates with Sublinear Memory Cost) that says:

Notes: For the original T5 pre-trained models, which were pre-trained with a mixture of unsupervised and supervised objectives, Adam or AdamW optimizers are enough to get good results.

Then, I ran a training of my original T5 base (with the script run_translation.py) on AWS SageMaker & the HF Training DLC with the argument adafactor = False (i.e., the AdamW optimizer) and a learning_rate of 1e-4 (even 5e-5), and that did work.

What do you think of that? Does the HF implementation of AdaFactor work only with T5 v1.1, mT5 and ByT5, and not with the original version of T5?


Hey all!

Just to share some results. I finetuned the mT5-small (google/mt5-small) model on XNLI using PyTorch + PyTorch Lightning with the following parameters:

  • Hugging Face Adafactor, lr = 5e-4, no schedulers, with both scale_parameter and relative_step set to False (see the sketch after this list).
  • Sequence Length = 256 (trimmed by batch), Batch Size = 32, with gradient accumulation of 4.
  • GPU = Tesla P100
  • Validations every 20% of epoch
  • Training on XNLI English Set (datasets lib), validating on all_languages and averaging results. Results reported on validation set.
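
The optimizer setup in code (a rough sketch; the Lightning module, tokenization and the XNLI data pipeline are omitted):

from transformers import MT5ForConditionalGeneration
from transformers.optimization import Adafactor

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
optimizer = Adafactor(
    model.parameters(),
    lr=5e-4,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)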

I got 65.17% average accuracy across all languages. In the mT5 paper they report 67.5%. Could anyone reproduce those results?

Thanks guys!



I used Adafactor with the hyperparameters you suggested, but it seems that T5 is overfitting.
I use 100k sentence pairs (WMT14 de-en) to train the model.

I’m finetuning t5-large for text2sql using a batch size of 2 and gradient accumulation steps of 600. I’m training it on an RTX A6000.
Currently, it is showing ~1700/it. Is this normal? If not, how should I proceed?
I’m using the finetuning code from here and made changes to the data pre-processing steps only.


I have a problem running Adafactor using the Trainer.
When writing my own training loop, everything works well with Adafactor.
When using the Trainer with a constant learning rate, everything works well.
When I try to use the Trainer with Adafactor, it prints that the learning rate at each step is 0, and naturally the training error does not decrease. Here’s what I do:


optimizer = Adafactor(model.parameters(), lr=0.001, eps=(1e-30, 1e-3), clip_threshold=1.0,
                      decay_rate=-0.8, beta1=None, weight_decay=0.0, scale_parameter=False,
                      relative_step=False, warmup_init=False)

lr_scheduler = AdafactorSchedule(optimizer)

training_args = TrainingArguments(
    optim='adafactor',
    ...
)

trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=training_set,
                  eval_dataset=val_dataset,
                  tokenizer=tokenizer,
                  optimizers=(optimizer, lr_scheduler),
                  )

What am I missing? Should the optimizer be passed to the optimizers param of the Trainer or to the TrainingArguments as optim? Or to both? This is a bit confusing.


I have problems finetuning T5 on a text classification task.
I have 10 labels, but only 3 labels have training data. I finetune the T5 model on the training data (only 3 labels), then I validate it on the remaining 7 labels. The performance is worse than the zero-shot result of the unfinetuned T5. So I am confused: if I finetune T5, will the model lose its zero-shot ability?