T5 generates gibberish after fine-tuning for 10 epochs

Hi all,

With a learning rate of 1e-4, the model starts to generate gibberish and the loss starts to increase.
This is how I generate:

self.model.generate(input_ids=input_ids, num_beams=num_beams, early_stopping=True, max_length=max_length, no_repeat_ngram_size=2, min_length=0, repetition_penalty=1.2)

Any help will be appreciated ty!

x_decoded[:2]:[‘translate English to German:Glandorf said he had had a nice long talk with the Icelander, who will select his squad on Tuesday for the games against Switzerland.’, ‘translate English to German:His ideas and approach to the game have always inspired me, as I am sure they will also inspire my long-term teammates and my successors, too.’]

pred_decoded[:2]:[Während der Saison hat es aufgrund der heidigen en, die er in sterreich. und stünde, ob ich? - )'m n t g l u r d i v h p c f j k w z y x q / 0! : ; ” – ’…, ‘Die Ideen und das Konzept des Spiels haben mich immer inspiriert, ich bin sicher, dass auch meine Nachfolger meiner langjährigen…’]

label_decoded[:2]:[‘Mit dem Isländer, der am Dienstag seinen Kader für die Länderspiele gegen die Schweiz bekanntgibt, führte Glandorf ein langes und gutes Gespräch.’, ‘Seine Ideen und Ansätze haben mich genauso begeistert, wie sie auch meine langjährigen Mitspieler und meine Nachfolger begeistern werden.’]

Ten epochs seems like a rather high number for training T5 to me. How are the training and validation losses evolving throughout the training?

If you don’t already, you could add an early-stopping mechanism so that training stops once the model starts getting worse, and then load the best model at the end. The Trainer class supports this.
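The idea behind early stopping can be sketched in plain Python. This is a minimal illustration, not the Trainer implementation: the function name, the `patience` parameter and the loss values are all made up for the example. With Trainer, the corresponding pieces would be `EarlyStoppingCallback` together with `load_best_model_at_end=True` in the training arguments.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs. best_epoch is the checkpoint to reload.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran through all epochs without triggering the stop condition.
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses, one per epoch (invented for illustration).
val_losses = [2.10, 1.60, 1.45, 1.50, 1.58, 1.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)  # 4 2
```

Here training would stop after the fifth epoch (index 4), and the checkpoint from the third epoch (index 2) would be the one to keep.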

Hope that helps.


Thanks for your reply Heiko!

The training loss keeps decreasing, and the validation loss starts to increase after 10 epochs. I am currently training on about 2000 sentence pairs.

Is 10 epochs enough for fine-tuning T5? May I ask what learning rate and number of epochs you use?

I know it may be overfitting, I’m just a bit curious about why overfitting would make the model generate gibberish.


A decrease in training loss combined with an increase in validation loss is a classic sign that your model is overfitting. In my experience, 2 or 3 epochs are usually sufficient for Transformer models. For example, in this demo a GPT2 model was fine-tuned to generate Shakespeare-style text in just 3 epochs.

With regard to why an overfitted model produces gibberish, I like to use the following analogy, which is probably an oversimplification but gets the point across:

Imagine a first-grade student learning about multiplication. Instead of learning the general principle of how to multiply numbers, this student memorises a multiplication table covering the numbers up to 13.

This student will perform extraordinarily well in multiplication tests whenever the numbers involved are 13 or smaller. However, this student won’t know what to do once they get the task of, for example, multiplying 14 and 21, because they never learned the underlying principle. The student then might just answer with a random number because they don’t know what else to do.
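The analogy can be written down as a small Python sketch. All names here are invented for the illustration, and returning `None` only stands in for "no idea": a real model would still emit some output rather than stay silent.

```python
# The memorising student: a lookup table for factors 1..13 only.
memorised_table = {(a, b): a * b for a in range(1, 14) for b in range(1, 14)}


def memorising_student(a, b):
    # Returns the memorised answer, or None if this pair was never seen.
    return memorised_table.get((a, b))


def generalising_student(a, b):
    # Learned the rule itself: multiplication as repeated addition
    # (for a non-negative integer b).
    result = 0
    for _ in range(b):
        result += a
    return result


print(memorising_student(12, 13))    # 156 (inside the memorised table)
print(memorising_student(14, 21))    # None (outside the memorised table)
print(generalising_student(14, 21))  # 294
```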

A similar thing happens with overfitted models: The model will memorise the training data and when that happens it won’t know what to do with data outside of the training data. This is my best guess why the model then starts producing gibberish.

Hope that helps.


Appreciate that so much!! That question was confusing me, because I am training T5 on some synthetic (generated) data and I thought it was the data that caused the issue.

Sorry for the late reply, I thought no one would answer that question. Thank you so much Heiko :)