Following this HuggingFace Google Colab page, I fine-tuned t5-small for text summarization. I ran inference on a few examples and it does a good job of producing one-sentence summaries. The problem is Gen Len: during training I set max_target_length to 128, and the average summary in the dataset is 54 words long, yet Gen Len comes out at only about 18. Does anyone know what could be causing this?
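For reference, my preprocessing follows the notebook, roughly like this (a sketch; the "document"/"summary" column names and the "summarize: " prefix are the notebook's XSum defaults and may not match my exact setup):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

max_input_length = 1024
max_target_length = 128  # labels are truncated to 128 tokens here
prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    # Older transformers versions use `with tokenizer.as_target_tokenizer():`
    # instead of the text_target argument.
    labels = tokenizer(text_target=examples["summary"], max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs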
Thanks
Here’s the full training result:
| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum | Gen Len |
|-------|---------------|-----------------|--------|--------|--------|-----------|---------|
| 1 | 0.692200 | 0.640369 | 35.381500 | 26.480100 | 33.188200 | 34.246600 | 18.976800 |
| 2 | 0.576300 | 0.613857 | 35.903000 | 27.494000 | 33.932100 | 34.904800 | 18.912500 |
| 3 | 0.451700 | 0.587891 | 37.067900 | 28.700700 | 35.132400 | 36.064400 | 18.910700 |
| 4 | 0.352300 | 0.587816 | 37.168600 | 29.102700 | 35.183600 | 36.261000 | 18.912500 |
I’m seeing something similar when trying to train a BART model. It only produces predictions of 20 tokens, which seems to be the default for the generate function, and no matter which config parameter I set (max_length, max_token_length, etc.), during the compute_metrics stage of the Trainer loop it still only produces predictions of 20 tokens.
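To illustrate what I mean by the 20-token default, here is a minimal sketch outside the Trainer (facebook/bart-base is used purely as an example checkpoint):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

inputs = tokenizer("Some long article text ...", return_tensors="pt", truncation=True)

# With no explicit cap, generate() falls back to the config's max_length
# (20 unless the checkpoint overrides it), so outputs get cut short.
short_ids = model.generate(**inputs)

# Passing max_length (or max_new_tokens) explicitly lifts that cap.
long_ids = model.generate(**inputs, max_length=128)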
I found the reason for my problem. During fine-tuning I was only producing labels of 20 tokens (which I assume is some sort of default), so that is what the model learned: produce summaries that are only 20 tokens long. It was caused by the tokenization step in my training-data preprocessing.
For the labels, I originally tokenized like this:
tokenized_targets = self.tokenizer(
    text_target=batch[self.label_column],
    max_length=self.model.config.max_token_length,
    **self.tokenizer_config,
    **kwargs,
)
But when I change it to the following, it does produce labels longer than 20 tokens. I’m not sure why I need to provide batch[self.label_column] as an input to the tokenizer, effectively twice, but this way it does “listen” to the max_length (which I set to 128 via model.config in my class):
tokenized_targets = self.tokenizer(
    batch[self.label_column],  # Added this to make it work
    text_target=batch[self.label_column],
    max_length=self.model.config.max_token_length,
    **self.tokenizer_config,
    **kwargs,
)
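A quick way to confirm the labels are no longer capped (a sketch, assuming tokenized_targets holds a batch as plain Python lists, i.e. no return_tensors):

# Inspect label lengths; after the fix they should exceed 20 tokens wherever
# the reference summaries are longer than that.
lengths = [len(ids) for ids in tokenized_targets["input_ids"]]
print(max(lengths), sum(lengths) / len(lengths))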
Hope that helps you, since your generated length is awfully close to 20, and some summaries may be shorter than that, hence your average length < 20.
Also, I’m not sure where your max_target_length argument is set, but max_length (or, in the V5 version, max_token_length) is the correct argument for the tokenizer, so you may want to look at that as well.
I have found the solution to this. It is caused by the generation_max_length parameter in the training args:

generation_max_length (int, optional) — The max_length to use on each evaluation loop when predict_with_generate=True. Will default to the max_length value of the model configuration.

Please refer to this section of the docs: Trainer. Search there for "generation_max_length".
There is also a GenerationConfig you can use, but setting generation_max_length will solve the issue.
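A minimal sketch of how that looks in the training args (the output_dir name and the value 128 are just examples):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-small-finetuned-summarization",  # example name
    predict_with_generate=True,
    generation_max_length=128,  # max_length used by generate() in the eval loop
    # ... other arguments (learning rate, batch sizes, etc.) as before
)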