T5 Gen Len is only about 1/7 of max_target_length

Following this HuggingFace Google Colab notebook, I fine-tuned t5-small for text summarization.
I ran inference on a few examples and it does a good job of producing one-sentence summaries.

But the problem is Gen Len.

While training, I set max_target_length to 128, and the average summary length in the dataset is 54 words.

But Gen Len only comes out to about 18. Does anyone know what could be causing this?
Thanks

Here’s the full training result:

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum | Gen Len |
|-------|---------------|-----------------|--------|--------|--------|-----------|---------|
| 1 | 0.692200 | 0.640369 | 35.381500 | 26.480100 | 33.188200 | 34.246600 | 18.976800 |
| 2 | 0.576300 | 0.613857 | 35.903000 | 27.494000 | 33.932100 | 34.904800 | 18.912500 |
| 3 | 0.451700 | 0.587891 | 37.067900 | 28.700700 | 35.132400 | 36.064400 | 18.910700 |
| 4 | 0.352300 | 0.587816 | 37.168600 | 29.102700 | 35.183600 | 36.261000 | 18.912500 |
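
Gen Len above is the value reported by the notebook’s compute_metrics, which (roughly, and assuming the standard summarization notebook setup) is the mean number of non-padding tokens in the generated predictions:

    import numpy as np

    # rough sketch of the Gen Len computation, assuming `predictions` holds the
    # generated token IDs from the eval loop and `tokenizer` is the T5 tokenizer
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    gen_len = np.mean(prediction_lens)  # the "Gen Len" column above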

I’m seeing something similar when trying to train a BART model. It only produces predictions of 20 tokens, which seems to be the default for the generate function, and no matter which config parameter I set (max_length, max_token_length, etc.) during the compute_metrics stage of the Trainer loop, the predictions stay at 20 tokens…
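
(The 20-token figure matches the library-wide generation default; a quick check, assuming a stock GenerationConfig:)

    from transformers import GenerationConfig

    # max_length defaults to 20, which is why an untouched config generates ~20-token outputs
    print(GenerationConfig().max_length)  # -> 20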


I found the reason for my problem. During fine-tuning I was only producing labels of 20 tokens (which I assume is some sort of default), so that is what the model learned: only produce summaries 20 tokens long. It was caused by the tokenization step in my training-data preprocessing.

This was my tokenization step for the labels during fine-tuning:

            tokenized_targets = self.tokenizer(
                text_target=batch[self.label_column],
                max_length=self.model.config.max_token_length,
                **self.tokenizer_config,
                **kwargs,
            )

But when I change it to this, it does produce labels longer than 20 tokens. I’m not sure why I need to provide batch[self.label_column] as an input to the tokenizer twice, effectively, but this way it does “listen” to the max_length (which I set to 128 in the model.config in my class):

            tokenized_targets = self.tokenizer(
                batch[self.label_column], # Added this to make it work
                text_target=batch[self.label_column],
                max_length=self.model.config.max_token_length,
                **self.tokenizer_config,
                **kwargs,
            )

Hope that helps, since your generated length is awfully close to 20, and there may be some summaries that are shorter, hence your average length < 20…

Also, I’m not sure where your max_target_length argument is set, but max_length (or, in the v5 version, max_token_length) is the correct argument for the tokenizer, so you may also want to look at that.
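
For example, something along these lines (a minimal sketch; tokenizer and summaries stand in for your own tokenizer and reference summaries):

    # labels tokenized with an explicit max_length; truncation only kicks in past 128 tokens
    labels = tokenizer(text_target=summaries, max_length=128, truncation=True)
    print(max(len(ids) for ids in labels["input_ids"]))  # should go well beyond 20 now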


I have found the solution to this: it is the generation_max_length parameter in the training args.

> generation_max_length (int, optional) — The max_length to use on each evaluation loop when predict_with_generate=True. Will default to the max_length value of the model configuration.

Please refer to this section: Trainer

There, search for "generation_max_length".

There is also GenerationConfig that you can use, but generation_max_length will solve the issue :hugs:
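
Roughly, setting it would look like this (a sketch assuming Seq2SeqTrainer is being used; output_dir and the other values are placeholders):

    from transformers import Seq2SeqTrainingArguments

    training_args = Seq2SeqTrainingArguments(
        output_dir="t5-small-finetuned-summarization",  # placeholder
        predict_with_generate=True,    # generate full summaries during evaluation
        generation_max_length=128,     # max_length passed to generate() in the eval loop
        # ... the rest of the arguments as in the notebook
    )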