T5 Gen Len is only about 1/7 of max_target_length

Following this HuggingFace Google Colab notebook, I fine-tuned t5-small for text summarization.
I ran inference on a few examples and it does a good job of producing one-sentence summaries.

But the problem is Gen Len.

While training, I set max_target_length to 128, and the average summary length in the dataset is 54 words.

But Gen Len only comes out to about 18. Does anyone know what could be causing this?
Thanks

Here’s the full training result:

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum | Gen Len |
|-------|---------------|-----------------|--------|--------|--------|-----------|---------|
| 1 | 0.692200 | 0.640369 | 35.381500 | 26.480100 | 33.188200 | 34.246600 | 18.976800 |
| 2 | 0.576300 | 0.613857 | 35.903000 | 27.494000 | 33.932100 | 34.904800 | 18.912500 |
| 3 | 0.451700 | 0.587891 | 37.067900 | 28.700700 | 35.132400 | 36.064400 | 18.910700 |
| 4 | 0.352300 | 0.587816 | 37.168600 | 29.102700 | 35.183600 | 36.261000 | 18.912500 |
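
Gen Len above is the value reported by the notebook’s compute_metrics, which (roughly, and assuming the standard summarization notebook setup) is the mean number of non-padding tokens in the generated predictions:

    import numpy as np

    # rough sketch of the Gen Len computation, assuming `predictions` holds the
    # generated token IDs from the eval loop and `tokenizer` is the T5 tokenizer
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    gen_len = np.mean(prediction_lens)  # the "Gen Len" column above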

I’m seeing something similar when trying to train a BART model. It only produces predictions of 20 tokens, which seems to be the default for the generate function, and no matter which config parameter I set (max_length, max_token_length, etc.) during the compute_metrics stage of the Trainer loop, the predictions stay at 20 tokens…
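
(The 20-token figure matches the library-wide generation default; a quick check, assuming a stock GenerationConfig:)

    from transformers import GenerationConfig

    # max_length defaults to 20, which is why an untouched config generates ~20-token outputs
    print(GenerationConfig().max_length)  # -> 20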


I found the reason for my problem. During fine-tuning I was only producing labels of 20 tokens (which I assume is some sort of default), so that is what the model learned: only produce summaries 20 tokens long. It was caused by the tokenization step in my training-data preprocessing.

This was my tokenization step for the labels during fine-tuning:

            tokenized_targets = self.tokenizer(
                text_target=batch[self.label_column],
                max_length=self.model.config.max_token_length,
                **self.tokenizer_config,
                **kwargs,
            )

But when I change it to this, it does produce labels longer than 20 tokens. I’m not sure why I need to provide batch[self.label_column] as an input to the tokenizer twice, effectively, but this way it does “listen” to the max_length (which I set to 128 in the model.config in my class):

            tokenized_targets = self.tokenizer(
                batch[self.label_column], # Added this to make it work
                text_target=batch[self.label_column],
                max_length=self.model.config.max_token_length,
                **self.tokenizer_config,
                **kwargs,
            )

Hope that helps, since your generated length is awfully close to 20, and there may be some summaries that are shorter, hence your average length < 20…

Also, I’m not sure where your max_target_length argument is set, but max_length (or, in the v5 version, max_token_length) is the correct argument for the tokenizer, so you may also want to look at that.
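
For example, something along these lines (a minimal sketch; tokenizer and summaries stand in for your own tokenizer and reference summaries):

    # labels tokenized with an explicit max_length; truncation only kicks in past 128 tokens
    labels = tokenizer(text_target=summaries, max_length=128, truncation=True)
    print(max(len(ids) for ids in labels["input_ids"]))  # should go well beyond 20 now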


I have found the solution to this: it is the generation_max_length parameter in the training args.

> generation_max_length (int, optional) — The max_length to use on each evaluation loop when predict_with_generate=True. Will default to the max_length value of the model configuration.

Please refer to this section: Trainer

There, search for "generation_max_length".

There is also GenerationConfig that you can use, but generation_max_length will solve the issue :hugs:
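
Roughly, setting it would look like this (a sketch assuming Seq2SeqTrainer is being used; output_dir and the other values are placeholders):

    from transformers import Seq2SeqTrainingArguments

    training_args = Seq2SeqTrainingArguments(
        output_dir="t5-small-finetuned-summarization",  # placeholder
        predict_with_generate=True,    # generate full summaries during evaluation
        generation_max_length=128,     # max_length passed to generate() in the eval loop
        # ... the rest of the arguments as in the notebook
    )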