T5 Generates very short summaries

marton-avrios · July 14, 2020, 1:33pm

I am trying to generate summaries using t5-small with a maximum target length of 30. My original inputs are German PDF invoices. I run OCR and concatenate the words to create the input text. My outputs should be the invoice numbers. However, even after 3 days on a V100 I get summaries that are exactly 200 tokens long (since epoch 1 or 2 out of 300) and garbage results. The summaries look like someone shuffled the original words a little, but they do contain the invoice number somewhere near the start.

What might cause it to stick to 200 generated tokens?

valhalla · July 14, 2020, 1:41pm

Hi, are you fine-tuning the model or just generating summaries ?

If you’re just looking to generate short summaries then bart-large-xsum https://huggingface.co/facebook/bart-large-xsum should give you better results.

marton-avrios · July 14, 2020, 1:42pm

It’s after finetuning t5-small

valhalla · July 14, 2020, 2:10pm

What is your max length for training?

Also I think fine-tuning bart xsum models should give you good results if you are specifically looking for short summaries

marton-avrios · July 14, 2020, 2:33pm

--max_source_length=512

valhalla · July 14, 2020, 2:54pm

Oops, sorry. I meant target max length ?

marton-avrios · July 14, 2020, 3:12pm

this is what I run (finetune.sh is basically the same as in the examples)

WANDB_PROJECT='inv_num_t5_small' ./finetune.sh \
    --data_dir ${PWD}/inv-num \
    --output_dir inv-num-results-2 \
    --model_name_or_path t5-small \
    --train_batch_size 16 --eval_batch_size 16 \
    --num_train_epochs 300 \
    --max_source_length=512 \
    --max_target_length=30 --val_max_target_length=30 --test_max_target_length=30 \
    --logger wandb

marton-avrios · July 14, 2020, 5:47pm

…also mBART learned to generate 8-9 token long summaries after 1 epoch on the same dataset. T5 should also be able to handle German input, so that should not be the problem.

valhalla · July 15, 2020, 4:29am

Hi @marton-avrios, have a look at this issue, this might be the reason. T5Tokenizer doesn’t add the EOS token at the end of the text, so for now we need to manually add </s> at the end of both the source and target examples.
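To illustrate why a missing EOS could pin the output length, here is a toy greedy decoding loop. It is only a sketch, not the actual generation code in transformers: the two "models" are hypothetical stand-ins, token id 42 is arbitrary, and the cap of 200 is used only because that is the length reported above. The assumption is that decoding stops either when EOS is emitted or when the length cap is hit, so a model whose training targets never ended in EOS always runs to the cap.

```python
from typing import Callable, List

EOS_ID = 1  # assumption: id of </s> in the released T5 vocabularies


def greedy_decode(next_token: Callable[[List[int]], int], max_length: int) -> List[int]:
    """Toy loop: stop when EOS is produced or max_length tokens exist."""
    generated: List[int] = []
    while len(generated) < max_length:
        token = next_token(generated)
        generated.append(token)
        if token == EOS_ID:
            break
    return generated


def with_eos(prefix: List[int]) -> int:
    # Hypothetical model that learned to end its output after 8 tokens.
    return EOS_ID if len(prefix) == 8 else 42


def without_eos(prefix: List[int]) -> int:
    # Hypothetical model whose targets never contained EOS: it never stops.
    return 42


stopped_early = greedy_decode(with_eos, max_length=200)
ran_to_cap = greedy_decode(without_eos, max_length=200)
print(len(stopped_early), len(ran_to_cap))  # 9 200
```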

marton-avrios · July 15, 2020, 7:41am

yes, seems probable. could you give me a very short example on how to do that? I tried to find something related in the doc but I only found information on how to do the most common things with the tokenizer. I suspect I should find out the id of the EOS and just append this integer to the end of the tokenized input for both source and target?

valhalla · July 15, 2020, 7:51am

You can just append a space and </s> at the end of each source and target text

So if your source_text is This is source text and your summary or target_text is This is summary, then they’ll become
This is source text </s>
and
This is summary </s>
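A minimal sketch of that preprocessing step in Python, assuming the examples are plain strings. The helper name add_eos is made up for illustration, and it skips text that already ends in the marker so the step can be re-run safely.

```python
EOS = "</s>"


def add_eos(text: str) -> str:
    """Append a space and </s> unless the text already ends with it."""
    text = text.strip()
    if text.endswith(EOS):
        return text
    return f"{text} {EOS}"


source_text = "This is source text"
target_text = "This is summary"

print(add_eos(source_text))  # This is source text </s>
print(add_eos(target_text))  # This is summary </s>
```

One caveat, stated as an assumption rather than something verified here: if the tokenizer later truncates a long source to --max_source_length, a marker appended as text may be cut off along with the tail of the input.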

chrisdoyleIE · July 15, 2020, 11:33am

@marton-avrios, there was a trend within abstractive summarisation benchmarks that encouraged extractive-like summaries, i.e. generated summaries reproduced existing sentences and were therefore naturally longer.

As suggested by @valhalla, the XSum task was explicitly created to encourage short abstractive summaries. Because you want a multilingual model, I suggest first fine-tuning either mBART or T5 on XSum and then applying these models to your custom data.

marton-avrios · July 15, 2020, 5:45pm

actually mBART worked really well out of the box so I suspect the missing eos token was responsible for strange T5 results

sshleifer · July 18, 2020, 11:38am

Has anyone gotten good T5 results with/without EOS? I tried updating the tokenizer to add </s>, but it doesn’t seem to help zero-shot performance on wmt_en_ro:

From a fork of this repo, you can run

git fetch upstream
git checkout t5tok

to get a version of the tokenizer that adds EOS.

When I ran eval on wmt_en_ro (without finetuning) I got

t5tok (with `</s>`): 27.65
master (no EOS): 27.87

The commands to reproduce are in this PR description

Would love to know results on other datasets!

BadDepartment · July 19, 2020, 4:07pm

I am working with my own corpus (10k observations) of small and moderate length documents. The summaries I have are very short, no longer than 18 words. The results are below.

##    rouge1  rouge2  rougeL
## P   0.517   0.309   0.496
## R   0.537   0.321   0.514
## F   0.523   0.313   0.502

I am going to try the </s> as recommended above. I suppose I am doing something similar in that I:

add summarize to the main text: df['body'] = 'summarize: ' + df['body']
add a pad token at the first position of the summary text: df['summary'] = df['summary'].apply(lambda x: '<pad>' + x)

I guess the question is, though, do we need to do it anymore? This post suggests an update, but I am not sure if it’s in the nightly release yet.

github.com/huggingface/transformers

Truncated Outputs by t5 fine-tuned models

opened 10:50AM - 10 Jul 20 UTC

closed 06:56PM - 25 Aug 20 UTC

manojpreveen

t5

I fine-tuned t5-small over CNN/DM dataset using the finetune_t5.sh script. The outputs produced by the saved fine-tuned model is okayish but it's getting cut i.e., producing incomplete sentence at the end.

Example:

Article: (CNN)The only thing crazier than a guy in snowbound Massachusetts boxing up the powdery white stuff and offering it for sale online? People are actually buying it. For $89, self-styled entrepreneur Kyle Waring will ship you 6 pounds of Boston-area snow in an insulated Styrofoam box -- enough for 10 to 15 snowballs, he says.Kyle Waring died last week. But not if you live in New England or surrounding states. "We will not ship snow to any states in the northeast!" says Waring's website, ShipSnowYo.com. "We're in the business of expunging snow!" His website and social media accounts claim to have filled more than 133 orders for snow -- more than 30 on Tuesday alone, his busiest day yet. With more than 45 total inches, Boston has set a record this winter for the snowiest month in its history. Most residents see the huge piles of snow choking their yards and sidewalks as a nuisance, but Waring saw an opportunity. According to Boston.com, it all started a few weeks ago, when Waring and his wife were shoveling deep snow from their yard in Manchester-by-the-Sea, a coastal suburb north of Boston. He joked about shipping the stuff to friends and family in warmer states, and an idea was born. His business slogan: "Our nightmare is your dream!" At first, ShipSnowYo sold snow packed into empty 16.9-ounce water bottles for $19.99, but the snow usually melted before it reached its destination. So this week, Waring began shipping larger amounts in the Styrofoam cubes, which he promises will arrive anywhere in the U.S. in less than 20 hours. He also has begun selling a 10-pound box of snow for $119. Many of his customers appear to be companies in warm-weather states who are buying the snow as a gag, he said.
Whether Waring can sustain his gimmicky venture into the spring remains to be seen. But he has no shortage of product. "At this rate, it's going to be July until the snow melts," he told Boston.com. "But I've thought about taking this idea and running with it for other seasonal items. Maybe I'll ship some fall foliage."

Summary produced by t5-small fine-tuned over CNN/DM: Kyle Waring will ship you 6 pounds of snow in an insulated Styrofoam box for $89 . The self-styled entrepreneur says he will not ship snow to any states in the northeast . Waring's website and social media accounts claim to have filled more than 133 orders for snow . "We're in the business of expunging snow!" Waring says . He has begun selling a 10-pound box of snow for $119 . His business slogan: "Our nightmare is your

At first I thought this might be because the model hasn't converged as I just ran for 1 epoch but it's producing similar truncated outputs even for t5-small fine-tuned over cnn/dm for 5 epochs. Also this problem is not related to min_length or max_length parameters I think, as it produced similar outputs for all combinations of those two parameters. Tried changing --max_source_length, --max_target_length, --val_max_target_length, --test_max_target_length (these 4 parameters are present in finetune.py) parameter's values too from their default values before fine-tuning but no use. What might be the reason for this truncation? Is this a problem of the fine-tuning code used to fine-tune pretrained models as pre-trained models don't produce this kind of outputs.
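On the question of whether the manual step is still needed, one way to find out for a given install is to look at the ids the tokenizer actually returns. The sketch below works on plain lists of ids so it runs without downloading a model. The helper names are hypothetical, and the EOS id of 1 is an assumption about the released T5 vocabularies (with a real tokenizer, read it from tokenizer.eos_token_id instead).

```python
from typing import List, Sequence


def ends_with_eos(token_ids: Sequence[int], eos_token_id: int) -> bool:
    """True if the encoded sequence already finishes with EOS."""
    return len(token_ids) > 0 and token_ids[-1] == eos_token_id


def ensure_eos(token_ids: Sequence[int], eos_token_id: int, max_length: int) -> List[int]:
    """Return ids truncated to max_length with exactly one EOS at the end."""
    ids = list(token_ids)
    if ends_with_eos(ids, eos_token_id):
        ids = ids[:-1]  # drop it so truncation cannot remove it
    ids = ids[: max_length - 1]  # leave room for EOS
    return ids + [eos_token_id]


EOS_ID = 1  # assumption, see the note above

# With transformers installed, the ids would come from something like:
#   ids = tokenizer.encode("summarize: some text")
#   ends_with_eos(ids, tokenizer.eos_token_id)
print(ends_with_eos([363, 19, 3, 9], EOS_ID))       # False
print(ensure_eos([363, 19, 3, 9], EOS_ID, 30))      # [363, 19, 3, 9, 1]
print(ensure_eos([363, 19, 3, 9], EOS_ID, 3))       # [363, 19, 1]
```

If ends_with_eos is already True for the tokenizer output, the manual append is presumably redundant for that version.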

sshleifer · July 19, 2020, 9:46pm

It’s not in the release yet. It will be when this PR is merged.

I’m scared to merge it because adding </s> to inputs seems to lead to truncated translations for en-fr (without finetuning) and I don’t know why.
The summaries look fine.

valhalla · July 20, 2020, 12:11pm

This is weird, as I said previously in an issue, in my experiments not adding </s> gave really bad results. Maybe instead of adding it automatically, we can mention this in the doc explicitly that </s> is necessary when fine-tuning T5.

chrisdoyleIE · August 6, 2020, 5:09pm

what kind of truncations, and was beam search being used?

sshleifer · August 6, 2020, 7:40pm

yes beam search.
last few words missing.

chrisdoyleIE · August 7, 2020, 8:30am

Do other sampling methods result in this truncation?
