T5 Generates very short summaries

I am trying to generate summaries using t5-small with a maximum target length of 30. My original inputs are German PDF invoices; I run OCR and concatenate the words to create the input text, and the target outputs are the invoice numbers. However, even after 3 days on a V100 I get summaries that are exactly 200 tokens long (since epoch 1 or 2 out of 300) and garbage results. The summaries look like someone shuffled the original words a little, but they do contain the invoice number somewhere near the start.

What might cause it to stick to 200 generated tokens?
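
For reference, this is roughly how I understand generation length to be controlled at inference time (a minimal sketch with the standard transformers generate() API; ocr_text just stands in for my concatenated OCR output):

from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# ocr_text is a placeholder for the concatenated OCR words from one invoice
ocr_text = "Rechnung Nr. 12345 ..."
inputs = tok("summarize: " + ocr_text, return_tensors="pt", max_length=512, truncation=True)

# Without an explicit max_length, generation only stops at the script's
# default limit or when the model emits an EOS token
summary_ids = model.generate(**inputs, max_length=30, num_beams=4, early_stopping=True)
print(tok.decode(summary_ids[0], skip_special_tokens=True))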

Hi, are you fine-tuning the model or just generating summaries?

If you’re just looking to generate short summaries, then bart-large-xsum (https://huggingface.co/facebook/bart-large-xsum) should give you better results.

It’s after fine-tuning t5-small.

What is your max length for training?

Also, I think fine-tuning the BART XSum models should give you good results if you are specifically looking for short summaries.

--max_source_length=512

Oops, sorry. I meant the target max length?

This is what I run (finetune.sh is basically the same as in the examples):

WANDB_PROJECT='inv_num_t5_small' ./finetune.sh \
    --data_dir ${PWD}/inv-num \
    --output_dir inv-num-results-2 \
    --model_name_or_path t5-small \
    --train_batch_size 16 --eval_batch_size 16 \
    --num_train_epochs 300 \
    --max_source_length=512 \
    --max_target_length=30 --val_max_target_length=30 --test_max_target_length=30 \
    --logger wandb

…also, mBART learned to generate 8-9-token summaries after 1 epoch on the same dataset. T5 should also be able to handle German input, so that should not be the problem.

Hi @marton-avrios, have a look at this issue, this might be the reason. T5Tokenizer doesn’t add the EOS token at the end of the text; for now we need to manually add </s> at the end of both the source and target examples.
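
A quick way to check whether your installed tokenizer appends EOS (this behavior has changed across transformers releases) is a small sketch like:

from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
print(tok.eos_token, tok.eos_token_id)        # the EOS token ('</s>') and its id
# If the last id printed below is not tok.eos_token_id, EOS is not being
# appended automatically and has to be added to the text manually
print(tok("This is source text").input_ids)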

Yes, that seems probable. Could you give me a very short example of how to do that? I tried to find something related in the docs, but I only found information on how to do the most common things with the tokenizer. I suspect I should find out the id of the EOS token and just append that integer to the end of the tokenized input for both source and target?

You can just append a space and </s> at the end of each source and target text.

So if your source_text is This is source text and your summary or target_text is This is summary, then they become
This is source text </s>
and
This is summary </s>
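
As a rough sketch of that workaround with the tokenizer (the max lengths just mirror the ones in your command above):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

source_text = "summarize: This is source text"
target_text = "This is summary"

# Manually append the EOS token before tokenizing
source_ids = tokenizer(source_text + " </s>", max_length=512, truncation=True).input_ids
target_ids = tokenizer(target_text + " </s>", max_length=30, truncation=True).input_ids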

@marton-avrios, there was a trend within abstractive summarisation benchmarks which encouraged extractive-like summaries, i.e. the generated summaries reproduced existing sentences and were therefore naturally longer.

As suggested by @valhalla, the XSum task was explicitly created to encourage short abstractive summaries. Because you want a multilingual model, I suggest first fine-tuning mBART or T5 on XSum, and then applying these models to your custom data.

Actually, mBART worked really well out of the box, so I suspect the missing EOS token was responsible for the strange T5 results.

Has anyone gotten good T5 results with/without EOS? I tried updating the tokenizer to add </s>, but it doesn’t seem to help zero-shot performance on wmt_en_ro:

From a fork of this repo, you can run

git fetch upstream
git checkout t5tok

to get a version of the tokenizer that adds EOS.

When I ran eval on wmt_en_ro (without fine-tuning) I got:

t5tok (with `</s>`): 27.65
master (no EOS): 27.87

The commands to reproduce are in this PR description

Would love to know results on other datasets!

I am working with my own corpus that I have put together (10k observations) of small- and moderate-length documents. The summaries I have are very short, no longer than 18 words; the results are below.

##    rouge1  rouge2  rougeL
## P   0.517   0.309   0.496
## R   0.537   0.321   0.514
## F   0.523   0.313   0.502

I am going to try the </s> as recommended above (see the sketch after this list). I suppose I am doing something similar already, in that I:

  1. add the summarize: prefix to the main text: df['body'] = 'summarize: ' + df['body']
  2. prepend a pad token to the start of the summary text: df['summary'] = df['summary'].apply(lambda x: '<pad>' + x)
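
Something like this is what I plan to try (a minimal sketch; df stands in for my dataframe from the steps above, and whether the <pad> prefix is still needed depends on how the targets are fed to the model):

import pandas as pd

# df is assumed to be the dataframe with 'body' and 'summary' columns, e.g.:
df = pd.DataFrame({'body': ['This is source text'], 'summary': ['This is summary']})

# Task prefix on the source, manual EOS on both source and target
df['body'] = 'summarize: ' + df['body'] + ' </s>'
df['summary'] = df['summary'].apply(lambda x: '<pad>' + x + ' </s>')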

I guess the question is, though: do we need to do it anymore? This post suggests an update, but I am not sure if it’s in the nightly release yet.

It’s not in the release yet. It will be once this PR is merged.

I’m scared to merge it because adding </s> to the inputs seems to lead to truncated translations for en-fr (without fine-tuning), and I don’t know why.
The summaries look fine.

This is weird. As I said previously in an issue, in my experiments not adding </s> gave really bad results. Maybe instead of adding it automatically, we can explicitly mention in the docs that </s> is necessary when fine-tuning T5.

What kind of truncations, and was beam search being used?

Yes, beam search.
The last few words are missing.

Do other sampling methods result in this truncation?
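
Something like this should show whether it is specific to beam search (a rough sketch with t5-small and a made-up sentence; only whether the tail gets cut off under each decoding strategy is of interest, not the translation quality):

from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "translate English to French: The invoice was sent to the customer on the last day of the month."
inputs = tok(text, return_tensors="pt")

# Compare beam search, greedy decoding, and top-k sampling on the same input
outputs = {
    "beam": model.generate(**inputs, num_beams=4, max_length=64),
    "greedy": model.generate(**inputs, max_length=64),
    "sampled": model.generate(**inputs, do_sample=True, top_k=50, max_length=64),
}
for name, out in outputs.items():
    print(name, tok.decode(out[0], skip_special_tokens=True))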