I am trying to generate summaries using t5-small with a maximum target length of 30. My original inputs are German PDF invoices; I run OCR and concatenate the words to create the input text, and the outputs should be the invoice numbers. However, even after 3 days on a V100 I get summaries that are exactly 200 tokens long (since epoch 1 or 2 out of 300) and garbage results. The summaries look like someone shuffled the original words a little, but they do contain the invoice number somewhere near the start.
What might cause it to stick to 200 generated tokens?
…also, mBART learned to generate 8–9-token summaries after 1 epoch on the same dataset. And T5 should be able to handle German input, so that should not be the problem.
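(For context: generation only stops before the length limit when the model emits the EOS token, so a model that never learned to produce `</s>` decodes until whatever max_length the script configured, e.g. 200. A minimal sketch of that behaviour, assuming a plain transformers setup and a made-up invoice input:)

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# hypothetical invoice text; "summarize:" is T5's summarization task prefix
input_ids = tokenizer.encode("summarize: Rechnung Nr. 12345 ...", return_tensors="pt")

# generate() stops early only if the model emits tokenizer.eos_token_id;
# otherwise it keeps decoding until max_length is reached
output_ids = model.generate(input_ids, max_length=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```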
Hi @marton-avrios, have a look at this issue; this might be the reason. T5Tokenizer doesn't add an EOS token at the end of the text, so for now we need to manually add `</s>` at the end of both source and target examples.
Yes, that seems probable. Could you give me a very short example of how to do that? I tried to find something related in the docs, but I only found information on how to do the most common things with the tokenizer. I suspect I should find the id of the EOS token and just append that integer to the end of the tokenized input for both source and target?
You can just append a space and `</s>` at the end of each source and target text. So if your source_text is `This is source text` and your summary/target_text is `This is summary`, they become `This is source text </s>` and `This is summary </s>`.
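(A minimal sketch of that, assuming the transformers T5Tokenizer; the trailing `</s>` string tokenizes to `tokenizer.eos_token_id`, so appending the integer id after encoding would work just as well:)

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

source_text = "summarize: This is source text </s>"
target_text = "This is summary </s>"

source_ids = tokenizer.encode(source_text)
target_ids = tokenizer.encode(target_text)

# the appended </s> ends up as the EOS id at the end of each sequence
assert source_ids[-1] == tokenizer.eos_token_id
assert target_ids[-1] == tokenizer.eos_token_id
```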
@marton-avrios, there was a trend within abstractive summarisation benchmarks that encouraged extractive-like summaries, i.e. generated summaries that reproduced existing sentences and were therefore naturally longer.
As suggested by @valhalla, the XSum task was explicitly created to encourage short abstractive summaries. Since you want a multilingual model, I suggest first finetuning mBART or T5 on XSum, and then applying these models to your custom data.
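(If useful: XSum is available through the datasets library; a short sketch, with the field names taken from the public dataset card:)

```python
from datasets import load_dataset

# XSum: BBC articles paired with one-sentence abstractive summaries
xsum = load_dataset("xsum")
example = xsum["train"][0]
print(example["document"][:200])
print(example["summary"])
```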
Has anyone gotten good T5 results with/without EOS? I tried updating the tokenizer to add `</s>`, but it doesn't seem to help zero-shot performance on wmt_en_ro:
From a fork of this repo, you can run

```
git fetch upstream
git checkout t5tok
```

to get a version of the tokenizer that adds EOS.
When I ran eval on wmt_en_ro (without finetuning) I got:

t5tok (with `</s>`): 27.65
master (no EOS): 27.87
The commands to reproduce are in this PR description
I have my own corpus that I built and am working with (10k observations) of small- and moderate-length documents. The summaries I have are very short, no longer than 18 words; the results are below.
|   | rouge1 | rouge2 | rougeL |
|---|--------|--------|--------|
| P | 0.517  | 0.309  | 0.496  |
| R | 0.537  | 0.321  | 0.514  |
| F | 0.523  | 0.313  | 0.502  |
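(The post doesn't say which tool produced these numbers; as one way to get per-metric P/R/F like the table above, a sketch using the rouge_score package on a single hypothetical pair:)

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

# score(reference, prediction) returns precision/recall/F per metric
scores = scorer.score("the reference summary", "the generated summary")
for name, s in scores.items():
    print(name, round(s.precision, 3), round(s.recall, 3), round(s.fmeasure, 3))
```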
I am going to try the `</s>` as recommended above. I suppose I am already doing something similar, in that (see the combined sketch after the list):
- I add the `summarize:` prefix to the main text: `df['body'] = 'summarize: ' + df['body']`
- I add a pad token at the first position of the summary text: `df['summary'] = df['summary'].apply(lambda x: '<pad>' + x)`
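(Putting those steps together with the manual EOS, a minimal pandas sketch; the column names and the `<pad>` prefix mirror the post above, and the sample rows are made up:)

```python
import pandas as pd

# hypothetical frame with the columns used above
df = pd.DataFrame({
    "body": ["Some moderately long document text ..."],
    "summary": ["a very short summary"],
})

df["body"] = "summarize: " + df["body"] + " </s>"    # task prefix + manual EOS
df["summary"] = "<pad>" + df["summary"] + " </s>"    # leading pad token + manual EOS
```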
I guess the question is, though: do we still need to do it? This post suggests an update, but I am not sure if it's in the nightly release yet.
It's not in the release yet. It will be when this PR is merged.
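(A quick way to check whether the installed version already appends EOS, so the manual `</s>` can be dropped:)

```python
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
ids = tok.encode("some input text")

# True once the tokenizer auto-appends </s>;
# on older releases you still need to append it yourself
print(ids[-1] == tok.eos_token_id)
```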
I’m scared to merge it because adding </s> to inputs seems to lead to truncated translations for en-fr (without finetuning) and I don’t know why.
The summaries look fine.
This is weird; as I said previously in an issue, in my experiments not adding `</s>` gave really bad results. Maybe instead of adding it automatically, we can state explicitly in the docs that `</s>` is necessary when fine-tuning T5.