I am trying to generate summaries using t5-small with a maximum target length of 30. My original inputs are German PDF invoices. I run OCR and concatenate the words to create the input text. The outputs should be the invoice numbers. However, even after 3 days on a V100 I get exactly 200-token-long summaries (since epoch 1 or 2 out of 300) and garbage results. The summaries look like someone shuffled the original words a little, but they do contain the invoice number somewhere near the start.
What might cause it to stick to 200 generated tokens?
Hi @marton-avrios, have a look at this issue; this might be the reason. T5Tokenizer doesn't add an eos token at the end of the text, so for now we need to manually append </s> to the end of both the source and target examples.
Yes, that seems probable. Could you give me a very short example of how to do that? I tried to find something related in the docs, but I only found information on the most common tokenizer operations. I suspect I should find the id of the EOS token and append that integer to the end of the tokenized input for both source and target?
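A minimal sketch of that idea, assuming t5-small's conventions (where the eos token is `</s>` and its id is 1; with a loaded tokenizer you would use `tokenizer.eos_token_id` instead of hard-coding it):

```python
# Hypothetical helper: append the eos id after encoding, if the tokenizer
# did not already add it. The default eos_id=1 matches t5-small's </s>.
def add_eos(token_ids, eos_id=1):
    """Return token_ids with eos_id appended unless it is already last."""
    if not token_ids or token_ids[-1] != eos_id:
        token_ids = token_ids + [eos_id]
    return token_ids

source_ids = add_eos([100, 200, 300])  # apply to both source and target ids
```

You would call this on the id lists returned by `tokenizer.encode(...)` for both the input text and the labels before batching.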
@marton-avrios, there was a trend within abstractive summarisation benchmarks which encouraged extractive-like summaries, i.e. generated summaries reproduced existing sentences and were therefore naturally longer.
As suggested by @valhalla, the XSum task was explicitly created to encourage short abstractive summaries. Since you want a multilingual model, I suggest first fine-tuning mBART or T5 on XSum, and then applying those models to your custom data.
I have my own corpus (10k observations) of short and moderate-length documents that I built and am working with. The summaries are very short, no longer than 18 words; the results are below.
|   | rouge1 | rouge2 | rougeL |
|---|--------|--------|--------|
| P | 0.517  | 0.309  | 0.496  |
| R | 0.537  | 0.321  | 0.514  |
| F | 0.523  | 0.313  | 0.502  |
I am going to try adding </s> as recommended above. I suppose I am doing something similar, in that I:

- add the summarize prefix to the main text: `df['body'] = 'summarize: ' + df['body']`
- add a pad token at the first position of the summary text: `df['summary'] = df['summary'].apply(lambda x: '<pad>' + x)`
I guess the question is, though: do we still need to do this? This post suggests an update, but I am not sure if it's in the nightly release yet.
This is weird; as I said previously in an issue, in my experiments not adding </s> gave really bad results. Maybe instead of adding it automatically, we could state explicitly in the docs that </s> is necessary when fine-tuning T5.