Summarization on long documents

echatzikyriakidis · August 29, 2020, 5:57pm

Hi to all!

I am facing a problem: how can someone summarize a very long text? I mean a very long text that also keeps growing. It is a concatenation of many smaller texts. I see that many of the models have a maximum input length; beyond it they either ignore part of the text or don’t work at all.

So, what is the correct way of using these models with long documents?

A code snippet with an example of how to handle long documents with the existing models would be perfect to start with!

Thank you!

@sshleifer

pratikbhavsar · September 4, 2020, 6:38am

You can try extractive summarisation followed by abstractive. In the extractive step you choose the top k sentences, and from those you keep the top n that fit within the model’s max length.

Another way is to use successive abstractive summarisation, where you summarise in chunks of the model’s max length and then summarise the combined summaries again until you reach the length you want. This method will be super expensive.

You can also combine first + second method.

sshleifer · September 7, 2020, 9:11pm

Yeah, I have never done either, but the second method would be very easy to code.
For each document: split it into groups of ~500 words, generate 15-word summaries, and blindly combine the summaries.
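
A minimal sketch of the recipe above, under stated assumptions: `split_into_word_groups`, `summarize_long_text` and `first_15_words` are hypothetical names, and the `summarize` argument stands in for whatever abstractive model call is actually used. The stand-in here just keeps the first 15 words so the example runs without a model.

```python
from typing import Callable, List


def split_into_word_groups(text: str, words_per_group: int = 500) -> List[str]:
    """Split text into consecutive groups of at most `words_per_group` words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_group])
        for i in range(0, len(words), words_per_group)
    ]


def summarize_long_text(
    text: str,
    summarize: Callable[[str], str],
    words_per_group: int = 500,
) -> str:
    """Summarize each word group independently and join the partial summaries."""
    groups = split_into_word_groups(text, words_per_group)
    return " ".join(summarize(group) for group in groups)


def first_15_words(chunk: str) -> str:
    """Stand-in summarizer for illustration only (not a real model)."""
    return " ".join(chunk.split()[:15])
```

If the combined summary is still too long, the same function could be applied to its own output again, which is the "successive" variant mentioned earlier in the thread.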

There is also lots of ongoing research into using Longformer for summarization, but I’m not quite sure where that stands. @valhalla or @patrickvonplaten may know.

echatzikyriakidis · September 8, 2020, 3:52pm

Thank you all!

echatzikyriakidis · September 13, 2020, 5:58pm

Hi @pratikbhavsar @sshleifer !

I want to implement the first approach but I have a question.

Here is my function for combining the top K sentences from the extractive summarization.

def concat_sentences_till_max_length(top_n_sentences, max_length):
    text = ''

    for s in top_n_sentences:
        if len(text + " " + s) <= max_length:
            text = text + " " + s

    return text

However, I don’t know how to get the max input length of the abstractive summarizer.

I am using a summarization pipeline and I want to make it work for any transformers model.

Is the following the correct way to do it?

pipe = pipeline("summarization", model=model, framework='pt')
document = concat_sentences_till_max_length(top_n_sentences, pipe.model.config.max_length)

sshleifer · September 13, 2020, 7:09pm

I think it’s harder to control lengths through pipeline.

I would look at the run_eval.py logic.
model_max_length is in token space, not character space: len(text) != len(tokenizer.encode(text))
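
To illustrate the point about token space, here is a hedged sketch of a token-aware variant of the helper from the question. The names `concat_sentences_within_budget` and `count_words` are hypothetical, and the whitespace counter is only a stand-in so the example runs without a model. Unlike the original helper, this version stops at the first sentence that does not fit instead of skipping it and trying later ones.

```python
from typing import Callable, List


def concat_sentences_within_budget(
    sentences: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> str:
    """Join sentences in order, stopping before the token budget is exceeded."""
    text = ""
    for sentence in sentences:
        candidate = f"{text} {sentence}" if text else sentence
        if count_tokens(candidate) > max_tokens:
            break  # stop at the first sentence that does not fit
        text = candidate
    return text


def count_words(text: str) -> int:
    """Stand-in counter (assumption): whitespace words, not real subword tokens."""
    return len(text.split())
```

With a real tokenizer, one would presumably pass something like `lambda s: len(tokenizer.encode(s))` as `count_tokens`; note that `encode` adds special tokens by default, so the count includes them.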

dipanjanS · September 18, 2020, 5:08am

Just found out about the forums, so far I’ve mostly been in the GH Issues page. Glad to see some useful content here!

This is what I have been using currently, mostly chunking and generating summaries, but definitely open to hearing some good approaches on this.

# generate chunks of text \ sentences <= 1024 tokens
def nest_sentences(document):
  nested = []
  sent = []
  length = 0
  for sentence in nltk.sent_tokenize(document):
    length += len(sentence)
    if length < 1024:
      sent.append(sentence)
    else:
      nested.append(sent)
      sent = []
      length = 0

  if sent:
    nested.append(sent)
  return nested

# generate summary on text with <= 1024 tokens
def generate_summary(nested_sentences):
  device = 'cuda'
  summaries = []
  for nested in nested_sentences:
    input_tokenized = bart_tokenizer.encode(' '.join(nested), truncation=True, return_tensors='pt')
    input_tokenized = input_tokenized.to(device)
    summary_ids = bart_model.to(device).generate(input_tokenized,
                                      length_penalty=3.0,
                                      min_length=30,
                                      max_length=100)
    output = [bart_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]
    summaries.append(output)
  summaries = [sentence for sublist in summaries for sentence in sublist]
  return summaries

Had originally posted it here

sshleifer · September 18, 2020, 1:54pm

This is great! You can try distilbart-cnn/distilbart-xsum variants to see which type of summaries you like more!

echatzikyriakidis · September 18, 2020, 6:56pm

Hi @dipanjanS!

This is great!

Let’s say now that I use AutoTokenizer and AutoModel to load a tokenizer and a model in a model-agnostic way. How can I know the max input length? Instead of hardcoding 1024, how can I get the value for a specific model? Maybe by reading it from some configuration?

Any help would be appreciated.

Thanks!

valhalla · September 18, 2020, 7:31pm

You can get the max length using

tokenizer.model_max_length
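
One caveat worth guarding against: for checkpoints whose tokenizer does not define a maximum length, `model_max_length` can come back as a very large placeholder integer rather than a usable limit. The sketch below is an assumption-laden helper, not part of the library: `resolve_max_input_length`, `fallback` and `sanity_cap` are hypothetical names, and the `SimpleNamespace` objects only imitate a tokenizer so the example runs without downloading anything.

```python
from types import SimpleNamespace


def resolve_max_input_length(tokenizer, fallback: int = 1024,
                             sanity_cap: int = 1_000_000) -> int:
    """Return tokenizer.model_max_length, or `fallback` if it looks unset."""
    max_len = getattr(tokenizer, "model_max_length", None)
    # Treat a missing or implausibly large value as "not configured".
    if max_len is None or max_len > sanity_cap:
        return fallback
    return int(max_len)


# Stand-in objects for illustration; a real tokenizer would be loaded
# with AutoTokenizer.from_pretrained(...).
bart_like = SimpleNamespace(model_max_length=1024)
unset_like = SimpleNamespace(model_max_length=int(1e30))
```

Here `resolve_max_input_length(bart_like)` returns the configured value, while `resolve_max_input_length(unset_like)` falls back to the default.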

dipanjanS · September 18, 2020, 8:06pm

Sure will check this out some time!

dipanjanS · September 18, 2020, 8:09pm

@echatzikyriakidis thanks. The code was created for showcasing in a workshop, so it can definitely be made more configurable, etc. You can check the code if needed in a recent workshop I gave here.

I think, as @valhalla mentioned, we can pass that attribute as an argument / config to the function so it can be reused across different models.

echatzikyriakidis · September 19, 2020, 10:04am

This is great! @dipanjanS @valhalla

Thank you!

echatzikyriakidis · September 20, 2020, 9:55am

@dipanjanS @valhalla I think it should be sent = [sentence] instead of sent = [].

Right now, whenever a chunk is complete, the current sentence is dropped.

Maybe this is the fix:

        if length < 1024:
            sent.append(sentence)
        else:
            nested.append(sent)
            sent = [ sentence ]
            length = len(sentence)
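
Putting the fix together with the earlier point that lengths are measured in tokens, a hedged sketch of the chunking step might look like the following. `nest_sentences_by_tokens` and `count_words` are hypothetical names; the function takes an already split list of sentences (for example, the output of `nltk.sent_tokenize`) and a `count_tokens` callable, with a whitespace counter as a stand-in so the example runs without a model.

```python
from typing import Callable, List


def nest_sentences_by_tokens(
    sentences: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[List[str]]:
    """Group sentences into chunks whose summed token counts fit max_tokens."""
    nested: List[List[str]] = []
    current: List[str] = []
    length = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk before it would overflow; the sentence
        # that triggered the overflow starts the next chunk, so none is lost.
        if current and length + n > max_tokens:
            nested.append(current)
            current = []
            length = 0
        current.append(sentence)
        length += n
    if current:
        nested.append(current)
    return nested


def count_words(text: str) -> int:
    """Stand-in counter (assumption): whitespace words, not real subword tokens."""
    return len(text.split())
```

Summing per-sentence counts is only an approximation of the length of the joined chunk, and a single sentence longer than `max_tokens` still ends up in a chunk of its own, so keeping `truncation=True` when encoding remains a sensible safety net.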

spate141 · November 16, 2020, 11:33pm

@echatzikyriakidis May I ask which extractive model/approach you are using? Historically we have used the Gensim version of PageRank, but it looks like they are removing it from the next release. Just looking for any viable alternatives.

Thanks!

echatzikyriakidis · November 17, 2020, 7:19am

Hi @spate141!

Check out the Python module sumy; it has many alternatives.

marcoabrate · November 17, 2020, 11:58am

Has anyone tried with Longformer / Reformer models now available?

They definitely need some fine-tuning but I have high expectations these methods will be much better than chunking.

spate141 · November 17, 2020, 2:29pm

Thanks, I’ll check it out.

spate141 · November 17, 2020, 8:11pm

Just out of curiosity, which summarizer do you think would work best on plain text documents? In a few of my tests I found LsaSummarizer to be the most interesting one. Thoughts?

echatzikyriakidis · November 17, 2020, 8:49pm

I really don’t know. Maybe you can try them all and see the results?

What tests did you run that showed LsaSummarizer is better?
