Summarization on long documents

Hi to all!

I am facing a problem: how can someone summarize a very long text? I mean a very long text that also keeps growing. It is a concatenation of many smaller texts. I see that many of the models have a maximum input length, so they either process only part of the text or don’t work at all.

So, what is the correct way to use these models with long documents?

A code snippet with an example of how to handle long documents with the existing models would be perfect to start with!

Thank you!

@sshleifer

You can try extractive summarisation followed by abstractive summarisation. In the extractive step you rank the sentences and pick the top k, then keep as many of those (the top n) as fit within the model's max length.

Another way is successive abstractive summarisation: summarise the text in chunks of the model's max length, then summarise the combined chunk summaries again, repeating until you reach the length you want. This method will be very expensive.
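The successive approach could be sketched roughly as follows. This is only an illustration under assumptions: `split_into_chunks`, `summarize_chunk` and `length_of` are hypothetical callables standing in for a real chunker, a real model call and a tokenizer-based length count, and the toy helpers at the bottom exist only so the loop can be run without a model.

```python
from typing import Callable, List


def summarize_until_short(
    text: str,
    split_into_chunks: Callable[[str], List[str]],
    summarize_chunk: Callable[[str], str],
    length_of: Callable[[str], int],
    target_length: int,
    max_rounds: int = 10,
) -> str:
    """Summarize chunk by chunk, then repeat on the joined summaries."""
    for _ in range(max_rounds):
        current_length = length_of(text)
        if current_length <= target_length:
            break
        summaries = [summarize_chunk(chunk) for chunk in split_into_chunks(text)]
        shorter = " ".join(summaries)
        if length_of(shorter) >= current_length:
            break  # the summarizer is no longer shrinking the text; stop
        text = shorter
    return text


# Toy stand-ins (hypothetical, not real models):
def toy_split(text: str, size: int = 4) -> List[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def toy_summarize(chunk: str) -> str:
    return " ".join(chunk.split()[:2])  # "summary" = first two words


def word_count(text: str) -> int:
    return len(text.split())
```

Every round calls the model once per chunk, which is where the cost mentioned above comes from.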

You can also combine the first and second methods.

Yeah, I have never done either, but the second method would be very easy to code.
For each document: split it into groups of ~500 words, generate a 15-word summary for each group, and blindly combine the summaries.
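A minimal sketch of that recipe, assuming a hypothetical `summarize_chunk` callable that wraps whatever model you use (for example a call that asks for a roughly 15-word summary). Groups are cut on whitespace, so a real version would probably prefer to cut on sentence boundaries.

```python
from typing import Callable, List


def split_into_word_groups(document: str, words_per_group: int = 500) -> List[str]:
    """Split a document into consecutive groups of at most `words_per_group` words."""
    words = document.split()
    return [
        " ".join(words[i:i + words_per_group])
        for i in range(0, len(words), words_per_group)
    ]


def summarize_document(
    document: str,
    summarize_chunk: Callable[[str], str],  # hypothetical model wrapper
    words_per_group: int = 500,
) -> str:
    """Summarize each word group independently and blindly join the results."""
    groups = split_into_word_groups(document, words_per_group)
    return " ".join(summarize_chunk(group) for group in groups)
```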

There is also lots of ongoing research into using Longformer for summarization, but I’m not quite sure where that stands. @valhalla or @patrickvonplaten may know.

Thank you all!

Hi @pratikbhavsar @sshleifer !

I want to implement the first approach but I have a question.

Here is my function for combining the top K sentences from the extractive summarization.

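def concat_sentences_till_max_length(top_n_sentences, max_length):
  text = ''

  for s in top_n_sentences:
    # no separator before the first sentence, so the result has no leading space
    candidate = s if not text else text + " " + s
    if len(candidate) <= max_length:
      text = candidate

  return text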

However, I don’t know how to get the max input length of the abstractive summarizer.

I am using a summarization pipeline and I want to make it work for any transformers model.

Is the following the correct way to do it?

pipe = pipeline("summarization", model=model, framework='pt')
document = concat_sentences_till_max_length(top_n_sentences, pipe.model.config.max_length)

I think it’s harder to control lengths through pipeline.

  • I would look at the run_eval.py logic
  • model_max_length is in token space, not character space: len(text) != len(tokenizer(text)["input_ids"])
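To make the token-versus-character point concrete, here is a hedged sketch of the same packing idea with the budget counted in tokens. `count_tokens` is a hypothetical callable; with a transformers tokenizer it could be something like `lambda s: len(tokenizer.encode(s, add_special_tokens=False))`. `reserved_tokens` is an assumed allowance for special tokens such as BOS/EOS. Summing per-sentence counts only approximates the length of the joined text, so it is worth re-checking the final string with the real tokenizer.

```python
from typing import Callable, List


def concat_sentences_within_token_budget(
    sentences: List[str],
    count_tokens: Callable[[str], int],  # hypothetical token counter
    max_tokens: int,
    reserved_tokens: int = 2,  # assumed room for special tokens
) -> str:
    """Keep sentences, in order, while their summed token count fits the budget."""
    budget = max_tokens - reserved_tokens
    selected = []
    used = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if used + n > budget:
            continue  # too long to fit; a later, shorter sentence may still fit
        selected.append(sentence)
        used += n
    return " ".join(selected)
```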

Just found out about the forums, so far I’ve mostly been in the GH Issues page. Glad to see some useful content here!

This is what I have been using currently, mostly chunking and generating summaries, but definitely open to hearing some good approaches on this.

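import nltk  # sent_tokenize requires NLTK's sentence tokenizer data to be installed

# group sentences into chunks of fewer than 1024 characters
# (len(sentence) counts characters, which only approximates the token count)
def nest_sentences(document):
  nested = []
  sent = []
  length = 0
  for sentence in nltk.sent_tokenize(document):
    length += len(sentence)
    if length < 1024:
      sent.append(sentence)
    else:
      nested.append(sent)
      sent = []
      length = 0

  if sent:
    nested.append(sent)
  return nested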

# generate one summary per chunk; inputs over the model limit are truncated
# (assumes bart_tokenizer and bart_model are an already loaded BART tokenizer and model)
def generate_summary(nested_sentences):
  device = 'cuda'
  bart_model.to(device)  # move the model once instead of on every iteration
  summaries = []
  for nested in nested_sentences:
    input_tokenized = bart_tokenizer.encode(' '.join(nested), truncation=True, return_tensors='pt')
    input_tokenized = input_tokenized.to(device)
    summary_ids = bart_model.generate(input_tokenized,
                                      length_penalty=3.0,
                                      min_length=30,
                                      max_length=100)
    output = [bart_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]
    summaries.extend(output)  # keep a flat list of summary strings
  return summaries

Had originally posted it here

This is great! You can try distilbart-cnn/distilbart-xsum variants to see which type of summaries you like more!

Hi @dipanjanS!

This is great!

Let’s say now that I use AutoTokenizer and AutoModel to load a tokenizer and a model in a model-agnostic way. How can I find out the max input length? Instead of hardcoding 1024, how can I get the value for a specific model? Maybe by reading it from some configuration?

Any help would be appreciated.

Thanks!

You can get the max length using

tokenizer.model_max_length
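One caveat, as far as I know: when a checkpoint does not configure a limit, `model_max_length` comes back as a very large placeholder integer rather than a usable length. A small guard like the one below might help; `resolve_max_input_length`, `fallback` and `sanity_limit` are names made up for this sketch, and the `SimpleNamespace` objects merely stand in for real tokenizers (for example one loaded with `AutoTokenizer.from_pretrained(...)`).

```python
from types import SimpleNamespace


def resolve_max_input_length(tokenizer, fallback=1024, sanity_limit=1_000_000):
    """Return tokenizer.model_max_length, or `fallback` if it is missing or implausibly large."""
    max_len = getattr(tokenizer, "model_max_length", None)
    if max_len is None or max_len > sanity_limit:
        return fallback
    return int(max_len)


# Stand-ins for real tokenizers, just to show both cases:
bounded = SimpleNamespace(model_max_length=512)
unbounded = SimpleNamespace(model_max_length=int(1e30))  # placeholder-style value

print(resolve_max_input_length(bounded))    # 512
print(resolve_max_input_length(unbounded))  # 1024 (the fallback)
```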

Sure will check this out some time!

@echatzikyriakidis thanks. The code was created for showcasing in a workshop so definitely it can be made more configurable etc. You can check the code if needed in a recent workshop I gave here.

I think as @valhalla mentioned we can use that attribute as an argument \ config for the function to reuse it across different models.

This is great! @dipanjanS @valhalla

Thank you!

@dipanjanS @valhalla I think it should be sent = [sentence] instead of sent = [].

Right now, whenever a chunk is full, the sentence that triggered the split is dropped.

Maybe this is the fix:

        if length < 1024:
            sent.append(sentence)
        else:
            nested.append(sent)
            sent = [sentence]
            length = len(sentence)
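Putting this fix together with the earlier note about token space, a possible version of the chunker is sketched below. It takes an already split list of sentences (for example from `nltk.sent_tokenize`) and a hypothetical `count_tokens` callable such as `lambda s: len(tokenizer.encode(s, add_special_tokens=False))`; both are assumptions of this sketch, not part of the original code. A single sentence longer than `max_tokens` still ends up in its own chunk and would need truncation downstream.

```python
from typing import Callable, List


def nest_sentences(
    sentences: List[str],
    count_tokens: Callable[[str], int],  # hypothetical token counter
    max_tokens: int = 1024,
) -> List[List[str]]:
    """Group sentences into chunks whose summed token counts stay within max_tokens."""
    nested = []
    current = []
    length = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # close the current chunk before it overflows, then start a new one
        # with the sentence that did not fit, so no sentence is dropped
        if current and length + n > max_tokens:
            nested.append(current)
            current = []
            length = 0
        current.append(sentence)
        length += n
    if current:
        nested.append(current)
    return nested
```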

@echatzikyriakidis May I ask which extractive model/approach you are using? Historically we have used the Gensim version of PageRank, but it looks like they are removing it in the next release. Just looking for any viable alternatives.

Thanks!

Hi @spate141!

Check out the Python module sumy; it has many alternatives.

Has anyone tried the Longformer / Reformer models that are now available?

They definitely need some fine-tuning, but I have high expectations that these methods will be much better than chunking.

Thanks, I’ll check it out.

Just out of curiosity, which summarizer do you think would work best on plain text documents? In a few of my tests I found LsaSummarizer to be the most interesting one. Thoughts?

I really don’t know. Maybe you can try them all and see the results?

What tests did you run that showed LsaSummarizer to be better?