Summarization on long documents

Nice, @Kwame. What your implementation has is actually overlapping chunks, but I don’t think it is OK to cut a sentence in half. My implementation cuts the text into chunks so that they can be summarized by a model, but it never splits a sentence in two. If a sentence cannot be added to the current chunk, it is transferred to the next chunk.

You can use this approach for your abstractive summarization.
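A minimal sketch of the idea described above, not the poster's actual implementation: whole sentences are packed greedily into chunks, and a sentence that does not fit is carried over to the next chunk. The regex sentence splitter, the whitespace token count, and the names `split_sentences` and `chunk_sentences` are assumptions for illustration; a real pipeline would count tokens with the model's own tokenizer.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_sentences(sentences, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily pack whole sentences into chunks of at most max_tokens tokens.

    A single sentence longer than max_tokens is kept whole in a chunk of its own.
    """
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk if this sentence would push it over the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:  # do not lose the final, partially filled chunk
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five. Six seven eight nine. Ten."
chunks = chunk_sentences(split_sentences(text), max_tokens=5)
print(chunks)
```

Passing a different `count_tokens` callable (for example one based on a model tokenizer) changes the budget unit without touching the packing logic.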


OpenAI just published a pretty amazing piece of work showing how they combined reinforcement learning with human feedback to summarise entire books :exploding_head:

Unfortunately, the models don’t appear to be open sourced, but the general technique is very cool and fun to read :slight_smile:

h/t @lvwerra for telling me about this paper


To be more specific, we can use

tokenizer.max_len_single_sentence

as the chunk size, since the rest of the space is taken up by the special tokens expected by the model. If I am not wrong,

tokenizer.model_max_length = tokenizer.max_len_single_sentence + tokenizer.num_special_tokens_to_add()

How exactly are you delimiting the sentence boundaries?

@MoritzLaurer's code is really cool, but it will most likely strip off the special tokens, and it only passes the input_ids to the model. I thought, why not pass the model whatever it originally saw during training? So I have rewritten the logic below and also taken care not to break sentences into parts, with the help of nltk. @echatzikyriakidis

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import nltk
nltk.download('punkt')  # newer NLTK releases may need nltk.download('punkt_tab') instead

checkpoint = "google/pegasus-xsum"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

long_text = "This is a very very long text. " * 300


sentences = nltk.tokenize.sent_tokenize(long_text)

# initialize
length = 0
chunk = ""
chunks = []
for sentence in sentences:
  sentence_length = len(tokenizer.tokenize(sentence)) # no. of tokens in this sentence

  if length + sentence_length <= tokenizer.max_len_single_sentence: # if it still fits
    chunk += sentence + " " # add the sentence to the chunk
    length += sentence_length # update the length counter
  else:
    if chunk: # avoid saving an empty chunk when the very first sentence overflows
      chunks.append(chunk) # save the full chunk
    # start a new chunk with the overflow sentence
    chunk = sentence + " "
    length = sentence_length

# save the last chunk (this also covers the case where the final sentence overflowed)
if chunk:
  chunks.append(chunk)

# inputs (truncation guards against a single sentence longer than the model limit)
inputs = [tokenizer(chunk, return_tensors="pt", truncation=True) for chunk in chunks]

# print summary
for model_input in inputs:
  output = model.generate(**model_input)
  print(tokenizer.decode(output[0], skip_special_tokens=True))

and the output is:

This is a very very long text.
This is a very very long text.
This is a very very long text.
This is a very very long text.
This is a very very long text.

Some intermediate results:

[len(tokenizer(c.strip()).input_ids) for c in chunks]

gives:

[505, 505, 505, 505, 385]

which are well within tokenizer.model_max_length of 512.

Do let me know if anything seems weird.


@saprativa your code is almost the same as what I tried to write myself, thanks so much. There is only one last bit I’m missing.

I want to merge several articles on the same topic, so the input text can be very long. I see that your code splits the whole text into sentences to fit the model, but the result I got contains several similar sentences.

So I’m looking at how to sort all sentences by similarity before making a summary. I came across this video, Sentence Similarity using HuggingFace's Sentence Transformers v2 - YouTube, but decided to ask here anyway for your opinion.

Is using cosine_similarity the only way to do this? I guess I need to compare each sentence with the remaining ones and sort the list based on that. Ideally I would do this on a paragraph basis, but the original articles might not share the same paragraphs, so for simplicity I’m trying to do this just for a list of sentences.

I was also looking at this model, sentence-transformers/paraphrase-xlm-r-multilingual-v1 · Hugging Face, but since I’m a newbie in this topic, I’m still struggling with how to combine all those solutions to achieve my goal.
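One possible way to handle repeated content, sketched under loud assumptions: instead of sorting, drop any sentence whose cosine similarity to an already kept sentence exceeds a threshold, then chunk and summarize what remains. The `embed` function here is a toy bag-of-words stand-in so the example runs without any model; with a sentence-embedding model one would swap in its vectors. The names `embed`, `drop_near_duplicates`, and the 0.8 threshold are hypothetical choices, not something from the thread.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence embedding: lowercase bag-of-words counts."""
    return Counter(sentence.lower().replace(".", "").split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_near_duplicates(sentences, threshold=0.8):
    """Keep a sentence only if it is not too similar to one already kept."""
    kept, kept_vectors = [], []
    for sentence in sentences:
        vector = embed(sentence)
        # Compare against every kept sentence (quadratic in the worst case).
        if all(cosine_similarity(vector, v) < threshold for v in kept_vectors):
            kept.append(sentence)
            kept_vectors.append(vector)
    return kept


sentences = [
    "The central bank raised interest rates today.",
    "Today the central bank raised interest rates.",
    "Markets fell sharply after the announcement.",
]
unique = drop_near_duplicates(sentences)
print(unique)
```

The filter preserves the original sentence order, which tends to matter for the coherence of the text that is later passed to the summarizer.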

Quick question: for the pretrained model google/bigbird-pegasus-large-bigpatent, do we know how they managed to do the training?
Patents tend to be way longer than 4096 tokens…

Are you sure? According to the paper, BigPatent has an average input length similar to PubMed.

We do have an example notebook of evaluating BigBird on PubMed summarization here.


BigPatent actually just took the patent description and used it to generate the patent abstract, so it does not take the whole patent as input (that would be much longer, as you suggested).

Hi all, found this thread very relevant to an exercise I’m undertaking so I thought I’d share my thought process and get your views on it. I’m looking to train a model on the BookSum dataset (same dataset OpenAI used in this paper) to be able to summarise book chapters.

Instead of using the summaries-of-summaries approach, I was looking to use models converted to a Longformer format to summarise entire chapters in one go. My thinking was to undertake the following experiments:

  1. Convert t5-3b to a Longformer encoder-decoder format and fine-tune on BookSum
  2. Fine-tune Pegasus_with_Longformer_summarization on BookSum
  3. Port the model from the Efficient Attentions for Long Document Summarization paper to HF and fine-tune on BookSum

What do you all think?

Hi,
Maybe you can try this model. There is a link to a conversion script at the end of the README to convert BART models for longer sequences.

Hi there, I’ve recently published a survey paper on Abstractive Text Summarization for both short and long documents. To the best of my knowledge, the top-performing models for long document summarization on the three popular datasets (arXiv, PubMed, BigPatent) are LSH and BIGBIRD-Pegasus.

The ROUGE scores of the best models in this field, as reported in the paper, are shown in the image below.

Good luck in your projects!

[Image: long text summarization best models]

Can you post the full code? These are missing:

nltk
bart_tokenizer
bart_model

It does not work when I try to run it like below:



from transformers import pipeline

# alternative model:
#summarizer = pipeline("summarization", model="csebuetnlp/mT5_multilingual_XLSum")

# read the article to summarize; the with-block closes the file afterwards
with open("TextFile1.txt", "r") as f:
    ARTICLE = f.read()

#print(summarizer(ARTICLE, max_length=900, do_sample=False))

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def do_strided_tokenization(the_content,
                        tokenizer,
                        number_token_strides = 2,
                        maximum_number_of_tokens = None):
    if not maximum_number_of_tokens:
        maximum_number_of_tokens = tokenizer.model_max_length
    strided_tokenized_content = None
    the_input_ids =\
        tokenizer.batch_encode_plus(
                [the_content], 
                 return_overflowing_tokens=True,
                 truncation=True,
                 max_length=maximum_number_of_tokens,
                 stride=number_token_strides
            )['input_ids']

    strided_tokenized_content =\
        tokenizer.batch_decode(the_input_ids)
    
    return strided_tokenized_content

test_string = 'The red fox jumped over the blue, rainbow.'

# the pipeline exposes its tokenizer as summarizer.tokenizer
print(
    do_strided_tokenization(test_string,
                            summarizer.tokenizer,
                            number_token_strides=2,
                            maximum_number_of_tokens=5)
)
print(summarizer.tokenizer.model_max_length)

Can you post a full working example with input, output, and imports?
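For reference, the overlapping-window behaviour that a stride requests can be illustrated without downloading any model. This is a minimal sketch in plain Python over a list of token ids, not the tokenizer's own implementation: `sliding_windows` is a hypothetical helper, `overlap` plays the role of the number of tokens shared between consecutive windows, and special tokens (which a real tokenizer also counts against its maximum length) are ignored.

```python
def sliding_windows(token_ids, max_length, overlap):
    """Split token_ids into windows of at most max_length tokens.

    Consecutive windows share `overlap` tokens.
    """
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    step = max_length - overlap  # how far each new window advances
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the input
    return windows


token_ids = list(range(10))  # stand-in for real token ids
windows = sliding_windows(token_ids, max_length=5, overlap=2)
print(windows)
# [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7], [6, 7, 8, 9]]
```

Each window could then be decoded back to text, or passed to the model directly, and summarized on its own.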

@MoritzLaurer I really like your idea of chopping up the tokenizer’s ["input_ids"][0] into chunks and unsqueezing each of them. When I tried it myself, however, I bumped into a huge issue:

"DefaultCPUAllocator: not enough memory"

I googled why PyTorch may throw this error, and it seems to happen when a tensor is too big to fit in the available RAM. Would it be correct to narrow down the source of the problem to torch.unsqueeze()? How can I chop up, or even pickle, the individual tensors further so that my computer can run the code without dying?

P.S: Anyone is welcome to take a stab at educating me :slight_smile:
Thanks!
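One way to keep memory bounded is to feed the model one small chunk at a time instead of building one large tensor. Note that torch.unsqueeze() returns a view and does not copy data, so the allocation probably fails elsewhere, for example inside the model when too many tokens are processed at once; this is a guess, since the failing code is not shown. The sketch below is plain Python: `iter_chunks`, `summarize_in_pieces`, and the `summarize_chunk` callable are hypothetical names, and `len` stands in for a real per-chunk model call.

```python
def iter_chunks(token_ids, chunk_size):
    """Lazily yield successive slices of at most chunk_size token ids."""
    for start in range(0, len(token_ids), chunk_size):
        yield token_ids[start:start + chunk_size]


def summarize_in_pieces(token_ids, chunk_size, summarize_chunk):
    """Apply summarize_chunk to one chunk at a time and collect the results.

    Only one chunk is handed to summarize_chunk at any moment, so peak
    memory depends on chunk_size rather than on the full input length.
    """
    return [summarize_chunk(chunk) for chunk in iter_chunks(token_ids, chunk_size)]


token_ids = list(range(7))  # stand-in for real token ids
lengths = summarize_in_pieces(token_ids, chunk_size=3, summarize_chunk=len)
print(lengths)
# [3, 3, 1]
```

With a real model, `summarize_chunk` would wrap the generate call for a single chunk, ideally with gradient tracking disabled during inference.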