Summarization on long documents

echatzikyriakidis · June 2, 2021, 4:16pm

Nice @Kwame. What your implementation produces is actually overlapping chunks. But I don’t think it is OK to cut a sentence in half. My implementation cuts the text into chunks so that they can be summarized by a model, but it never splits a sentence into two parts. If a sentence cannot be added to the current chunk, it is transferred to the next chunk.

usama · June 22, 2021, 5:03am

you can use this approach for your abstractive summarization

lewtun · September 23, 2021, 9:33pm

OpenAI just published a pretty amazing piece of work showing how they combined reinforcement learning with human feedback to summarise entire books

Unfortunately, the models don’t appear to be open sourced, but the general technique is very cool and fun to read

h/t @lvwerra for telling me about this paper

saprativa · September 25, 2021, 10:01am

To be more specific we can use

tokenizer.max_len_single_sentence

as the rest of the space is taken up by the special tokens expected by the model. If I am not wrong

tokenizer.model_max_length = tokenizer.max_len_single_sentence + tokenizer.num_special_tokens_to_add()
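A minimal sketch of that relationship, written against any object that exposes `model_max_length` and `num_special_tokens_to_add()`, as Hugging Face tokenizers do. The `StubTokenizer` class and its numbers are assumptions, used only so the snippet runs without downloading a checkpoint:

```python
def content_token_budget(tokenizer):
    """Tokens left for text once the model's special tokens are accounted for."""
    return tokenizer.model_max_length - tokenizer.num_special_tokens_to_add(pair=False)


class StubTokenizer:
    """Hypothetical stand-in for a real tokenizer (illustration only)."""
    model_max_length = 512

    def num_special_tokens_to_add(self, pair=False):
        # Assumed values: one special token for a single sequence, two for a pair.
        return 2 if pair else 1


budget = content_token_budget(StubTokenizer())
print(budget)  # 511 for this stub
```

With a real tokenizer, the returned value should match `tokenizer.max_len_single_sentence`.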

saprativa · September 25, 2021, 2:01pm

How exactly are you delimiting the sentence boundaries?

saprativa · September 26, 2021, 6:10am

@MoritzLaurer's code is really cool. But it will most likely strip off the special tokens, and it’s only going to pass the input_ids to the model. I thought, why not pass the model whatever it originally saw during training? So I have rewritten the logic below and also taken care not to break sentences into parts, with the help of nltk. @echatzikyriakidis

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import nltk
nltk.download('punkt')

checkpoint = "google/pegasus-xsum"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

long_text = "This is a very very long text. " * 300


sentences = nltk.tokenize.sent_tokenize(long_text)

# initialize
length = 0
chunk = ""
chunks = []
for sentence in sentences:
  sentence_length = len(tokenizer.tokenize(sentence)) # no. of tokens in this sentence
  combined_length = sentence_length + length # add the no. of sentence tokens to the length counter

  if combined_length <= tokenizer.max_len_single_sentence: # if it doesn't exceed
    chunk += sentence + " " # add the sentence to the chunk
    length = combined_length # update the length counter

  else:
    if chunk: # save the full chunk (skip it if it is still empty)
      chunks.append(chunk)

    # start a new chunk with the overflow sentence
    chunk = sentence + " "
    length = sentence_length

# save the last chunk; this also covers the case where the overflow sentence is the last sentence
if chunk:
  chunks.append(chunk)

# inputs
inputs = [tokenizer(chunk, return_tensors="pt") for chunk in chunks]

# print summary
for model_inputs in inputs: # avoid shadowing the built-in input()
  output = model.generate(**model_inputs)
  print(tokenizer.decode(output[0], skip_special_tokens=True))

and the output is:

This is a very very long text.
This is a very very long text.
This is a very very long text.
This is a very very long text.
This is a very very long text.

Some intermediate results:

[len(tokenizer(c.strip()).input_ids) for c in chunks]

gives:

[505, 505, 505, 505, 385]

which are well within tokenizer.model_max_length of 512.

Do let me know if anything seems weird.

SergeySypalo · October 19, 2021, 10:04am

@saprativa your code is almost the same as what I tried to write myself, thanks so much. There is only one last bit I’m missing.

I want to merge several articles on the same topic, so the input text can be very long. I see your code splits the whole text into sentences to fit the model, but the result I got contains several similar sentences.

So I’m looking at how to sort all the sentences by similarity before making a summary. I came across this video, Sentence Similarity using HuggingFace's Sentence Transformers v2 - YouTube, but decided to ask here anyway for your opinion.

Is using cosine_similarity the only way to do this? (I guess I need to compare each sentence with the remaining ones and sort the list based on that.) Ideally I would do this on a paragraph basis, but the original articles might not have the same paragraphs, so for simplicity I’m trying to do this just for a list of sentences.

I was also looking at this model, sentence-transformers/paraphrase-xlm-r-multilingual-v1 · Hugging Face, but since I’m a newbie in this topic, I’m still struggling with how to combine all those solutions to achieve my goal.
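A rough sketch of the pairwise cosine-similarity idea, used here to drop near-duplicate sentences before summarizing. The `embed` function is a toy bag-of-words stand-in and the `0.8` threshold is an arbitrary assumption; in practice the vectors would come from a sentence-embedding model such as the one linked above:

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy bag-of-words vector; a real setup would call a sentence-embedding model here."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(sentences, threshold=0.8):
    """Keep a sentence only if it is not too similar to one already kept."""
    kept, kept_vectors = [], []
    for sentence in sentences:
        vector = embed(sentence)
        if all(cosine_similarity(vector, other) < threshold for other in kept_vectors):
            kept.append(sentence)
            kept_vectors.append(vector)
    return kept


sentences = [
    "The central bank raised interest rates on Tuesday.",
    "On Tuesday the central bank raised interest rates.",
    "Analysts expect inflation to slow next year.",
]
unique_sentences = drop_near_duplicates(sentences)
print(unique_sentences)
```

This compares each sentence only against the ones already kept, which preserves the original order of the articles. Sorting by similarity would need the full pairwise matrix instead.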

ArnauC · November 18, 2021, 8:25am

Quick question: for the pretrained model google/bigbird-pegasus-large-bigpatent, do we know how they managed to do the training?
The patents tend to be way longer than 4096 tokens…

nielsr · November 18, 2021, 11:42am

Are you sure? According to the paper, BigPatent has an average input length similar to PubMed.

We do have an example notebook of evaluating BigBird on PubMed summarization here.

silvia-casola · November 27, 2021, 9:39pm

BigPatent actually just took the patent description and used it to generate the patent abstract, so it does not take the whole patent as input (that would be much longer, as you suggested).

kmfoda · December 8, 2021, 5:42pm

Hi all, found this thread very relevant to an exercise I’m undertaking so I thought I’d share my thought process and get your views on it. I’m looking to train a model on the BookSum dataset (same dataset OpenAI used in this paper) to be able to summarise book chapters.

Instead of using the summaries of summaries approach I was looking to use models converted to a LongFormer format to summarise entire chapters in one go. My thinking was to undertake the following experiments:

Convert t5-3b to a longformer encoder decoder format and finetune on BookSum
Fine-tune Pegasus_with_Longformer_summarization on BookSum
Port model from the Efficient Attentions for Long Document Summarization Paper to HF and fine-tune on BookSum

What do you all think?
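For reference, the summaries-of-summaries approach mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not the method from the OpenAI paper: `toy_summarize` is a hypothetical stand-in for a model call, and chunks are counted in words rather than tokens to keep the snippet self-contained:

```python
def chunk_words(text, max_words):
    """Split text into consecutive pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_recursively(text, summarize, max_words):
    """Summarize each chunk, join the summaries, and repeat until the text fits in one chunk."""
    while len(text.split()) > max_words:
        summaries = [summarize(chunk) for chunk in chunk_words(text, max_words)]
        shorter = " ".join(summaries)
        if len(shorter.split()) >= len(text.split()):
            break  # the summarizer is not shrinking the text; stop instead of looping forever
        text = shorter
    return summarize(text)


def toy_summarize(chunk):
    """Hypothetical stand-in for a model call: keeps the first five words."""
    return " ".join(chunk.split()[:5])


book = " ".join(f"word{i}" for i in range(1000))
summary = summarize_recursively(book, toy_summarize, max_words=50)
print(summary)
```

A long-input model, as proposed in the experiments above, would replace the loop with a single call on the whole chapter.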

tkon3 · December 28, 2021, 4:26pm

Hi,
Maybe you can try this model, there is a link to a conversion script at the end of the readme to convert BART models for longer sequences

Ayham · December 29, 2021, 2:28pm

Hi there, I’ve recently published a survey paper on Abstractive Text Summarization for both short and long documents. As a result, and to the best of my knowledge, the top-performing models for long document summarization on the three popular datasets (arXiv, PubMed, BigPatent) are LSH and BIGBIRD-Pegasus

The ROUGE scores of the best models in this field, as reported in the paper, are shown in the image below.

Good luck in your projects!

[Image: long text summarization best models]

anon89001014 · October 28, 2022, 8:38pm

can you write a full code?

nltk missing
bart_tokenizer missing
bart_model missing

anon89001014 · October 28, 2022, 8:41pm

This is not working when I try to run it as shown below:



import logging
from transformers import pipeline



#summarizer = pipeline("summarization", model="csebuetnlp/mT5_multilingual_XLSum")

f = open("TextFile1.txt", "r")

ARTICLE = f.read()

#print(summarizer(ARTICLE, max_length=900, do_sample=False))

#
#summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn" )

def do_strided_tokenization(the_content,
                        tokenizer,
                        number_token_strides = 2,
                        maximum_number_of_tokens = None):
    if not maximum_number_of_tokens:
        maximum_number_of_tokens = tokenizer.model_max_length
    strided_tokenized_content = None
    the_input_ids =\
        tokenizer.batch_encode_plus(
                [the_content], 
                 return_overflowing_tokens=True,
                 truncation=True,
                 max_length=maximum_number_of_tokens,
                 stride=number_token_strides
            )['input_ids']

    strided_tokenized_content =\
        tokenizer.batch_decode(the_input_ids)
    
    return strided_tokenized_content

test_string = 'The red fox jumped over the blue, rainbow.'

print(
    do_strided_tokenization(test_string,
                            summarizer.summarizer.tokenizer,
                            number_token_strides=2,
                            maximum_number_of_tokens=5)
)
print(summarizer.summarizer.tokenizer.model_max_length)
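One thing worth checking in the snippet above: an object returned by `pipeline(...)` exposes its tokenizer as `summarizer.tokenizer`, so `summarizer.summarizer.tokenizer` is likely to fail with an AttributeError. Separately, here is a minimal sketch of what strided chunking does, written on a plain list of token ids so that it runs without a model. It ignores special tokens, which a real tokenizer also reserves room for, and `strided_windows` is a hypothetical helper, not a library function:

```python
def strided_windows(token_ids, max_length, stride):
    """Split token_ids into windows of at most max_length that overlap by `stride` tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must be >= 0 and smaller than max_length")
    step = max_length - stride  # how far each new window advances
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the last window already reaches the end of the input
    return windows


token_ids = list(range(12))  # stand-in for real token ids
windows = strided_windows(token_ids, max_length=5, stride=2)
print(windows)
```

Each window repeats the last `stride` tokens of the previous one, which is the overlap discussed earlier in the thread.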

anon89001014 · October 28, 2022, 8:50pm

echatzikyriakidis:

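import os
from distutils.util import strtobool  # assumed source of strtobool; distutils was removed in Python 3.12

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# BaseTextSummarizer and sentence_segmentation are the author's own helpers and are not shown in this thread.
class TransformersTextSummarizer(BaseTextSummarizer):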
    def __init__ (self, model_key, language):
        self._tokenizer = AutoTokenizer.from_pretrained(model_key)

        self._language = language

        self._model = AutoModelForSeq2SeqLM.from_pretrained(model_key)

        self._device = 'cuda' if bool(strtobool(os.getenv('USE_GPU'))) else 'cpu'

    def __chunk_text(self, text):
        sentences = [ s + ' ' for s in sentence_segmentation(text, minimum_n_words_to_accept_sentence=1, language=self._language) ]

        chunks = []

        chunk = ''

        length = 0

        for sentence in sentences:
            tokenized_sentence = self._tokenizer.encode(sentence, truncation=False, max_length=None, return_tensors='pt') [0]

            if len(tokenized_sentence) > self._tokenizer.model_max_length:
                continue

            length += len(tokenized_sentence)

            if length <= self._tokenizer.model_max_length:
                chunk = chunk + sentence
            else:
                chunks.append(chunk.strip())
                chunk = sentence
                length = len(tokenized_sentence)

        if len(chunk) > 0:
            chunks.append(chunk.strip())

        return chunks

    def __clean_text(self, text):
      if text.count('.') == 0:
        return text.strip()

      end_index = text.rindex('.') + 1

      return text[0 : end_index].strip()

    def summarize(self, text, *args, **kwargs):
        chunk_texts = self.__chunk_text(text)

        chunk_summaries = []

        for chunk_text in chunk_texts:
            input_tokenized = self._tokenizer.encode(chunk_text, return_tensors='pt')

            input_tokenized = input_tokenized.to(self._device)

            summary_ids = self._model.to(self._device).generate(input_tokenized, length_penalty=3.0, min_length = int(0.2 * len(chunk_text)), max_length = int(0.3 * len(chunk_text)), early_stopping=True, num_beams=5, no_repeat_ngram_size=2)

            output = [self._tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in summary_ids]

            chunk_summaries.append(output)

        summaries = [ self.__clean_text(text) for chunk_summary in chunk_summaries for text in chunk_summary ]

        return summaries

can you post a full working example with input and output and imports

timmyTypeError · November 23, 2022, 4:46pm

@MoritzLaurer I really like your idea of chopping up tokenizer’s [“input_ids”][0] into chunks and unsqueezing each of them. When I tried it myself, however, I bumped into a huge issue:

“DefaultCPUAllocator: not enough memory”

I googled how pytorch may throw this error, and it seems to me that it’s because a tensor is so big that no reasonable RAM can handle it. Will it be correct to narrow down the source of the problem at torch.unsqueeze()? How can I chop up or even pickle individual tensor even more just so my computer can run the code without dying?

P.S: Anyone is welcome to take a stab at educating me
Thanks!

pleonova · December 24, 2022, 6:27pm

A slight modification to dipanjanS’s solution (btw thank you so much for sharing it).

I didn’t want to use any tokenizer for a different use case, so this is a workaround.

import re

def create_nested_sentences(document, token_max_length):
    nested = []
    sent = []
    length = 0
    for sentence in re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', document.replace("\n", ' ')):
        length += sentence.count(" ") # Rough estimate of full word tokens
        if length < (token_max_length * 0.9): # Add a buffer since these tokens are unlikely to be the same as a LM
            sent.append(sentence)
        else:
            if sent: # Avoid saving an empty group when the first sentence is already too long
                nested.append(sent)
            sent = [sentence]
            length = sentence.count(" ") # Start the new group's count with this sentence
    if sent:
        nested.append(sent)
    return nested

Here’s a version using a specific transformer: models.py · pleonova/multi-label-summary-text at main

AndreLearning · December 29, 2022, 11:45pm

Thank you very much for enriching this post so much with your deep review!

ananddeshpande · July 6, 2023, 12:08pm

Hi,

I have a large medical record document (an attending medical statement) which has a lot of tables. I need to create a summary document for this large PDF. Any idea which model would be best for this and what approach I should follow?
