Summarization on long documents

Just out of curiosity, which summarizer do you think would work best on plain text documents? In a few of my tests I found LsaSummarizer to be the most interesting one. Thoughts?

I really don’t know. Maybe you can try them all and see the results?

What tests did you run to find that LsaSummarizer is better?

Yes, I tried them all and checked the generated summaries manually against input documents that I already have knowledge about. Basically, the test was just an observation of the generated summary sentences against the articles' content.

I'm not sure the ROUGE score can give me a better picture of the quality of the extractive summarization approaches, since all of them just pick sentences out of the input documents.
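
If you do want to try ROUGE anyway, here is a minimal sketch using the rouge-score package (the reference and candidate strings below are just placeholders):

from rouge_score import rouge_scorer

reference = "The article describes the new battery technology."  # human / reference summary
candidate = "The article is about a new kind of battery."        # generated summary to evaluate

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, candidate)
for name, score in scores.items():
    print(name, round(score.fmeasure, 3))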

Hi … I need to know which model works for most types of data, as I see T5, Bart, Pegasus and Bert are mainly trained on CNN, Wiki or arXiv-style data. Based on human evaluation we found T5 and Bart suitable for good abstractive summarisation. Does anyone have inputs on this?

Hi @kruthika, since the topic is summarization on long documents, I would exclude T5 a priori, since its max input length is 512, while Bart and Pegasus can be fed with at most 1024 tokens.
From my experiments with summarization of biological content, both Bart and Pegasus give very good results. Concerning Bart, using the model fine-tuned on CNN is a must, otherwise it does not output very coherent summaries (in my case). On the other hand, the general Pegasus model (google/pegasus-large) gives promising results.
Since my aim is to fine-tune a model for a specific task and the summaries I need are longer than the ones found in CNN/DM, I prefer Pegasus because even without fine-tuning it is not biased towards newsy / short summaries.

Hope this helps!

P.S. From my understanding, Bert's architecture does not have a decoder, so it cannot be used for text generation.
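
A quick way to check the input limits mentioned above is to read them off the tokenizers (a small sketch; the exact values come from each checkpoint's tokenizer config):

from transformers import AutoTokenizer

for name in ['t5-base', 'facebook/bart-large-cnn', 'google/pegasus-large']:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.model_max_length)
# e.g. 512 for t5-base and 1024 for facebook/bart-large-cnn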

@dipanjanS’s code snippet is a good option using NLTK. Here is an alternative “pure transformers” solution:

from transformers import BartTokenizer, BartForConditionalGeneration
import torch

long_text = "This is a very very long text. " * 300

model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

# tokenize without truncation
inputs_no_trunc = tokenizer(long_text, max_length=None, return_tensors='pt', truncation=False)

# get batches of tokens corresponding to the exact model_max_length
chunk_start = 0
chunk_end = tokenizer.model_max_length  # == 1024 for Bart
inputs_batch_lst = []
while chunk_start < len(inputs_no_trunc['input_ids'][0]):  # strict < avoids an empty trailing batch when the length is an exact multiple
    inputs_batch = inputs_no_trunc['input_ids'][0][chunk_start:chunk_end]  # get batch of n tokens
    inputs_batch = torch.unsqueeze(inputs_batch, 0)
    inputs_batch_lst.append(inputs_batch)
    chunk_start += tokenizer.model_max_length  # == 1024 for Bart
    chunk_end += tokenizer.model_max_length  # == 1024 for Bart

# generate a summary on each batch
summary_ids_lst = [model.generate(inputs, num_beams=4, max_length=100, early_stopping=True) for inputs in inputs_batch_lst]

# decode the output and join into one string with one paragraph per summary batch
summary_batch_lst = []
for summary_id in summary_ids_lst:
    summary_batch = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_id]
    summary_batch_lst.append(summary_batch[0])
summary_all = '\n'.join(summary_batch_lst)

print(summary_all)

# output (would of course make more sense on a sensible input):
This is a very very long text. This is avery very long texts. This has been a very long day for me. I'm going to have to take a break from this. I've got a lot of work to do. I'll be back in a few days.
This is a very very long text. This is avery very long texts. This has been a very long day for me. I'm going to have to take a break from this. I've got a lot of work to do. I'll be back in a few days.
This is a very very long text. This is avery very long texts. This has been a very, very long day. This will be a very long, very, long night. I hope you enjoy it. I will be back in a week or so with a new text.

The main advantage of this approach is that it uses the tokenization directly from the transformers tokenizer instead of an external tokenizer like NLTK. Keep in mind that most transformer models use different sub-word tokenizers, while NLTK probably uses a word-level tokenizer (see explanation here). This means that NLTK will split a string like “I have a new GPU!” into 6 tokens (one per word + punctuation), while e.g. BERT’s tokenizer will split it into 7 (['i', 'have', 'a', 'new', 'gp', '##u', '!']), because it splits rare words into sub-words (e.g. GPU). With the “pure transformers” approach you can be sure to really get the exact maximum number of tokens.
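
To make that difference concrete, here is a quick check (assumes nltk and its punkt data are installed):

from nltk.tokenize import word_tokenize
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

sentence = "I have a new GPU!"
print(word_tokenize(sentence))            # ['I', 'have', 'a', 'new', 'GPU', '!']  -> 6 word tokens
print(bert_tokenizer.tokenize(sentence))  # ['i', 'have', 'a', 'new', 'gp', '##u', '!']  -> 7 sub-word tokens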

The disadvantage is that there is no sentence boundary detection. You can theoretically solve that by splitting into sentences with NLTK (or SpaCy). But the token threshold should then probably be set below 1024 words (maybe 900?), because 1024 NLTK word tokens translate into more than 1024 sub-word tokens. Otherwise, the text gets truncated again and you effectively delete parts of your text.
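
A rough way to sanity-check that threshold, reusing long_text and the Bart tokenizer from the snippet above (just a sketch):

from nltk.tokenize import word_tokenize

words = word_tokenize(long_text)[:900]                   # first 900 NLTK word tokens
n_subwords = len(tokenizer.tokenize(' '.join(words)))    # how many Bart sub-word tokens they become
print(n_subwords)  # for real text with rare words this lands noticeably above 900, hence the margin below 1024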

I feel like summarising texts above 1024 tokens is probably a common use case, and enabling this kind of "long text summarisation" could be a very useful feature for the summarisation pipeline. Could this maybe be something that could be added to the pipeline with e.g. a keyword argument like ‘summarise_long_text=True’?
I don't know the internals of the pipeline well enough to know if this would be an easy addition or too complicated @sshleifer (also please correct me if I'm wrong about the code and explanation above, I'm also new to this).

(Btw, look at the input and the output… :smiley: is Bart getting lazy when the text is too long and monotonous? :stuck_out_tongue: )

About model.generate():

  • Is there a speed difference when you pass a batch of inputs instead of a single tokenized input to model.generate()?
    • For example, say my tokenized input is of shape (10, 1024), i.e. 10 lines of text with at most 1024 tokens each. Is it common practice to pass 10 individual batches of shape (1, 1024) to model.generate() instead of a single (10, 1024) input? (a rough padding sketch follows below)
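
For what it's worth, a single batched call is usually faster on a GPU since the chunks are processed in parallel; the catch is that chunks of different lengths need padding plus an attention mask. A rough, untested sketch reusing inputs_batch_lst from the snippet above:

import torch

def generate_batched(model, tokenizer, inputs_batch_lst, **gen_kwargs):
    # pad the per-chunk tensors into one (n_chunks, max_len) batch so that
    # generate() runs once over all chunks instead of once per chunk
    ids = [batch[0] for batch in inputs_batch_lst]  # drop the extra batch dimension
    max_len = max(len(x) for x in ids)
    pad_id = tokenizer.pad_token_id
    input_ids = torch.full((len(ids), max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros((len(ids), max_len), dtype=torch.long)
    for i, x in enumerate(ids):
        input_ids[i, :len(x)] = x
        attention_mask[i, :len(x)] = 1
    return model.generate(input_ids, attention_mask=attention_mask, **gen_kwargs)

# summary_ids = generate_batched(model, tokenizer, inputs_batch_lst,
#                                num_beams=4, max_length=100, early_stopping=True)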

@MoritzLaurer,

Your approach of using the transformers tokenizer instead of NLTK is great.

However, how do you solve the problem of sentence boundary detection?

For example, by using 900 words instead of 1024?

Currently, I use the NLTK approach and am thinking of switching to yours, but I don't know how to cut the text correctly to get chunks with complete sentences.

Can you help me here?

Stay safe.

My approach is to “explode” the given input into sentences, use the transformer tokenizer to get the length of each sentence, and calculate a nice chunking (uniform length, no split sentences). This is the function I am using:

import re

import numpy as np

# RE_SPLITTER, MODEL_MAX_LEN and tokenizer are module-level globals:
# RE_SPLITTER = r'\.(?!\d)|\n'   # split on full stops (not followed by a digit) and newlines
# MODEL_MAX_LEN = 1024           # e.g. for Bart / Pegasus
# tokenizer = ...                # the Bart or Pegasus tokenizer

def chunk_text(text, num_tok):
    # split into sentences and re-attach the full stop
    text_sent = \
        [sent.strip() + '.' for sent in re.split(RE_SPLITTER, text) if len(sent) > 1]

    # calculate number of tokens per sentence
    num_tok_sent = [len(tokenizer.tokenize(sent)) for sent in text_sent]

    # calculate chunk dimension to fit into model
    n = int(np.ceil(num_tok / MODEL_MAX_LEN))
    len_chunk = int(num_tok / n)

    # get a more uniform splitting to avoid splits
    # which are too short at the end
    if len_chunk + 50 > MODEL_MAX_LEN:
        len_chunk = int(num_tok / (n + 1))

    len_curr = 0
    text_curr = []
    text_chunk = []
    for te, len_sent in zip(text_sent, num_tok_sent):

        if len_curr + len_sent < len_chunk:
            text_curr.append(te)
            len_curr += len_sent

        elif len_curr + len_sent >= MODEL_MAX_LEN:
            text_chunk.append(text_curr)

            text_curr = [te]
            len_curr = len_sent

        else:  # >= len_chunk and < MODEL_MAX_LEN
            text_curr.append(te)
            text_chunk.append(text_curr)

            text_curr = []
            len_curr = 0

    if len_curr > 0:
        text_chunk.append(text_curr)

    return text_chunk

where RE_SPLITTER is r'\.(?!\d)|\n' (the full stop has to be escaped so it matches a literal period rather than any character)
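
For example, a hypothetical call could look like this (assuming the globals above are defined and text holds the whole document):

num_tok = len(tokenizer.tokenize(text))                        # total number of tokens in the text
chunks = chunk_text(text, num_tok)                             # list of chunks, each a list of sentences
chunk_strings = [' '.join(sentences) for sentences in chunks]  # join each chunk back into a string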

Hope this helps :slight_smile:

Yeah, as I wrote above, adding sentence boundary detection makes it more tricky. It's about splitting the text into sentences, counting the tokens of each sentence with the transformers tokenizer and then adding the right number of sentences together so that the length stays below model_max_length for each batch.
@marcoabrate's approach seems good, but I couldn't get the code to run. I don't quite understand the argument "num_tok" or which tokenizer "tokenizer.tokenize()" refers to, and the regex doesn't seem to work for me. When I change these lines I just get the text split into sentences as strings as output.

@MoritzLaurer I understand. Yes, I also didn't manage to make it work. Is this a regex that applies to most languages?

Hey @MoritzLaurer @echatzikyriakidis

The regex \.(?!\d)|\n works in Python; it just says to split where there is a full stop (not followed by a digit, to avoid splitting floating-point numbers) or a newline. Consider changing it to whatever is more suitable for you. For example, I do not have any URLs in my text, otherwise they would be a problem.

num_tok is the number of tokens in the entire input text.

The tokenizer is either Bart's or Pegasus's; it works for both. I use the tokenize function so that I do not get BOS and EOS tokens for each sentence.
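
A tiny illustration of both points (tokenizer being the Bart tokenizer here; note the escaped full stop in the pattern):

import re

RE_SPLITTER = r'\.(?!\d)|\n'  # a literal full stop not followed by a digit, or a newline

text = "The price rose to 3.14 dollars. It fell the next day.\nMarkets were calm."
print(re.split(RE_SPLITTER, text))
# ['The price rose to 3.14 dollars', ' It fell the next day', '', 'Markets were calm', '']
# (3.14 is kept intact; the empty pieces are dropped by the len(sent) > 1 check in chunk_text)

print(tokenizer.tokenize("Markets were calm."))  # sub-word tokens only, no BOS/EOS
print(tokenizer.encode("Markets were calm."))    # token ids, including BOS <s> and EOS </s>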

Friends! @sshleifer @dipanjanS @valhalla @MoritzLaurer @marcoabrate

I would like to share with you a wrapper class I use for summarization. I would like to get feedback and any ideas for improvement, mostly for the part that chunks the text into sentences and summarizes in chunks.

However, any feedback will be welcome. I want to make it better!

import os
from distutils.util import strtobool

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# BaseTextSummarizer and sentence_segmentation are defined elsewhere in the code base

class TransformersTextSummarizer(BaseTextSummarizer):
    def __init__(self, model_key, language):
        self._tokenizer = AutoTokenizer.from_pretrained(model_key)

        self._language = language

        self._model = AutoModelForSeq2SeqLM.from_pretrained(model_key)

        self._device = 'cuda' if bool(strtobool(os.getenv('USE_GPU'))) else 'cpu'

    def __chunk_text(self, text):
        sentences = [ s + ' ' for s in sentence_segmentation(text, minimum_n_words_to_accept_sentence=1, language=self._language) ]

        chunks = []

        chunk = ''

        length = 0

        for sentence in sentences:
            tokenized_sentence = self._tokenizer.encode(sentence, truncation=False, max_length=None, return_tensors='pt') [0]

            if len(tokenized_sentence) > self._tokenizer.model_max_length:
                # skip sentences that on their own exceed the model's max input length
                continue

            length += len(tokenized_sentence)

            if length <= self._tokenizer.model_max_length:
                chunk = chunk + sentence
            else:
                chunks.append(chunk.strip())
                chunk = sentence
                length = len(tokenized_sentence)

        if len(chunk) > 0:
            chunks.append(chunk.strip())

        return chunks

    def __clean_text(self, text):
      if text.count('.') == 0:
        return text.strip()

      end_index = text.rindex('.') + 1

      return text[0 : end_index].strip()

    def summarize(self, text, *args, **kwargs):
        chunk_texts = self.__chunk_text(text)

        chunk_summaries = []

        for chunk_text in chunk_texts:
            input_tokenized = self._tokenizer.encode(chunk_text, return_tensors='pt')

            input_tokenized = input_tokenized.to(self._device)

            # note: min_length / max_length count tokens, while len(chunk_text) counts characters
            summary_ids = self._model.to(self._device).generate(
                input_tokenized,
                length_penalty=3.0,
                min_length=int(0.2 * len(chunk_text)),
                max_length=int(0.3 * len(chunk_text)),
                early_stopping=True,
                num_beams=5,
                no_repeat_ngram_size=2)

            output = [self._tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in summary_ids]

            chunk_summaries.append(output)

        summaries = [ self.__clean_text(text) for chunk_summary in chunk_summaries for text in chunk_summary ]

        return summaries

Lastly, here is the sentence_segmentation implementation:

from itertools import chain

from nltk.tokenize import RegexpTokenizer, sent_tokenize


def sentence_segmentation(document, minimum_n_words_to_accept_sentence, language):
    paragraphs = list(filter(lambda o: len(o.strip()) > 0, document.split('\n')))

    paragraphs = [ p.strip() for p in paragraphs ]

    paragraph_sentences = [ sent_tokenize(p, language=language) for p in paragraphs ]

    paragraph_sentences = chain(*paragraph_sentences)

    paragraph_sentences = [ s.strip() for s in paragraph_sentences ]

    normal_word_tokenizer = RegexpTokenizer(r'[^\W_]+')

    paragraph_sentences = filter(lambda o: len(normal_word_tokenizer.tokenize(o)) >= minimum_n_words_to_accept_sentence, paragraph_sentences)

    return list(paragraph_sentences)
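
For completeness, a hypothetical usage (assuming BaseTextSummarizer is defined elsewhere in the code base, USE_GPU is set in the environment and long_text holds the document; the checkpoint is just an example):

import os

os.environ.setdefault('USE_GPU', 'false')  # the class reads this to pick cuda vs cpu

summarizer = TransformersTextSummarizer('sshleifer/distilbart-cnn-12-6', 'english')
for chunk_summary in summarizer.summarize(long_text):
    print(chunk_summary)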

Thank you all and stay safe!

Hello everyone,

There's definitely room for adding some code into the pipelines to make this work natively.

  • splitting the input into chunks of model_max_length
  • summarizing each chunk
  • concatenating the summaries back together

is a valid way to go about it. Extractive, then abstractive summarization is the other best alternative. However, if you have a very small trailing chunk, the summarization output tends to be garbage, so you should definitely ignore it (it probably won't change the overall meaning of the original text). Also, this simple chunking approach might split important context across two different batches, making it hard on the summarization model. But it's probably not worth investing too much in this, as that risk is low as long as the text is not already a summary (i.e. very dense in important information).

We will start integrating it back into transformers, but it will take time (because code quality and backward compatibility are a priority).

FYI we’re enabling this automatically in our hosted API inference (where we can make faster iterations): https://huggingface.co/pricing

Hi @Narsil

So you suggest skipping the last chunk? What if we try to generate the summary with:

min_length = int(0.2 * len(chunk_text)), max_length = int(0.3 * len(chunk_text))

How would you improve that implementation?

It's hard to know without knowing how you trained/fine-tuned the model at that point. The model is likely to perform poorly on incoming data that does not resemble the training data.

All I was saying is that truncating is sometimes a valid strategy (especially if your input is only slightly above your actual max_length, something like < 1.1 * max_length). The overall summary quality is better than doing summarization on a very small chunk (< 0.1 * max_length), which is most likely to simply repeat the input, leaving you with a good summary concatenated with the end of the article.
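
A rough sketch of those heuristics in code (the 1.1 and 0.1 factors are the approximate thresholds mentioned, not hard rules; input_ids is the full, untruncated list of token ids):

def plan_chunks(input_ids, max_length=1024):
    n = len(input_ids)
    if n < 1.1 * max_length:
        # only slightly too long: truncating beats summarizing a tiny leftover chunk
        return [input_ids[:max_length]]
    chunks = [input_ids[i:i + max_length] for i in range(0, n, max_length)]
    if len(chunks[-1]) < 0.1 * max_length:
        # a very small trailing chunk tends to produce a garbage summary, so drop it
        chunks = chunks[:-1]
    return chunks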

As always the best way is still to try different options and see what works best for your use case on your data.

Hi @Narsil,

I understand. I haven’t trained/fine-tuned any model. I use sshleifer/distilbart-cnn-12-6 and I try to summarize responses to questions.

Hope it helps now.

Can we not use LED for this task? The documentation indicates that it's designed for long documents.

But my results with LED have not been encouraging; it seems to repeat a lot of phrases. Any additional suggestions would be useful.

I was interested in using LED to summarize regulatory financial reports of public companies.
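
For reference, a minimal sketch of feeding LED a long report directly (long_report is a placeholder string; allenai/led-large-16384-arxiv is one public checkpoint, and no_repeat_ngram_size may help a bit with the repetition mentioned above):

import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

tokenizer = LEDTokenizer.from_pretrained('allenai/led-large-16384-arxiv')
model = LEDForConditionalGeneration.from_pretrained('allenai/led-large-16384-arxiv')

inputs = tokenizer(long_report, return_tensors='pt', truncation=True, max_length=16384)

# LED expects global attention on at least the first token
global_attention_mask = torch.zeros_like(inputs['input_ids'])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    global_attention_mask=global_attention_mask,
    num_beams=4,
    no_repeat_ngram_size=3,
    max_length=512)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))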

Thanks @MoritzLaurer and @echatzikyriakidis for the great points on this topic, very helpful.

Thanks

@echatzikyriakidis you could also use a divide-and-conquer approach, like the one we recently published at IEEE TASLP: A Divide-and-Conquer Approach to the Summarization of Long Documents - IEEE Journals & Magazine

Hi All,

I am trying to solve the following problem: text summarization of movies (which have human speech-to-text plus non-human sounds).
I am currently feeding the input text (human speech + non-human sounds) to extractive / abstractive summarization models. I observed that extractive summarization gives me an overview of the movie but lacks specifics. I am unable to use the abstractive model on the full text since the input is > 512 tokens. For abstractive, if I go scene by scene it gives me a good output. However, I am unable to draw meaningful summaries from either!
Are there any other abstractive models I could use? (I used Google T5.) Other suggestions are also welcome. Thanks!
