Summarization pipeline on long text

Hi everyone,

I want to summarize long texts and would appreciate some suggestions on how to approach this.

I have tested the following code:

import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384-arxiv")
tokenizer = LEDTokenizer.from_pretrained("allenai/led-large-16384-arxiv")

text = "..."  # the transcript to summarize (~14k tokens in my case)
inputs = tokenizer.encode(text, return_tensors="pt")

# Global attention on the first token (cf. Beltagy et al. 2020)
global_attention_mask = torch.zeros_like(inputs)
global_attention_mask[:, 0] = 1

# Cap the summary at a third of the input length
max_length = int(inputs.shape[1] / 3)

# Generate the summary
summary_ids = model.generate(inputs, global_attention_mask=global_attention_mask,
                             num_beams=3, max_length=max_length)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True,
                           clean_up_tokenization_spaces=True)

However, I am not satisfied with the results: the summary looks truncated, as if generation just stopped. My original text is about 14,000 tokens (a video transcript obtained with OpenAI Whisper). The other texts I want to summarize are also long; they are transcripts of YouTube videos of more than one hour each.
I do not know if there is some way to improve this:

  • Is there a better pipeline you would suggest?
  • Are there better models I could use for this task?
  • Or am I missing some step that would improve the pipeline? Some parameters to change at inference time? Some change in the tokenization (for example, splitting the text, as in the rough sketch below, or other ideas)?
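
To clarify what I mean by splitting, here is a rough, untested sketch. It reuses model, tokenizer, and inputs from the code above; the chunk size, overlap, and per-chunk max_length are arbitrary guesses, not tuned values:

# Rough sketch of the splitting idea: summarize overlapping chunks of the
# token ids, then concatenate (or re-summarize) the partial summaries.
chunk_size = 4096
stride = 3584  # 512-token overlap between consecutive chunks

input_ids = inputs[0]
partial_summaries = []
for start in range(0, input_ids.shape[0], stride):
    chunk = input_ids[start:start + chunk_size].unsqueeze(0)
    attention = torch.zeros_like(chunk)
    attention[:, 0] = 1  # global attention on the first token, as above
    ids = model.generate(chunk, global_attention_mask=attention,
                         num_beams=3, max_length=512)
    partial_summaries.append(tokenizer.decode(ids[0], skip_special_tokens=True))

summary = " ".join(partial_summaries)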

Thank you for your help

Salvatore


Hi @SalvatoreRaieli

You can use the official pipeline:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

model_name = "allenai/led-large-16384-arxiv"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# device=0 runs on the first GPU; use device=-1 for CPU
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device=0)

long_text = "Replace with the text you want to summarize."
generated_text = pipe(
    long_text,
    truncation=True,
    max_length=64,
    no_repeat_ngram_size=5,
    num_beams=3,
    early_stopping=True,
)

# The pipeline returns a list of dicts; the summary is under "generated_text"
print(generated_text[0]["generated_text"])

I recommend not using models trained on the arXiv or PubMed datasets, because those datasets split tokens on whitespace (the text is pre-tokenized).

You can try LongT5, Pegasus-X, LED, PRIMERA, etc. for long-document summarization.
You can also try summarization models fine-tuned on this dataset; that can make sense for your transcripts.
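
For example, a long-input model can be loaded through the same pipeline API. The checkpoint name below is just one LongT5 summarization fine-tune from the Hub; any LongT5, Pegasus-X, or PRIMERA fine-tune can be swapped in:

from transformers import pipeline

# One long-input summarization checkpoint from the Hub; swap in any
# LongT5 / Pegasus-X / PRIMERA fine-tune that fits your data.
summarizer = pipeline(
    "summarization",
    model="pszemraj/long-t5-tglobal-base-16384-book-summary",
    device=0,  # use device=-1 for CPU
)

result = summarizer(long_text, truncation=True, max_length=256, num_beams=3)
print(result[0]["summary_text"])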


Thank you for your reply!

I was considering that model because the downloaded videos are recordings of scientific conferences.

The transcripts have no punctuation (sadly, Whisper does not add punctuation). Since the texts are long, do you also have suggestions for a punctuation pipeline that works on long text?

The way punctuation is handled is model- and dataset-specific; you have to test various models and select the best one.
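
As one candidate to test, here is a minimal sketch that assumes the deepmultilingualpunctuation package and the oliverguhr/fullstop-punctuation-multilang-large checkpoint; other models may fit your domain better, and you should verify the behavior on your own long transcripts:

# pip install deepmultilingualpunctuation
from deepmultilingualpunctuation import PunctuationModel

# Assumes the fullstop multilingual checkpoint; test others for your domain
model = PunctuationModel(model="oliverguhr/fullstop-punctuation-multilang-large")
text = "my hour long transcript without any punctuation"
print(model.restore_punctuation(text))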


Ok, thank you.

Do you have any suggestions, or a model that has worked well in your hands?

It depends on your available resources:

  • if you need lightweight models, you can use the fine-tuned BART-based models from my HF repo
  • if you need bigger models, you should try the LongT5 and PRIMERA models from the Hub

Just test each model on a few examples to see whether the output makes sense, then choose the most appropriate one.
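
Something like this loop makes that comparison quick (the checkpoint names are placeholders; substitute the models you want to compare):

from transformers import pipeline

# Placeholder checkpoint names; substitute the models you want to compare
candidates = [
    "placeholder/bart-summarization-model",
    "placeholder/long-t5-summarization-model",
    "placeholder/primera-summarization-model",
]

sample = "One of your transcripts, or an excerpt from it."
for name in candidates:
    summarizer = pipeline("summarization", model=name, device=0)
    output = summarizer(sample, truncation=True, max_length=128)
    print(f"=== {name} ===\n{output[0]['summary_text']}\n")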


Thank you. I was checking your repo; it is not designed for punctuation, right? Is there a specific pipeline for punctuation, or do I have to fine-tune a model for that specific task?