I want to summarize long texts and would like some suggestions on how to do it.
I have tested the following code:
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration
model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384-arxiv")
tokenizer = LEDTokenizer.from_pretrained("allenai/led-large-16384-arxiv")
# text holds the long transcript to summarize
inputs = tokenizer.encode(text, return_tensors="pt")
# Global attention on the first token (cf. Beltagy et al. 2020)
global_attention_mask = torch.zeros_like(inputs)
global_attention_mask[:, 0] = 1
# Let the summary be up to a third of the input length
max_length = int(inputs.shape[1] / 3)
# Generate Summary
summary_ids = model.generate(inputs, global_attention_mask=global_attention_mask,
                             num_beams=3, max_length=max_length)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True,
                           clean_up_tokenization_spaces=True)
However, I am not satisfied with the results. The summary looks truncated, as if generation stopped early. My original text is 14000 tokens (a video transcript obtained with OpenAI Whisper). The other texts I want to use are also long (they are transcripts of YouTube videos of more than one hour each).
I do not know if there is some way to improve this:
Is there a better pipeline you would suggest?
Are there better models I can use for this task?
Or are there steps I am missing that would improve the pipeline? Some parameters to change at inference time? Some change in the tokenization (for example, splitting the text into chunks, as in the sketch below, or other ideas)?
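To make the "splitting" idea concrete, this is roughly what I have in mind: summarize fixed-size token chunks separately and join the partial summaries. It is only a sketch; the 4096-token chunk size and 512-token summary length are arbitrary choices on my part, and chunk boundaries may cut sentences in half.
# What I mean by "splitting" (reuses model, tokenizer and inputs from the code above)
chunk_size = 4096
partial_summaries = []
input_ids = inputs[0]  # inputs has shape (1, seq_len)
for start in range(0, input_ids.shape[0], chunk_size):
    chunk = input_ids[start:start + chunk_size].unsqueeze(0)
    chunk_global_attention = torch.zeros_like(chunk)
    chunk_global_attention[:, 0] = 1  # global attention on the first token of each chunk
    chunk_summary_ids = model.generate(chunk, global_attention_mask=chunk_global_attention,
                                       num_beams=3, max_length=512)
    partial_summaries.append(tokenizer.decode(chunk_summary_ids[0], skip_special_tokens=True))
chunked_summary = " ".join(partial_summaries)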
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
model_name = "allenai/led-large-16384-arxiv"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device=0)
long_text = "Replace with your text."
generated_text = pipe(
    long_text,
    truncation=True,
    max_length=64,
    no_repeat_ngram_size=5,
    num_beams=3,
    early_stopping=True,
)
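(The pipeline returns a list of dicts, so the summary text itself should be in generated_text[0]["generated_text"].)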
I recommend not using models trained on ArXiv or PubMed datasets, because they split tokens on white space.
You can try LongT5, Pegasus-X, LED, PRIMERA, etc. for long-document summarization.
You can also try summarization models fine-tuned on this dataset, which could make sense for your transcripts.
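Any of those models drops into the same pipeline by changing the checkpoint name. For example (a sketch only; google/long-t5-tglobal-base is just one possible choice, and since it is a pretraining-only checkpoint you would want a summarization fine-tuned variant from the Hub for good results out of the box):
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
# Checkpoint name is an assumption; check the Hub for fine-tuned LongT5 / Pegasus-X variants
model_name = "google/long-t5-tglobal-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device=0)
generated_text = pipe(
    long_text,
    truncation=True,
    max_length=256,
    no_repeat_ngram_size=5,
    num_beams=3,
    early_stopping=True,
)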
I was thinking of that model because the downloaded videos are scientific conference talks.
The transcripts have no punctuation (sadly, Whisper did not add punctuation). Since the texts are long, do you also have a suggestion for a punctuation-restoration pipeline for long text?
Thank you. I was checking your repo; it is not designed for punctuation, right? Is there a specific pipeline for punctuation, or do I have to fine-tune it for this specific task?
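To make the question concrete, this is the kind of thing I have in mind for restoring punctuation before summarizing. It is only a sketch on my side: the deepmultilingualpunctuation package and its PunctuationModel API are an assumption, and the manual word-chunking is just to keep each call short for very long transcripts.
# Sketch of a punctuation-restoration step for long transcripts (assumptions noted above)
from deepmultilingualpunctuation import PunctuationModel

punct_model = PunctuationModel()

def restore_long(text, words_per_chunk=400):
    # Split the transcript into word chunks so each restoration call stays short
    words = text.split()
    chunks = [" ".join(words[i:i + words_per_chunk])
              for i in range(0, len(words), words_per_chunk)]
    return " ".join(punct_model.restore_punctuation(chunk) for chunk in chunks)

punctuated_text = restore_long(long_text)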