I want to summarize long texts and would like some suggestions on how to do it.
I have tested the following code:
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration
model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384-arxiv")
tokenizer = LEDTokenizer.from_pretrained("allenai/led-large-16384-arxiv")
# text holds the long transcript to summarize
inputs = tokenizer.encode(text, return_tensors="pt")
# Global attention on the first token (cf. Beltagy et al. 2020)
global_attention_mask = torch.zeros_like(inputs)
global_attention_mask[:, 0] = 1
# Let the summary be up to a third of the input length
max_length = int(inputs.shape[1] / 3)
# Generate Summary
summary_ids = model.generate(inputs, global_attention_mask=global_attention_mask,
                             num_beams=3, max_length=max_length)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True,
                           clean_up_tokenization_spaces=True)
However, I am not satisfied with the results. The summary looks truncated, as if generation stopped early. My original text is 14000 tokens (a video transcript obtained with OpenAI Whisper). The other texts I want to use are also long (they are transcripts of YouTube videos of more than one hour each).
I do not know if there is some way to improve this:
Is there a better pipeline you would suggest?
Are there better models I can use for this task?
Or are there steps I am missing that would improve the pipeline? Some parameters to change at inference time? Some change in the tokenization (for example, splitting the text into chunks, as in the sketch below, or other ideas)?
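To make the "splitting" idea concrete, this is roughly what I have in mind: summarize fixed-size token chunks separately and join the partial summaries. It is only a sketch; the 4096-token chunk size and 512-token summary length are arbitrary choices on my part, and chunk boundaries may cut sentences in half.
# What I mean by "splitting" (reuses model, tokenizer and inputs from the code above)
chunk_size = 4096
partial_summaries = []
input_ids = inputs[0]  # inputs has shape (1, seq_len)
for start in range(0, input_ids.shape[0], chunk_size):
    chunk = input_ids[start:start + chunk_size].unsqueeze(0)
    chunk_global_attention = torch.zeros_like(chunk)
    chunk_global_attention[:, 0] = 1  # global attention on the first token of each chunk
    chunk_summary_ids = model.generate(chunk, global_attention_mask=chunk_global_attention,
                                       num_beams=3, max_length=512)
    partial_summaries.append(tokenizer.decode(chunk_summary_ids[0], skip_special_tokens=True))
chunked_summary = " ".join(partial_summaries)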
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
model_name = "allenai/led-large-16384-arxiv"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device=0)
long_text = "Replace with your text."
generated_text = pipe(
    long_text,
    truncation=True,
    max_length=64,
    no_repeat_ngram_size=5,
    num_beams=3,
    early_stopping=True,
)
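(The pipeline returns a list of dicts, so the summary text itself should be in generated_text[0]["generated_text"].)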
I recommend not using models trained on ArXiv or PubMed datasets, because they split tokens on white space.
You can try LongT5, Pegasus-X, LED, PRIMERA, etc. for long-document summarization.
You can also try summarization models fine-tuned on this dataset, which could make sense for your transcripts.
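Any of those models drops into the same pipeline by changing the checkpoint name. For example (a sketch only; google/long-t5-tglobal-base is just one possible choice, and since it is a pretraining-only checkpoint you would want a summarization fine-tuned variant from the Hub for good results out of the box):
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
# Checkpoint name is an assumption; check the Hub for fine-tuned LongT5 / Pegasus-X variants
model_name = "google/long-t5-tglobal-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device=0)
generated_text = pipe(
    long_text,
    truncation=True,
    max_length=256,
    no_repeat_ngram_size=5,
    num_beams=3,
    early_stopping=True,
)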
I was thinking of that model because the downloaded videos are scientific conference talks.
The transcripts have no punctuation (sadly, Whisper did not add punctuation). Since the texts are long, do you also have a suggestion for a punctuation-restoration pipeline for long text?
Thank you. I was checking your repo; it is not designed for punctuation, right? Is there a specific pipeline for punctuation, or do I have to fine-tune it for this specific task?
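To make the question concrete, this is the kind of thing I have in mind for restoring punctuation before summarizing. It is only a sketch on my side: the deepmultilingualpunctuation package and its PunctuationModel API are an assumption, and the manual word-chunking is just to keep each call short for very long transcripts.
# Sketch of a punctuation-restoration step for long transcripts (assumptions noted above)
from deepmultilingualpunctuation import PunctuationModel

punct_model = PunctuationModel()

def restore_long(text, words_per_chunk=400):
    # Split the transcript into word chunks so each restoration call stays short
    words = text.split()
    chunks = [" ".join(words[i:i + words_per_chunk])
              for i in range(0, len(words), words_per_chunk)]
    return " ".join(punct_model.restore_punctuation(chunk) for chunk in chunks)

punctuated_text = restore_long(long_text)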