How to generate samples of summaries with Pegasus?

I am a beginner here. I would like to generate some samples of abstractive text summaries using Pegasus. I used the code snippet from the Pegasus — transformers 4.3.0 documentation. However, I noticed the summary is the same every time I decode.

How should I generate distinct samples of summaries? Thank you in advance. 🙂

Here is an example of generating summaries with custom parameters.
I am only using a small subset of the available tweaks. For more info, I find this blog post very helpful:
How to generate text: using different decoding methods for language generation with Transformers.

#! pip install transformers
#! pip install datasets
#! pip install sentencepiece

from transformers import PegasusTokenizer, PegasusForConditionalGeneration
import datasets

model = PegasusForConditionalGeneration.from_pretrained("sshleifer/distill-pegasus-xsum-16-4")
tokenizer = PegasusTokenizer.from_pretrained("sshleifer/distill-pegasus-xsum-16-4")

# Download data samples
data = datasets.load_dataset("xsum", split="validation[:10]")

# Pick two examples
text2summarize_1 = data["document"][0]
text2summarize_2 = data["document"][3]

#print(text2summarize_1) 
#print(text2summarize_2)

def generate_for_sample(sample, **kwargs):
    """
    Returns decoded summary (code snippets from the docs)
    kwargs are passed on to the model's generate function
    """
    inputs = tokenizer(sample, truncation=True, max_length=1024, return_tensors='pt')
    summary_ids = model.generate(inputs['input_ids'], **kwargs)
    return [tokenizer.decode(g, 
                             skip_special_tokens=True, 
                             clean_up_tokenization_spaces=False) for g in summary_ids]

print("Summaries generated with default parameters:")
summary_1 = generate_for_sample(text2summarize_1)
summary_2 = generate_for_sample(text2summarize_2)
print("summary_1: {}".format(summary_1))
print("summary_2: {}".format(summary_2))
print("Some default parameter values: ", "num_beams={}, do_sample={}, top_k={}, top_p={}".
      format(model.config.num_beams, model.config.do_sample, model.config.top_k, model.config.top_p))

print("Summaries generated with custom parameter values:")
summary_1 = generate_for_sample(text2summarize_1, num_beams=4)
summary_2 = generate_for_sample(text2summarize_2, do_sample=True, top_k=10, top_p=0.8)
print("summary_1: {}".format(summary_1))
print("summary_2: {}".format(summary_2))

Output:

Summaries generated with default parameters:
summary_1: ['Apple has been accused of misleading customers in Australia over its new iPad.']
summary_2: ["The world's first marine energy system has been installed in the North Sea."]
Some default parameter values: num_beams=8, do_sample=False, top_k=50, top_p=1.0

Summaries generated with custom parameter values:
summary_1: ['Apple is facing legal action in Australia over its new iPad with wi-fi and 4G.']
summary_2: ['A marine energy system has been installed in the North Sea for the first time.']
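
To answer the original question directly: to get distinct summaries for the same input, enable sampling and ask generate for several sequences at once. With the default do_sample=False, generation is deterministic, which is why repeated calls always give the same summary. Here is a minimal sketch reusing the generate_for_sample helper above (num_return_sequences and the sampling parameters are standard generate arguments; the specific values are just examples, not tuned settings):

distinct_summaries = generate_for_sample(
    text2summarize_1,
    do_sample=True,           # sample tokens instead of deterministic beam search
    top_k=50,                 # only sample from the 50 most likely tokens
    top_p=0.95,               # nucleus sampling: keep the top 95% probability mass
    num_return_sequences=3,   # return three independently sampled summaries
)
for i, s in enumerate(distinct_summaries):
    print("sample_{}: {}".format(i, s))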

Thank you so much! I really appreciate your help.

Pegasus is a state-of-the-art abstractive text summarization model developed by Google Research. Generating summaries using Pegasus typically involves fine-tuning the pre-trained model on a specific summarization dataset and then using the fine-tuned model to generate summaries for new input text.

Here’s a high-level overview of how you can generate summaries using Pegasus:

Preprocessing:
Prepare your data in a suitable format for fine-tuning, ensuring it aligns with the requirements of the Pegasus model (e.g., tokenization and formatting).
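
For example, here is a minimal preprocessing sketch with the datasets library. It assumes an XSum-style dataset with "document" and "summary" columns, the "google/pegasus-xsum" checkpoint, and a recent transformers version that supports the text_target argument:

from transformers import PegasusTokenizer
import datasets

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
train_data = datasets.load_dataset("xsum", split="train[:100]")

def preprocess(batch):
    # Tokenize the articles (model inputs) and the reference summaries (labels)
    model_inputs = tokenizer(batch["document"], truncation=True, max_length=1024)
    labels = tokenizer(text_target=batch["summary"], truncation=True, max_length=64)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = train_data.map(preprocess, batched=True,
                           remove_columns=train_data.column_names)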

Fine-tuning:
a. Fine-tune the pre-trained Pegasus model on your summarization dataset.
b. During fine-tuning, use the provided summaries in your dataset as targets for the model to learn from.
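
Continuing the preprocessing sketch above, here is a minimal fine-tuning sketch with Seq2SeqTrainer. The output directory and hyperparameters are placeholders, not recommended settings:

from transformers import (PegasusForConditionalGeneration, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

training_args = Seq2SeqTrainingArguments(
    output_dir="pegasus-xsum-finetuned",  # placeholder path
    per_device_train_batch_size=1,
    num_train_epochs=1,
    learning_rate=5e-5,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,  # the tokenized dataset from the preprocessing step
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("pegasus-xsum-finetuned")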

Generating Summaries:
After fine-tuning, you can use the fine-tuned Pegasus model to generate summaries for new input text. The model will generate abstractive summaries based on the training it received during fine-tuning.
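
Generation with the fine-tuned checkpoint then looks just like generation with the pre-trained one. In this sketch, the path matches the placeholder output_dir used in the fine-tuning step:

model = PegasusForConditionalGeneration.from_pretrained("pegasus-xsum-finetuned")

new_text = "Replace this with the article you want to summarize."
inputs = tokenizer(new_text, truncation=True, max_length=1024, return_tensors="pt")
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=64)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))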