I am a beginner here. I would like to generate several samples of abstractive text summaries using Pegasus. I used the code snippet from the Pegasus — transformers 4.3.0 documentation. However, I realized the summary is the same every time I decode.
How should I generate distinct samples of summaries? Thank you in advance.
#! pip install transformers
#! pip install datasets
#! pip install sentencepiece
from transformers import PegasusTokenizer, PegasusForConditionalGeneration
import datasets
model = PegasusForConditionalGeneration.from_pretrained("sshleifer/distill-pegasus-xsum-16-4")
tokenizer = PegasusTokenizer.from_pretrained("sshleifer/distill-pegasus-xsum-16-4")
# Download data samples
data = datasets.load_dataset("xsum", split="validation[:10]")
# Pick two examples
text2summarize_1 = data["document"][0]
text2summarize_2 = data["document"][3]
#print(text2summarize_1)
#print(text2summarize_2)
def generate_for_sample(sample, **kwargs):
    """
    Returns the decoded summary (code snippet from the docs).
    kwargs are passed on to the model's generate function.
    """
    inputs = tokenizer(sample, truncation=True, max_length=1024, return_tensors='pt')
    summary_ids = model.generate(inputs['input_ids'], **kwargs)
    return [tokenizer.decode(g,
                             skip_special_tokens=True,
                             clean_up_tokenization_spaces=False) for g in summary_ids]
print("Summaries generated with default parameters:")
summary_1 = generate_for_sample(text2summarize_1)
summary_2 = generate_for_sample(text2summarize_2)
print("summary_1: {}".format(summary_1))
print("summary_2: {}".format(summary_2))
print("Some default parameter values: num_beams={}, do_sample={}, top_k={}, top_p={}".format(
    model.config.num_beams, model.config.do_sample, model.config.top_k, model.config.top_p))
print("Summaries generated with custom parameter values:")
summary_1 = generate_for_sample(text2summarize_1, num_beams=4)
summary_2 = generate_for_sample(text2summarize_2, do_sample=True, top_k=10, top_p=0.8)
print("summary_1: {}".format(summary_1))
print("summary_2: {}".format(summary_2))
Output:
Summaries generated with default parameters:
summary_1: ['Apple has been accused of misleading customers in Australia over its new iPad.']
summary_2: ["The world's first marine energy system has been installed in the North Sea."]
Some default parameter values: num_beams=8, do_sample=False, top_k=50, top_p=1.0
Summaries generated with custom parameter values:
summary_1: ['Apple is facing legal action in Australia over its new iPad with wi-fi and 4G.']
summary_2: ['A marine energy system has been installed in the North Sea for the first time.']
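The default settings use beam search with do_sample=False, which is deterministic: decoding will return the same summary on every call. To get several distinct summaries in one call, you can pass do_sample=True together with num_return_sequences to generate (both are standard generate arguments). A minimal sketch, reusing the model and tokenizer loaded above (the helper name and default values are my own):

```python
def sample_summaries(model, tokenizer, text, n=3, **kwargs):
    """Return n sampled summaries for one input text.
    With do_sample=True each returned sequence is drawn stochastically,
    so the n summaries (and repeated calls) can all differ."""
    inputs = tokenizer(text, truncation=True, max_length=1024, return_tensors="pt")
    summary_ids = model.generate(
        inputs["input_ids"],
        do_sample=True,          # sample instead of deterministic beam search
        num_beams=1,             # plain sampling, overriding the model's default of 8
        top_k=50,                # consider only the 50 most likely tokens ...
        top_p=0.95,              # ... restricted to the top 0.95 probability mass
        num_return_sequences=n,  # draw n independent samples
        **kwargs,
    )
    return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)

# e.g. sample_summaries(model, tokenizer, text2summarize_1, n=3)
```

Note that batch_decode replaces the list comprehension over tokenizer.decode, and that fixing a seed (torch.manual_seed) would make the sampled outputs reproducible again if you want that.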
Pegasus is a state-of-the-art abstractive text summarization model developed by Google Research. Generating summaries using Pegasus typically involves fine-tuning the pre-trained model on a specific summarization dataset and then using the fine-tuned model to generate summaries for new input text.
Here’s a high-level overview of how you can generate summaries using Pegasus:
1. Preprocessing:
Prepare your data in a format suitable for fine-tuning, ensuring it aligns with the requirements of the Pegasus model (e.g., tokenization and formatting).
2. Fine-tuning:
a. Fine-tune the pre-trained Pegasus model on your summarization dataset.
b. During fine-tuning, use the reference summaries in your dataset as targets for the model to learn from.
3. Generating Summaries:
After fine-tuning, you can use the fine-tuned Pegasus model to generate summaries for new input text. The model will produce abstractive summaries based on what it learned during fine-tuning.
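For intuition on what the sampling parameters seen earlier (top_k, top_p) actually do at generation time, here is a toy, pure-Python sketch of nucleus (top-p) filtering on a made-up next-token distribution — a simplified illustration of the idea, not the transformers implementation:

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p; zero out the rest and renormalize."""
    # Sort token indices by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

# Toy next-token distribution over 5 hypothetical tokens.
probs = [0.5, 0.2, 0.15, 0.1, 0.05]
filtered = top_p_filter(probs, top_p=0.8)
# Only the first three tokens survive (0.5 + 0.2 + 0.15 >= 0.8);
# the model then samples the next token from this reduced distribution.
```

With top_p=1.0 (the default shown in the output above) nothing is filtered, so the whole vocabulary remains available; smaller values cut off the low-probability tail and make sampled summaries less erratic.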