BART Paraphrasing

I’ve been using BART for summarization, and I’ve noticed that some of the outputs resemble paraphrases.

Is there a way to build on this and use the model primarily for paraphrasing?

from transformers import BartTokenizer, BartForConditionalGeneration
import torch

model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
device = torch.device('cpu')
model = model.to(device)

text = "At the core of the United States' mismanagement of the Coronavirus lies its distrust of science"

# Clean up the input. (The "summarize: " prefix is a T5 convention;
# BART does not need a task prefix, this is left over from the example I started with.)
preprocessed_text = text.strip().replace("\n", "")
prepared_text = "summarize: " + preprocessed_text
print("original text preprocessed:\n", preprocessed_text)

tokenized_text = tokenizer.encode(prepared_text, return_tensors="pt").to(device)

# Beam search over the encoded input; return the two best beams.
summary_ids = model.generate(tokenized_text,
                             num_beams=10,
                             no_repeat_ngram_size=1,
                             min_length=10,
                             max_length=20,
                             num_return_sequences=2,
                             top_k=100,
                             early_stopping=True)

output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
output1 = tokenizer.decode(summary_ids[1], skip_special_tokens=True)

Summarized Text: The United States' mismanagement of the Coronavirus is rooted in its distrust of science.

I’d also like to note that when I set num_return_sequences > 1, the returned sequences are identical. That makes some sense, but is there a way to get distinct outputs? I don’t believe a random seed option is built into BART.


Hi @zanderbush, sure, BART should also work for paraphrasing. Just fine-tune it on a paraphrasing dataset.
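A minimal fine-tuning sketch could look like the following. The sentence pairs and hyperparameters below are just placeholders; in practice you’d train on a real paraphrase dataset such as Quora Question Pairs or PAWS, and batch the inputs properly.

import torch
from torch.optim import AdamW
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')
model.train()

# Placeholder (source, paraphrase) pairs; replace with a real paraphrase dataset.
pairs = [
    ("How do I learn Python quickly?", "What is the fastest way to learn Python?"),
    ("The meeting was postponed to Friday.", "They moved the meeting to Friday."),
]

optimizer = AdamW(model.parameters(), lr=3e-5)

for source, target in pairs:
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=128)
    labels = tokenizer(target, return_tensors="pt", truncation=True, max_length=128).input_ids
    # Set padding token ids in the labels to -100 so they are ignored by the loss.
    labels[labels == tokenizer.pad_token_id] = -100

    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()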

There’s a small mistake in the way you are using .generate. If you want to do sampling, set do_sample=True (and leave num_beams at 1); for beam search, set do_sample=False and num_beams to a value greater than 1. Note that top_k and top_p only have an effect when sampling. This post explains how to use generate.
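For example, reusing the model, tokenizer, and tokenized_text from your snippet (the parameter values here are only illustrative), sampling with num_return_sequences > 1 should give you distinct outputs:

sample_ids = model.generate(tokenized_text,
                            do_sample=True,          # sample instead of beam search
                            top_k=100,               # sample only from the 100 most likely tokens
                            top_p=0.95,              # nucleus sampling
                            min_length=10,
                            max_length=20,
                            num_return_sequences=2)

for ids in sample_ids:
    print(tokenizer.decode(ids, skip_special_tokens=True))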


Thank you! I appreciate your help. I hope I am not probing too much, but I want to make sure I am doing this project in an efficient manner. Would T5 be better suited for paraphrasing?

I’ve only tried T5 for paraphrasing, but BART should also work; you’ll need to experiment and see what works best for your goal. Here’s a T5 paraphrasing project


Hello @zanderbush, do you have an example of how you implemented the training and data pre-processing for this task? I’ve been looking around, but I’m struggling to find good examples on the topic. Thank you in advance!

Jumping on this thread a bit: I’m wondering whether a form of paraphrasing might be possible just by doing within-language “translation” and using sampling (top_k, top_p, temperature) to reduce the likelihood of simply reproducing the input verbatim. Does that make sense, in lieu of an actual paraphrasing dataset? I’m asking because a dataset doesn’t exist for my use case. Concretely, I mean something like the sketch below.
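The checkpoint and sampling values here are assumptions, not a tested recipe; whether the sampled outputs actually count as faithful paraphrases is exactly the open question.

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

sentence = "The committee postponed its decision until next week."
input_ids = tokenizer(sentence, return_tensors="pt").input_ids

# Sample several candidates; higher temperature and nucleus sampling push the
# model away from reproducing the input verbatim, at some cost in fidelity.
candidate_ids = model.generate(input_ids,
                               do_sample=True,
                               temperature=1.2,
                               top_p=0.9,
                               max_length=40,
                               num_return_sequences=3)

for ids in candidate_ids:
    print(tokenizer.decode(ids, skip_special_tokens=True))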

Thanks in advance.

Oh, also @valhalla, would it be possible to re-post the T5 paraphrasing project? That link has died.