BART Paraphrasing

I’ve been using BART for summarization, and I’ve noticed that some of the outputs resemble paraphrases.

Is there a way to build on this and use the model primarily for paraphrasing?

from transformers import BartTokenizer, BartForConditionalGeneration
import torch

model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
device = torch.device('cpu')
model = model.to(device)

text = "At the core of the United States' mismanagement of the Coronavirus lies its distrust of science"

# Clean up the input. (The "summarize: " prefix is a T5 convention;
# BART does not need a task prefix, this is left over from the example I started with.)
preprocessed_text = text.strip().replace("\n", "")
prepared_text = "summarize: " + preprocessed_text
print("original text preprocessed:\n", preprocessed_text)

tokenized_text = tokenizer.encode(prepared_text, return_tensors="pt").to(device)

# Beam search over the encoded input; return the two best beams.
summary_ids = model.generate(tokenized_text,
                             num_beams=10,
                             no_repeat_ngram_size=1,
                             min_length=10,
                             max_length=20,
                             num_return_sequences=2,
                             top_k=100,
                             early_stopping=True)

output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
output1 = tokenizer.decode(summary_ids[1], skip_special_tokens=True)

Summarized Text: The United States' mismanagement of the Coronavirus is rooted in its distrust of science.

I’d also like to note that when I set num_return_sequences > 1, the returned sequences are identical. That makes some sense, but is there a way to get distinct outputs? I don’t believe a random seed option is built into BART.


Hi @zanderbush, sure, BART should also work for paraphrasing. Just fine-tune it on a paraphrasing dataset.
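A minimal fine-tuning sketch could look like the following. The sentence pairs and hyperparameters below are just placeholders; in practice you’d train on a real paraphrase dataset such as Quora Question Pairs or PAWS, and batch the inputs properly.

import torch
from torch.optim import AdamW
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')
model.train()

# Placeholder (source, paraphrase) pairs; replace with a real paraphrase dataset.
pairs = [
    ("How do I learn Python quickly?", "What is the fastest way to learn Python?"),
    ("The meeting was postponed to Friday.", "They moved the meeting to Friday."),
]

optimizer = AdamW(model.parameters(), lr=3e-5)

for source, target in pairs:
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=128)
    labels = tokenizer(target, return_tensors="pt", truncation=True, max_length=128).input_ids
    # Set padding token ids in the labels to -100 so they are ignored by the loss.
    labels[labels == tokenizer.pad_token_id] = -100

    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()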

There’s a small mistake in the way you are using .generate. If you want to do sampling, set do_sample=True (and leave num_beams at 1); for beam search, set do_sample=False and num_beams to a value greater than 1. Note that top_k and top_p only have an effect when sampling. This post explains how to use generate.
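For example, reusing the model, tokenizer, and tokenized_text from your snippet (the parameter values here are only illustrative), sampling with num_return_sequences > 1 should give you distinct outputs:

sample_ids = model.generate(tokenized_text,
                            do_sample=True,          # sample instead of beam search
                            top_k=100,               # sample only from the 100 most likely tokens
                            top_p=0.95,              # nucleus sampling
                            min_length=10,
                            max_length=20,
                            num_return_sequences=2)

for ids in sample_ids:
    print(tokenizer.decode(ids, skip_special_tokens=True))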


Thank you! I appreciate your help. I hope I am not probing too much, but I want to make sure I am doing this project in an efficient manner. Would T5 be better suited for paraphrasing?

I’ve only tried T5 for paraphrasing, but BART should also work; you’ll need to experiment and see what works best for your goal. Here’s a T5 paraphrasing project


Hello @zanderbush, do you have an example of how you implemented the training and data pre-processing for this task? I’ve been looking around, but I’m struggling to find good examples on the topic. Thank you in advance!

Jumping on this thread a bit: I’m wondering whether a form of paraphrasing might be possible just by doing within-language “translation” and using sampling (top_k, top_p, temperature) to reduce the likelihood of simply reproducing the input verbatim. Does that make sense, in lieu of an actual paraphrasing dataset? I’m asking because a dataset doesn’t exist for my use case. Concretely, I mean something like the sketch below.
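The checkpoint and sampling values here are assumptions, not a tested recipe; whether the sampled outputs actually count as faithful paraphrases is exactly the open question.

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

sentence = "The committee postponed its decision until next week."
input_ids = tokenizer(sentence, return_tensors="pt").input_ids

# Sample several candidates; higher temperature and nucleus sampling push the
# model away from reproducing the input verbatim, at some cost in fidelity.
candidate_ids = model.generate(input_ids,
                               do_sample=True,
                               temperature=1.2,
                               top_p=0.9,
                               max_length=40,
                               num_return_sequences=3)

for ids in candidate_ids:
    print(tokenizer.decode(ids, skip_special_tokens=True))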

Thanks in advance.

Oh, also @valhalla, would it be possible to re-post the T5 paraphrasing project? That link has died.