Calculate the probability of a given sequence for a seq2seq model

Given a seq2seq paraphrase model pp_model, a tokenizer pp_tokenizer, an input sentence text, and a few pre-determined paraphrases pp_1, pp_2, pp_3:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

pp_model = AutoModelForSeq2SeqLM.from_pretrained("tuner007/pegasus_paraphrase")
pp_tokenizer = AutoTokenizer.from_pretrained("tuner007/pegasus_paraphrase")

text  = "I like to go to the beach."
pp_1  = "I really enjoy going to the beach."
pp_2  = "The beach is somewhere I like to go."
pp_3  = "I like driving to the beach and watching the waves flow."

How can I calculate the probability that pp_model generates each of these paraphrases, given text as input?
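For reference, here is a minimal sketch of one way this could be done. It assumes teacher forcing: the paraphrase is passed as labels, the model returns one logit vector per target position, and the per-token log-probabilities are summed. The function names sequence_log_prob and paraphrase_log_probs are made up for illustration, and the sketch assumes a transformers version whose tokenizer accepts the text_target argument.

```python
import torch
import torch.nn.functional as F


def sequence_log_prob(logits, labels, pad_token_id):
    """Sum the log-probabilities of the label tokens, ignoring padding.

    logits: float tensor of shape (batch, tgt_len, vocab)
    labels: long tensor of shape (batch, tgt_len)
    Returns a tensor of shape (batch,) holding log p(labels | input).
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Pick out the log-probability assigned to each actual target token.
    token_log_probs = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    # Padding positions must not contribute to the sequence score.
    mask = labels.ne(pad_token_id)
    return token_log_probs.masked_fill(~mask, 0.0).sum(dim=-1)


def paraphrase_log_probs(model, tokenizer, text, paraphrases):
    """Hypothetical helper: score each paraphrase of `text` under `model`."""
    enc = tokenizer([text] * len(paraphrases), return_tensors="pt", padding=True)
    labels = tokenizer(
        text_target=paraphrases, return_tensors="pt", padding=True
    ).input_ids
    with torch.no_grad():
        # Passing labels lets the model build the shifted decoder inputs itself,
        # so logits[:, t] is the prediction for labels[:, t].
        logits = model(**enc, labels=labels).logits
    return sequence_log_prob(logits, labels, tokenizer.pad_token_id)
```

With the objects above, paraphrase_log_probs(pp_model, pp_tokenizer, text, [pp_1, pp_2, pp_3]) should return one log-probability per paraphrase; calling .exp() on the result gives the probabilities. These are sums over tokens, so longer paraphrases will tend to score lower unless you normalise by length.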

For context, I need this to work out the KL-divergence between a model and a reference model, using the formula KL = E_{x \sim p_{model}} [\log p_{model}(x) - \log p_{refmodel}(x)] (e.g. as done here).
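To spell out what this would require, here is a sketch in LaTeX. The notation is my own assumption and not from the linked source: s is the input text, x_i are N paraphrases sampled from the model, and T is the number of tokens in x.

```latex
% Sequence log-probability as a sum over target tokens (chain rule)
\log p(x \mid s) = \sum_{t=1}^{T} \log p(x_t \mid x_{<t}, s)

% Monte Carlo estimate of the KL term from N samples drawn from the model
\widehat{KL} = \frac{1}{N} \sum_{i=1}^{N}
  \left[ \log p_{model}(x_i \mid s) - \log p_{refmodel}(x_i \mid s) \right],
  \qquad x_i \sim p_{model}(\cdot \mid s)
```

One caveat: the expectation is over sequences drawn from p_{model}, so an estimate of this form is only meaningful when the x_i are actually sampled from the model. Scoring fixed paraphrases such as pp_1, pp_2, pp_3 gives their log-probabilities under each model, but averaging those differences would not by itself estimate the KL-divergence.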
