Pegasus Questions

Yes, and PegasusTokenizer.from_pretrained('google/pegasus-arxiv') should truncate the input for you.

About 1-5% of the time I run Pegasus ('google/pegasus-xsum') I get repeating strings of various lengths. I’ve seen this before on other seq2seq models using LSTMs, but is there a way to avoid this behavior? I assume I’m not the first to notice this with Pegasus.

You’re the first one to report it! Could we see an example input and output?

Maybe I’m doing something wrong then. Here are a couple of examples:
Let me know if you get something different.

text_input:
jobs on hold. i do apologize about the wait. how can i help out. i think i figured it out now. i was trying to make a report of all jobs on hold. but thank you. happy wednesday

Pegasus output:
jobs on hold, jobs on hold, jobs on hold, jobs on hold, jobs on hold, jobs on hold, jobs on hold, jobs on hold, jobs on hold, jobs on

text_input:
invoice issue. how can i help you today. why do i see balance. next to bill to. got ya. you seeing that balance because the job that payment is tied to is completed yet. once the job is complete the balance will go down to. have a good one. you as well

Pegasus output:
how can i help you with an invoice issue, how can i help you with an invoice issue, how can i help you with an invoice issue, how can i help you with an invoice

text_input:
quickbooks desktop integration. how may i help you today. yes i need help. may i know what is occurring today. sorry going to have to chat back

Pegasus output:
quickbooks desktop integration - how may i help you today.,,,,,,,,,,,,,,,,,,,,,,,,,,

try passing min_length=0 to generate

Thanks, that definitely helped minimize the cases. I usually have it set to 12-15 because that gives better, more meaningful summaries. I guess the tradeoff is that the model sometimes produces the repeated strings.

Hi,

What does google/pegasus-pubmed do if you feed it an input longer than 1024 words like a biomedical article?

Thanks! I added min_length=0 to the tokenizer.batch_decode(…, min_length=0) call. Is this the right parameter to be setting? It doesn’t seem to fix the repeated output I see. I’m wondering where else I can investigate (I’m relatively new to customizing models…). Thank you for your help!

ex. Pegasus Summary
[“Today we’re talking about infrastructure, we’re talking about infrastructure, we’re talking about infrastructure, we’re talking about infrastructure, we’re talking about infrastructure, we’re talking about infrastructure, we’re talking about infrastructure, we’re talking about infrastructure, we’re talking about”]

I’m trying to understand exactly the differences among the pre-trained Pegasus models.

As far as I understood:

  • models like google/pegasus-* (e.g. google/pegasus-xsum) are base models
  • all base models are fine-tuned on a dataset (e.g. xsum in the previous example)
  • google/pegasus-large is only pretrained (on C4 and Newsroom?)
  • sshleifer/distill-pegasus-* and sshleifer/student_pegasus-* are distilled models
  • google/bigbird-pegasus-large-* use the bigbird attention mechanism.

My questions are the following:

  1. Is my understanding correct?
  2. Is there any way to get a base Pegasus which is not fine-tuned on a downstream dataset?
  3. Is google/pegasus-multi_news multilingual?
  4. As for the distilled models, what is the difference between a distill-* and a student-* model? And what do the two numbers represent (e.g. in sshleifer/distill-pegasus-cnn-16-4)?

Thank you very much and thanks for all the work.

Hello, my question is: can we control the length of the output summary? Is there any parameter that controls the length? Can we produce summaries longer than the max_length parameter?
Currently I’m using ‘google/pegasus-multi_news’.

Thanks in advance