Pre-train PEGASUS model from scratch

Hi @sgugger ,
I want to pre-train a PEGASUS model from scratch. Can you give me some suggestions?

First, can I follow this approach (How to train a new language model from scratch using Transformers and Tokenizers) to train the model from scratch?
Secondly, how can I control the <mask_1> tokens to mask whole sentences (the GSG objective)? And how can I specify the sentence-masking strategy, like the one selected in the PEGASUS paper?

Thank you very much!

Can anyone give me some suggestions?

I think you can start from here to fine-tune Pegasus on the language you want. You should choose causal language modeling for Pegasus, if I'm not mistaken.

I don’t think so.
Besides an MLM objective like the one used by BERT-based models, PEGASUS has another special training objective called GSG (Gap Sentences Generation), and that is what makes it powerful for abstractive text summarization.

I want to achieve that powerful model for my language :slight_smile:

I don't think pre-training Pegasus is supported yet.

I would also be interested in this :slight_smile:

@patrickvonplaten
To implement the PEGASUS pretraining objective ourselves, could we follow the same approach you suggested for mBART?
Adapted to the objective presented in the paper, it would become:

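from transformers import PegasusTokenizer, PegasusForConditionalGeneration, PegasusConfig

# "google/pegasus" is not a published checkpoint name; "google/pegasus-large" is one that exists
tok = PegasusTokenizer.from_pretrained("google/pegasus-large")
# Randomly initialised model; the vocabulary size must match the tokenizer
model = PegasusForConditionalGeneration(PegasusConfig(vocab_size=len(tok)))

# <mask_1> masks a whole sentence (GSG), <mask_2> masks individual tokens (MLM)
input_string = "Pegasus is <mask_2> . <mask_1> it <mask_2> the model ."
# Pegasus has no <s> / <eos>: the decoder starts from <pad> and sequences end with </s>
decoder_input_string = "<pad> It is pure white ."
labels_string = "It is pure white . </s>"

input_ids = tok(input_string, add_special_tokens=False, return_tensors="pt").input_ids
decoder_input_ids = tok(decoder_input_string, add_special_tokens=False, return_tensors="pt").input_ids
labels = tok(labels_string, add_special_tokens=False, return_tensors="pt").input_ids

# With labels passed, the model output carries the cross-entropy loss
loss = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels).loss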

Naturally, to automate this for pretraining, one would have to implement the mask selection procedure for the dataset (top ROUGE sentences).
Then, the loss we compute would go into the method _step.
Is this reasonable, or am I missing something?
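
For the mask selection step, here is a rough sketch of what I have in mind. It is only an approximation of the paper's procedure, not a reference implementation: each sentence is scored independently by a hand-rolled ROUGE-1 F1 against the rest of the document, and the top-scoring ones are replaced by <mask_1>. The function names (split_sentences, rouge1_f1, gsg_example), the gap_ratio parameter with its default value, and the naive regex sentence splitter are all my own assumptions for illustration.

```python
import re
from collections import Counter

MASK_SENT = "<mask_1>"  # Pegasus sentence-level mask token


def split_sentences(text):
    # Assumption: a naive splitter on ., ! and ?; a real pipeline would use a proper one.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def rouge1_f1(candidate, reference):
    # Simplified ROUGE-1 F1 on lowercased word tokens, with clipped unigram counts.
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def gsg_example(text, gap_ratio=0.3):
    # gap_ratio is a hypothetical knob: the fraction of sentences to mask.
    sentences = split_sentences(text)
    n_gaps = max(1, round(len(sentences) * gap_ratio))
    scores = []
    for i, sent in enumerate(sentences):
        # Score each sentence independently against the rest of the document.
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scores.append(rouge1_f1(sent, rest))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:n_gaps])
    # Encoder input: selected sentences replaced by the sentence mask token.
    source = " ".join(MASK_SENT if i in selected else s for i, s in enumerate(sentences))
    # Decoder target: the masked sentences, in their original order.
    target = " ".join(sentences[i] for i in sorted(selected))
    return source, target


doc = "Pegasus is a white winged horse. The horse is white. Pegasus is winged."
source, target = gsg_example(doc)
print(source)
print(target)
```

The resulting source and target strings would then be tokenized as in my snippet above, with the target used to build the labels. The token-level <mask_2> masking is not covered here.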