Pre-train PEGASUS model from scratch

ithieund · March 18, 2021, 12:04am

Hi @sgugger ,
I want to pre-train a PEGASUS model from scratch. Can you give me some suggestions?

First, can I follow this approach (How to train a new language model from scratch using Transformers and Tokenizers) to train the model from scratch?
Secondly, how can I control the <mask_1> tokens to mask sentences (GSG objective)? And how can I specify the sentence-masking strategy, like the one selected in the PEGASUS paper?

Thank you very much!

ithieund · March 21, 2021, 5:38pm

Can anyone give me some suggestions?

eddie96 · March 22, 2021, 3:47pm

https://github.com/huggingface/transformers/tree/master/examples/language-modeling

I think you can start from here to fine-tune Pegasus on the language you want. You should choose causal language modeling for Pegasus, if I'm not mistaken.

ithieund · March 23, 2021, 4:34am

I don't think so.
Besides an MLM objective like that of BERT-based models, PEGASUS has another special training objective called GSG (Gap Sentences Generation), and that is what makes it powerful for abstractive text summarization.

I want to achieve a model that powerful for my language.
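
To make the GSG idea concrete, here is a minimal sketch of how one training pair could be built from a document. It only assumes that <mask_1> is the sentence-level mask token; the function name make_gsg_example and the choice of which sentences to mask are hypothetical and purely for illustration, not the paper's actual pipeline.

```python
MASK_SENTENCE = "<mask_1>"  # sentence-level mask token used for GSG


def make_gsg_example(sentences, gap_indices):
    """Build one (source, target) text pair for gap-sentence generation.

    Each sentence whose index is in gap_indices is replaced by <mask_1>
    in the source; the removed sentences, in document order, form the target.
    """
    gap = set(gap_indices)
    source = " ".join(
        MASK_SENTENCE if i in gap else s for i, s in enumerate(sentences)
    )
    target = " ".join(s for i, s in enumerate(sentences) if i in gap)
    return source, target


doc = ["Pegasus is mythical .", "It is pure white .", "It names the model ."]
source, target = make_gsg_example(doc, gap_indices=[1])
print(source)  # Pegasus is mythical . <mask_1> It names the model .
print(target)  # It is pure white .
```

The source string goes to the encoder and the target string is what the decoder has to generate, which is why the objective resembles summarization.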

jominmathew · March 23, 2021, 5:24am

I don't think pre-training Pegasus is supported yet.

Skylixia · April 16, 2021, 7:59pm

I would also be interested in this.

Skylixia · April 19, 2021, 7:00pm

@patrickvonplaten
In order to implement the PEGASUS pretraining objective ourselves, could we follow the same approach you suggested for mBART?
Adapted to the objective presented in the paper, it would become:

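from transformers import PegasusTokenizer, PegasusForConditionalGeneration, PegasusConfig

# Tokenizer taken from an existing checkpoint, e.g. google/pegasus-large
tok = PegasusTokenizer.from_pretrained("google/pegasus-large")
# Randomly initialised model; vocab_size has to match the tokenizer
model = PegasusForConditionalGeneration(PegasusConfig(vocab_size=len(tok)))

# <mask_1> masks a whole sentence (GSG), <mask_2> masks single tokens (MLM)
input_string = "Pegasus is <mask_2> . <mask_1> it <mask_2> the model ."
# PEGASUS uses <pad> as the decoder start token and </s> as EOS
decoder_input_string = "<pad> It is pure white ."
labels_string = "It is pure white . </s>"

input_ids = tok(input_string, add_special_tokens=False, return_tensors="pt").input_ids
decoder_input_ids = tok(decoder_input_string, add_special_tokens=False, return_tensors="pt").input_ids
labels = tok(labels_string, add_special_tokens=False, return_tensors="pt").input_ids

# With labels passed, the first element of the output is the loss
loss = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels)[0]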

Naturally, to automate this for pretraining, one should implement the mask selection procedure over the dataset (top-ROUGE sentences).
Then the loss we compute would go into the _step method.
Is this reasonable, or am I missing something?
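
For the mask selection step, something along these lines might work as a starting point. It is a rough sketch, not the paper's implementation: it scores each sentence independently against the rest of the document with a simplified unigram-overlap F1 (whitespace tokens, no stemming) and keeps the top-scoring ones. The names rouge1_f1 and select_gap_sentences and the gap_ratio parameter are made up for illustration.

```python
from collections import Counter


def rouge1_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1 between two token lists (a simplified ROUGE-1)."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def select_gap_sentences(sentences, gap_ratio=0.3):
    """Return indices (in document order) of the sentences to mask.

    Each sentence is scored on its own against all remaining sentences,
    and the highest-scoring ones are selected.
    """
    tokenized = [s.lower().split() for s in sentences]
    scores = []
    for i, toks in enumerate(tokenized):
        rest = [t for j, other in enumerate(tokenized) if j != i for t in other]
        scores.append(rouge1_f1(toks, rest))
    n_gaps = max(1, round(len(sentences) * gap_ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_gaps])


doc = ["Pegasus is mythical .", "It is pure white .", "It names the model ."]
print(select_gap_sentences(doc))  # [1]
```

The selected indices would then decide which sentences get replaced by <mask_1> in the encoder input and concatenated into the decoder target. For real data, a proper ROUGE implementation and a sentence splitter suited to the language would replace the whitespace tokenization used here.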

patrickvonplaten · April 25, 2021, 6:07pm

Answered here: Pretrain PEGASUS from scratch · Issue #8536 · huggingface/transformers · GitHub
