Hi @sgugger ,
I want to pre-train a PEGASUS model from scratch. Can you give me some suggestions?
First, can I follow the approach from "How to train a new language model from scratch using Transformers and Tokenizers" to train this model from scratch?
Secondly, how can I control the <mask_1> tokens used to mask sentences (the GSG objective)? And how can I specify a sentence-masking strategy like the one selected in the PEGASUS paper?
Thank you very much!
Can anyone give me some suggestions?
https://github.com/huggingface/transformers/tree/master/examples/language-modeling
I think you can start from here to fine-tune PEGASUS on the language you want; you should choose causal language modeling for PEGASUS, if I'm not mistaken.
I don't think so.
Besides the MLM objective used by BERT-based models, PEGASUS has another special training objective called GSG, and that is what makes it powerful for abstractive text summarization.
I want to build that kind of powerful model for my language.
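For context, a GSG training instance roughly works like this: one or more "gap" sentences are removed from the document, each is replaced by a <mask_1> token on the encoder side, and the decoder is trained to generate the removed sentences. A minimal sketch of building such an instance (the helper make_gsg_example and the naive sentence handling are my own illustration, not an official API):
def make_gsg_example(sentences, gap_indices, mask_token="<mask_1>"):
    # Encoder input: the document with each selected gap sentence replaced by the mask token.
    encoder_text = " ".join(
        mask_token if i in gap_indices else s for i, s in enumerate(sentences)
    )
    # Decoder target: the removed (gap) sentences, concatenated in order.
    target_text = " ".join(sentences[i] for i in sorted(gap_indices))
    return encoder_text, target_text

sentences = ["Pegasus is mythical .", "It is a winged horse .", "It names the model ."]
encoder_text, target_text = make_gsg_example(sentences, gap_indices={1})
# encoder_text: "Pegasus is mythical . <mask_1> It names the model ."
# target_text:  "It is a winged horse ."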
I don't think pre-training PEGASUS is supported yet.
I would also be interested in this
@patrickvonplaten
In order to implement the PEGASUS pretraining objective ourselves, could we follow the same approach you suggested for mBART?
Adapting it to the objective presented in the paper, it would become:
from transformers import PegasusTokenizer, PegasusForConditionalGeneration, PegasusConfig
tok = PegasusTokenizer.from_pretrained("google/pegasus-large")
# fresh config, randomly initialized weights, so we pre-train from scratch
model = PegasusForConditionalGeneration(PegasusConfig())
# <mask_1> marks a masked gap sentence (GSG), <mask_2> marks MLM-style token masking
input_string = "Pegasus is <mask_2> . <mask_1> it <mask_2> the model ."
decoder_input_string = "<s> It is pure white ."
labels_string = "It is pure white . <eos>"
input_ids = tok(input_string, add_special_tokens=False, return_tensors="pt").input_ids
decoder_input_ids = tok(decoder_input_string, add_special_tokens=False, return_tensors="pt").input_ids
labels = tok(labels_string, add_special_tokens=False, return_tensors="pt").input_ids
loss = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels)[0]
Naturally, to automate this for pretraining one would need to implement the mask-selection procedure over the dataset (picking the top-ROUGE sentences as gap sentences).
Then the loss we compute would go into the _step method.
Is this reasonable, or am I missing something?
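To follow up on the mask selection, here is a rough sketch of how the top-ROUGE gap-sentence selection could look, using the rouge_score package. The function select_gap_sentences, the gap_ratio parameter, and the independent-scoring strategy (roughly the paper's Ind-Orig variant) are my own assumptions, not code from the paper or the library:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def select_gap_sentences(sentences, gap_ratio=0.3):
    # Score each sentence against the rest of the document with ROUGE-1 F1,
    # then keep the top-scoring fraction as gap sentences.
    num_gaps = max(1, int(len(sentences) * gap_ratio))
    scores = []
    for i, sent in enumerate(sentences):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scores.append((scorer.score(rest, sent)["rouge1"].fmeasure, i))
    top = sorted(scores, reverse=True)[:num_gaps]
    return sorted(i for _, i in top)

sentences = ["Pegasus was sired by Poseidon .", "It is a winged horse .", "It names the model ."]
gap_indices = select_gap_sentences(sentences)  # e.g. [1]
The selected indices would then be used to build the <mask_1> encoder inputs and decoder targets, and the resulting loss computed as in the snippet above.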