I’d like to continue pretraining Pegasus (motivated by [2004.10964] Don't Stop Pretraining: Adapt Language Models to Domains and Tasks), and I’m trying to see whether I can do this with Hugging Face.
The pretraining objective of Pegasus is gap-sentence generation, so I am wondering whether I can approximate it as follows:
Given unlabelled documents D = [d_1, d_2, …, d_n]:

1. For each d_i, extract the m “important” sentences x_1, …, x_m.
2. Concatenate x_1, …, x_m to obtain s_i, and use s_i as an approximation of a summary of d_i.
3. Remove each x_j from d_i to obtain z_i, so that recovering s_i from z_i approximates the task of summarising d_i.
4. This yields a labelled dataset [(z_1, s_1), (z_2, s_2), …, (z_n, s_n)].
5. Fine-tune on this dataset, predicting s_i from z_i (sketched below).
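To make the construction concrete, here is a rough sketch of how I imagine building the (z_i, s_i) pairs. The sentence splitter and the word-overlap importance score are placeholders of my own (PEGASUS itself scores sentences with ROUGE), so this is only an approximation of the selection step, not the paper’s exact procedure:

```python
import re

def split_sentences(text):
    # Naive sentence splitter; a real pipeline would use nltk or spaCy instead.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def importance(sentence, others):
    # Crude word-overlap proxy for the ROUGE-based gap-sentence selection
    # described in the PEGASUS paper.
    sent_words = set(sentence.lower().split())
    other_words = set(" ".join(others).lower().split())
    return len(sent_words & other_words) / len(sent_words) if sent_words else 0.0

def make_pseudo_pair(document, m=3):
    sentences = split_sentences(document)
    scored = sorted(
        ((importance(s, sentences[:i] + sentences[i + 1:]), i)
         for i, s in enumerate(sentences)),
        reverse=True,
    )
    top = {i for _, i in scored[:m]}
    s_i = " ".join(sentences[i] for i in sorted(top))                   # pseudo-summary
    z_i = " ".join(s for i, s in enumerate(sentences) if i not in top)  # gapped source
    return z_i, s_i

# documents = [d_1, ..., d_n]  (unlabelled, in-domain)
# pairs = [make_pseudo_pair(d) for d in documents]
```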
After this, I would fine-tune on my dataset of documents with real, human-written summaries.
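For the training step itself (first on the pseudo-pairs, then reused on the real summaries), I was picturing something like the following with Hugging Face. The checkpoint name, max lengths and hyperparameters are just placeholders, `pairs` comes from the sketch above, and the `text_target=` argument assumes a reasonably recent transformers version:

```python
from datasets import Dataset
from transformers import (
    DataCollatorForSeq2Seq,
    PegasusForConditionalGeneration,
    PegasusTokenizerFast,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/pegasus-large"  # placeholder checkpoint
tokenizer = PegasusTokenizerFast.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

# pairs = [(z_1, s_1), ..., (z_n, s_n)] built from the unlabelled documents
sources, targets = zip(*pairs)
raw_ds = Dataset.from_dict({"document": list(sources), "summary": list(targets)})

def preprocess(batch):
    # Tokenize the gapped documents as inputs and the pseudo-summaries as labels.
    model_inputs = tokenizer(batch["document"], truncation=True, max_length=1024)
    labels = tokenizer(text_target=batch["summary"], truncation=True, max_length=256)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_ds = raw_ds.map(preprocess, batched=True, remove_columns=["document", "summary"])

args = Seq2SeqTrainingArguments(
    output_dir="pegasus-domain-adapted",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=5e-5,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```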
Would this lead to the domain-adaptive pretraining I am seeking, or does this give me something else?