I’d like to continue pretraining Pegasus (motivated by [2004.10964] Don't Stop Pretraining: Adapt Language Models to Domains and Tasks), and I’m trying to see whether I can do this with Hugging Face.
The pretraining objective of Pegasus is gap-sentence generation, so I am wondering whether I can approximate it as follows:
Given unlabelled documents D = [d_1, d_2, …, d_n]:

1. For each d_i, extract the m “important” sentences x_1, …, x_m.
2. Concatenate x_1, …, x_m to obtain s_i, and use s_i as an approximation of a summary of d_i.
3. Remove each x_j from d_i to obtain z_i, so that recovering s_i from z_i approximates the task of summarising d_i.
4. This yields a labelled dataset [(z_1, s_1), (z_2, s_2), …, (z_n, s_n)].
5. Fine-tune on this dataset, predicting s_i from z_i (sketched below).
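To make the construction concrete, here is a rough sketch of how I imagine building the (z_i, s_i) pairs. The sentence splitter and the word-overlap importance score are placeholders of my own (PEGASUS itself scores sentences with ROUGE), so this is only an approximation of the selection step, not the paper’s exact procedure:

```python
import re

def split_sentences(text):
    # Naive sentence splitter; a real pipeline would use nltk or spaCy instead.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def importance(sentence, others):
    # Crude word-overlap proxy for the ROUGE-based gap-sentence selection
    # described in the PEGASUS paper.
    sent_words = set(sentence.lower().split())
    other_words = set(" ".join(others).lower().split())
    return len(sent_words & other_words) / len(sent_words) if sent_words else 0.0

def make_pseudo_pair(document, m=3):
    sentences = split_sentences(document)
    scored = sorted(
        ((importance(s, sentences[:i] + sentences[i + 1:]), i)
         for i, s in enumerate(sentences)),
        reverse=True,
    )
    top = {i for _, i in scored[:m]}
    s_i = " ".join(sentences[i] for i in sorted(top))                   # pseudo-summary
    z_i = " ".join(s for i, s in enumerate(sentences) if i not in top)  # gapped source
    return z_i, s_i

# documents = [d_1, ..., d_n]  (unlabelled, in-domain)
# pairs = [make_pseudo_pair(d) for d in documents]
```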
After this, I would fine-tune on my dataset of documents with real, human-written summaries.
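For the training step itself (first on the pseudo-pairs, then reused on the real summaries), I was picturing something like the following with Hugging Face. The checkpoint name, max lengths and hyperparameters are just placeholders, `pairs` comes from the sketch above, and the `text_target=` argument assumes a reasonably recent transformers version:

```python
from datasets import Dataset
from transformers import (
    DataCollatorForSeq2Seq,
    PegasusForConditionalGeneration,
    PegasusTokenizerFast,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/pegasus-large"  # placeholder checkpoint
tokenizer = PegasusTokenizerFast.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

# pairs = [(z_1, s_1), ..., (z_n, s_n)] built from the unlabelled documents
sources, targets = zip(*pairs)
raw_ds = Dataset.from_dict({"document": list(sources), "summary": list(targets)})

def preprocess(batch):
    # Tokenize the gapped documents as inputs and the pseudo-summaries as labels.
    model_inputs = tokenizer(batch["document"], truncation=True, max_length=1024)
    labels = tokenizer(text_target=batch["summary"], truncation=True, max_length=256)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_ds = raw_ds.map(preprocess, batched=True, remove_columns=["document", "summary"])

args = Seq2SeqTrainingArguments(
    output_dir="pegasus-domain-adapted",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=5e-5,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```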
Would this lead to the domain-adaptive pretraining I am seeking, or does this give me something else?