Domain-specific pre-training of GPT? Help!

I am looking to adapt GPT-2 to generate dialogue utterances in the style of a certain demographic/population (let’s call them Ogres). However, there are no large datasets that are both 1) dialogue datasets and 2) generated by this target demographic.

In the absence of such data, I have been considering a few approaches for data augmentation. Many of those approaches would benefit from a GPT-Ogre that is at least capable of generating text in the Ogre style, even if not necessarily dialogic text.

Approach 1
==========

For this, I am considering performing additional pre-training of, say, GPT-2 on some medium-sized corpora generated by Ogres. This sounds like something a lot of people should have done for a lot of different domains by now, but apart from some papers that have tried it with BERT in the medical domain, I was not able to find any papers/GitHub repos that have done additional unsupervised pre-training of GPT.

It would be helpful if someone could point me to resources on this. The hyperparameter space (best learning rate, etc.) feels too large to search from scratch, and if somebody has already done this, it would be much easier to replicate their setup.
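Concretely, by "additional pre-training" I mean something like the following sketch: continuing the causal LM objective on an Ogre corpus with Hugging Face Transformers. The corpus path, block size, and every hyperparameter here are placeholders I made up, not values I have validated.

```python
# Sketch: continued (domain-adaptive) pre-training of GPT-2 on a plain-text
# Ogre corpus. File path and hyperparameters are hypothetical placeholders.
from transformers import (
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    TextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# One big plain-text file of Ogre-authored text (hypothetical path).
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="ogre_corpus.txt",
    block_size=512,
)

# Causal LM objective: no masking, labels are just the inputs.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="gpt2-ogre",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,   # guess; this is exactly the hyperparameter I'm unsure about
    warmup_steps=500,
    save_steps=1000,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

trainer.train()
trainer.save_model("gpt2-ogre")
```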

Approach 2
==========

There are some dialogue-specific GPT models, such as DialoGPT, that have been fine-tuned (in a supervised way, mind you, not pre-trained in an unsupervised way). However, they are not in the Ogre style. Is it a ridiculous idea to perform additional pre-training on top of an already fine-tuned GPT-2 model?
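In other words, something like starting the same continued-pre-training loop from the DialoGPT checkpoint instead of vanilla GPT-2 (sketch only; whether the dialogue fine-tuning survives the extra LM training is exactly what I am unsure about):

```python
# Same continued-pre-training setup as the Approach 1 sketch, just starting
# from DialoGPT. The checkpoint name is real; everything else is the same
# hypothetical setup as above.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
# ...then reuse the Trainer setup from the Approach 1 sketch, pointed at the
# Ogre corpus.
```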

Random idea, but if you have many dialogues that mostly share common syntax and grammar, with only small differences between populations, you could include the population identity as part of the model's input, train on that, and then prompt the trained model with "Ogre" (or whatever tag you choose) in the same manner.
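Something like this, just to illustrate the tagging idea (the tag format and example data are made up):

```python
# Sketch of control-token conditioning: prepend a population tag to each
# training example so the model learns to associate the tag with the style.
examples = [
    ("Ogre",  "Me smash puny village, then nap."),
    ("Human", "I'll stop by the market this afternoon."),
]

def to_training_text(population, utterance):
    # e.g. "<Ogre> Me smash puny village, then nap."
    return f"<{population}> {utterance}"

train_texts = [to_training_text(p, u) for p, u in examples]

# At inference time, prompt with the tag to steer the style:
prompt = "<Ogre> "
```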

On another note: if you want to do further pre-training, the common wisdom is to reduce the learning rate quite a bit.
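e.g. if you reuse the Trainer setup sketched above, that just means dropping the learning rate to something noticeably smaller (the exact value below is a guess, not a recommendation):

```python
# Lower learning rate for continued pre-training (value is illustrative only).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt2-ogre",
    learning_rate=1e-5,   # well below the common ~5e-5 fine-tuning default
    warmup_steps=500,
)
```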