Domain-specific pre-training of GPT? Help!

I am looking to adapt GPT-2 to generate dialogue utterances in the style of a certain demographic/population (let’s call them Ogres). However, there are no large datasets that are both 1) dialogue datasets and 2) generated by this target demographic.

In the absence of such data, I have been considering a few approaches for data augmentation purposes. Many of those approaches would benefit from a GPT-Ogre, which is at least capable of generating text similar to Ogres, if not necessarily dialogic.

Approach 1


For this, I am considering performing additional pre-training of, say, GPT-2 on some medium-sized corpora generated by Ogres. This sounds like something that a lot of people should have done for a lot of different domains by now, but apart from some papers that have tried this with BERT in the medical domain, I was not able to find any papers/GitHub repos that have done additional unsupervised pre-training of GPT.

It would be helpful if someone could point me to some resources on this, as I feel the hyperparameter space (the best learning rate, etc.) is too large to search myself, and if somebody has already done this, it would be easy to replicate their setup.

Approach 2


There are some dialogue-specific GPT models, such as DialoGPT, that have been fine-tuned (in a supervised way; mind you, not pretrained in an unsupervised way). However, they are not in the Ogre style. I am wondering if it’s a ridiculous idea to perform additional pre-training of a fine-tuned GPT-2 model?

Random idea, but if you have many dialogues that mostly share common syntax and grammar except for small differences, you can use the population identity as part of the input to a model, train it, then prompt the model with “Ogres” or whatever in the same manner.
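
To make that concrete, here is a rough sketch of the tagging step only, in plain Python. The `<|ogre|>` tag format and the helper names are made up for illustration (they are not from any library), and you would still need to tokenize these lines and train on them as usual.

```python
# Hypothetical control-tag scheme: the tag format and function names are
# illustrative assumptions, not part of GPT-2 or any library.

def population_tag(population: str) -> str:
    """Build the tag that marks which population produced an utterance."""
    return f"<|{population.lower()}|>"

def tag_utterance(population: str, utterance: str) -> str:
    """Prefix an utterance with its population tag for training."""
    return f"{population_tag(population)} {utterance.strip()}"

def build_training_lines(pairs):
    """pairs: iterable of (population, utterance) tuples."""
    return [tag_utterance(pop, utt) for pop, utt in pairs]

# Toy data standing in for a real mixed-population dialogue corpus.
examples = [
    ("Ogre", "Get out of my swamp. "),
    ("Human", "Lovely weather today."),
]
training_lines = build_training_lines(examples)

# At generation time, start the prompt with the same tag used in training.
prompt = population_tag("Ogre")
```

The point of the design is that the non-Ogre dialogues still teach the model how dialogue works, while the tag lets it pick up the small Ogre-specific differences.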

On another note: if you want to do further pre-training, the common wisdom is to reduce the learning rate quite a bit.
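
As a rough illustration of what a reduced learning rate could look like, here is a small schedule in plain Python: linear warmup to a scaled-down peak, then linear decay to zero. The numbers (`base_lr=5e-5`, `scale=0.1`, `warmup_steps=100`) are placeholder assumptions, not recommended values, so treat them as a starting point for your own search.

```python
def reduced_lr(step, total_steps, base_lr=5e-5, scale=0.1, warmup_steps=100):
    """Learning rate at a given step for continued pre-training.

    All default values are illustrative assumptions. The peak is
    base_lr * scale, i.e. deliberately lower than the original rate.
    """
    peak = base_lr * scale
    if step < warmup_steps:
        # Linear warmup from near zero up to the reduced peak.
        return peak * (step + 1) / warmup_steps
    # Linear decay from the reduced peak down to zero at total_steps.
    remaining = max(total_steps - step, 0)
    return peak * remaining / max(total_steps - warmup_steps, 1)

# Example: a 1000-step run.
schedule = [reduced_lr(s, 1000) for s in range(1001)]
```

A function like this can be plugged into whatever training loop you use by setting the optimizer's learning rate from it at each step.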