Fine-tuning GPT-J for conversations


I would like to fine-tune a GPT-J model for conversations and run it locally on my machine. There are two models I can use: the original GPT-J model, or the quantized EleutherAI/gpt-j-6b with 8-bit weights. My machine has a 24GB GPU (RTX 3090). How much GPU memory does the original GPT-J model need for fine-tuning, and how much for inference? As far as I understand, the main advantage of the quantized GPT-J is that it needs less GPU memory.
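
For a rough sense of scale, here is a back-of-the-envelope sketch of the memory needed just to hold the model. It assumes about 6.05 billion parameters for GPT-J-6B and counts only weights (plus gradients and Adam optimizer state for full fine-tuning in fp32). Activations, the attention cache, and framework overhead are not included, so real usage will be higher.

```python
# Rough memory estimate for GPT-J-6B. All numbers are approximations.
N_PARAMS = 6.05e9          # assumption: approximate parameter count of GPT-J-6B
GIB = 1024 ** 3            # bytes per GiB


def weights_gib(bytes_per_param: float) -> float:
    """Memory in GiB when every parameter costs `bytes_per_param` bytes."""
    return N_PARAMS * bytes_per_param / GIB


# Inference: only the weights need to be resident.
inference = {
    "fp32": weights_gib(4),
    "fp16": weights_gib(2),
    "int8": weights_gib(1),
}

# Full fine-tuning with Adam in fp32:
# 4 bytes weights + 4 bytes gradients + 8 bytes Adam moments = 16 bytes/param.
full_finetune_adam_fp32 = weights_gib(16)

for name, gib in inference.items():
    print(f"inference {name}: ~{gib:.1f} GiB (weights only)")
print(f"full fine-tune, Adam fp32: ~{full_finetune_adam_fp32:.1f} GiB (no activations)")
```

Under these assumptions, fp16 or 8-bit weights fit on a 24GB card for inference, while full fine-tuning with a standard optimizer does not without offloading or a parameter-efficient method.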

Second, I would like to fine-tune the GPT-J model on conversation datasets such as DailyDialog, Blended Skill Talk (but without the different personas), Multi-Session Chat and Wizard of the Internet.

In general, for fine-tuning GPT-J should I just format the conversation in the following way?

Person_a: Say , Jim , how about going for a few beers after dinner ?
Person_b: You know that is tempting but is really not good for our fitness.
Person_a: What do you mean ? It will help us to relax .

Or are other delimiters such as <|endoftext|> necessary? During inference, when the user sends, for example, “Hello, how are you?” to the chatbot, I would format it as “Person_a: Hello, how are you? Person_b:”.
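
To make the format concrete, here is a minimal sketch of what I have in mind. The helper names `format_dialogue` and `build_prompt` are made up for illustration. Appending <|endoftext|> after each conversation and separating turns with newlines (rather than spaces) at inference time are assumptions on my part, not something I know to be required.

```python
# GPT-J uses the GPT-2 vocabulary, where "<|endoftext|>" is the end-of-text token.
EOS = "<|endoftext|>"
SPEAKERS = ("Person_a", "Person_b")  # speaker labels from the example above


def format_dialogue(turns, speakers=SPEAKERS, eos=EOS):
    """Turn a list of utterances into one training string.

    Speakers alternate turn by turn; the end-of-text token is appended
    so separate conversations can be told apart (an assumption).
    """
    lines = [f"{speakers[i % 2]}: {t.strip()}" for i, t in enumerate(turns)]
    return "\n".join(lines) + eos


def build_prompt(history, speakers=SPEAKERS):
    """Build an inference prompt that ends with the next speaker's label."""
    lines = [f"{speakers[i % 2]}: {t.strip()}" for i, t in enumerate(history)]
    next_speaker = speakers[len(history) % 2]
    return "\n".join(lines) + f"\n{next_speaker}:"


if __name__ == "__main__":
    print(format_dialogue([
        "Say , Jim , how about going for a few beers after dinner ?",
        "You know that is tempting but is really not good for our fitness.",
    ]))
    print(build_prompt(["Hello, how are you?"]))
```

At generation time the output would presumably need to be cut off at the next "Person_a:" or at the end-of-text token, whichever comes first.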

Finally, for fine-tuning I see the following options:

  1. Fine-tuning on only one conversation dataset.
  2. Fine-tuning on several conversation datasets and just stacking the datasets.
  3. Fine-tuning on the first dataset, then fine-tuning on the second dataset and so on.

Which of these three options is best?
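
For option 2, "stacking" would look roughly like the sketch below, where each dataset is already a list of formatted conversation strings. The function name `stack_datasets` is made up, and the shuffling is my own assumption so that batches mix the sources. Option 3 would instead run one training pass per dataset, in order, without mixing.

```python
import random


def stack_datasets(*datasets, seed=0):
    """Concatenate several lists of training examples and shuffle them.

    Shuffling with a fixed seed keeps the order reproducible while
    interleaving examples from the different sources.
    """
    combined = [example for dataset in datasets for example in dataset]
    random.Random(seed).shuffle(combined)
    return combined


if __name__ == "__main__":
    # Placeholder strings standing in for formatted conversations.
    dataset_a = ["conversation A1", "conversation A2"]
    dataset_b = ["conversation B1"]
    print(stack_datasets(dataset_a, dataset_b))
```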

I’d appreciate any input. Thank you very much in advance.

You could use DeepSpeed to reduce the hardware requirements for training.
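
As a starting point, a DeepSpeed configuration along these lines enables ZeRO stage 3 with the optimizer state and parameters offloaded to CPU memory. This is only a sketch: the batch size and gradient accumulation values are placeholders, and whether it actually fits a 6B model on a single 24GB GPU depends on your CPU RAM, sequence length and batch size, which I have not verified.

```python
import json

# Sketch of a DeepSpeed config using ZeRO-3 with CPU offload.
# The numeric values are placeholders, not tuned recommendations.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # Keep Adam state in CPU RAM instead of GPU memory.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # Keep parameters in CPU RAM and fetch them to the GPU when needed.
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

# Serialize to JSON; this string can be saved to a file and handed to DeepSpeed.
ds_config_json = json.dumps(ds_config, indent=2)
print(ds_config_json)
```

The trade-off is speed: offloading moves data between CPU and GPU on every step, so training is slower than keeping everything on the GPU.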