I’ve been attempting to fine-tune GPT on my own data, following the example from the “Fine-tuning a model” section of the Hugging Face course. I’ve had no problem following the examples from the course, or fine-tuning other models for different tasks.
Listing everything I’ve tried and every error I’ve hit would take too long, since I’ve been stuck on this all day. So I was wondering if anyone has a Python code snippet where they accomplished this, so I could borrow some of the ideas and understanding for my own script.
I apologize if it’s a vague question, but I’m so clueless I don’t know where else to start.
What I think I’ve learned:
The GPT models don’t come with a padding token
You should remove the [“text”] column before feeding the tokenized data to the model
I’ve enclosed an image of my current code; the train_dataloader raises the following error in this version, but I have no idea if I’m on the right track:
“ValueError: text input must of type
str (single example),
List[str] (batch or single pretokenized example) or
List[List[str]] (batch of pretokenized examples).”
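For what it’s worth, that ValueError usually means the tokenizer was handed something other than a string or list of strings, e.g. a whole batch dict, or data that was already tokenized into integer ids. A quick way to reproduce and fix it (a sketch assuming the GPT-2 tokenizer; the field name "text" matches what `load_dataset("text", ...)` produces):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
batch = {"text": ["first line", "second line"]}

# Wrong: passing the whole dict reproduces the error:
#   ValueError: text input must of type `str` (single example), ...
# tokenizer(batch)

# Right: index out the "text" field so the tokenizer sees List[str].
enc = tokenizer(batch["text"])
print(len(enc["input_ids"]))  # one list of ids per input string
```

So if the error fires when iterating the DataLoader, it is worth checking whether the tokenizer is being called somewhere in the collation path on non-string data.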
The data in the .txt is just:
and it seems to generate correct inputs with the tokenizer.