I’ve been attempting to fine-tune GPT on my own data, following the example from the “Fine-tuning a model” section of the Hugging Face course. I’ve had no problem following the examples from the course, or fine-tuning other models for different tasks.
Listing everything I’ve tried and every error I’ve hit would take too long, since I’ve been stuck on this all day. So I was wondering if anyone has a Python code snippet where they accomplished this, so I could borrow some of the ideas and understanding for my own script.
I apologize if it’s a vague question, but I’m so clueless I don’t know where else to start.
What I think I’ve learned:
The GPT models don’t come with a padding token
You should remove the [“text”] column before feeding the tokenized data to the model
I’ve enclosed an image of my current code; the train_dataloader raises the following error in this version, but I have no idea if I’m on the right track:
“ValueError: text input must of type
str (single example),
List[str] (batch or single pretokenized example) or
List[List[str]] (batch of pretokenized examples).”
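For what it’s worth, that ValueError usually means the tokenizer was handed something other than a string or list of strings, e.g. a whole batch dict, or data that was already tokenized into integer ids. A quick way to reproduce and fix it (a sketch assuming the GPT-2 tokenizer; the field name "text" matches what `load_dataset("text", ...)` produces):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
batch = {"text": ["first line", "second line"]}

# Wrong: passing the whole dict reproduces the error:
#   ValueError: text input must of type `str` (single example), ...
# tokenizer(batch)

# Right: index out the "text" field so the tokenizer sees List[str].
enc = tokenizer(batch["text"])
print(len(enc["input_ids"]))  # one list of ids per input string
```

So if the error fires when iterating the DataLoader, it is worth checking whether the tokenizer is being called somewhere in the collation path on non-string data.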
The data in the .txt is just:
and it seems to generate correct inputs with the tokenizer.