Fine-tuning and retokenizing

Hello,

I’m trying to fine-tune GPT-2 for text generation on a custom dataset I gathered. The dataset is around 35 MB, which I think should be large enough for fine-tuning.

I load the dataset and GPT-2’s tokenizer, retrain the tokenizer on this dataset (since it contains some slightly unusual vocabulary), load a pretrained GPT-2 model, and train it; a rough sketch of this setup is included below. I have tried multiple learning rates, epoch counts, and other hyperparameters, but the results are always the same:

  1. At low epoch counts, the conditional text generation is completely incoherent, which I take to be underfitting behavior;
    or
  2. At higher epoch counts, the conditional text generation regurgitates verbatim text from the dataset, which I take to be overfitting behavior.

There seems to be no middle ground: the moment the text generation becomes coherent, it is because the model is outputting text directly from the dataset, so it is not generalizing at all.
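
For reference, this is roughly what my training setup looks like. It’s a simplified sketch assuming the Hugging Face transformers and datasets libraries; the file name my_dataset.txt and the hyperparameters are just placeholders.

```python
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
)

# Load my text corpus (placeholder file name).
corpus = load_dataset("text", data_files={"train": "my_dataset.txt"})["train"]

# Load GPT-2's tokenizer, then retrain it on my corpus -- the step I now
# suspect is the problem, since the new token IDs no longer match the
# embeddings the pretrained model was trained with.
base_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer = base_tokenizer.train_new_from_iterator(
    (row["text"] for row in corpus), vocab_size=base_tokenizer.vocab_size
)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Load the pretrained model and fine-tune it on the tokenized corpus.
model = GPT2LMHeadModel.from_pretrained("gpt2")

tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gpt2-finetuned",
        num_train_epochs=3,            # I've tried anywhere from a few to many
        learning_rate=5e-5,            # and several learning rates
        per_device_train_batch_size=4,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```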

After thinking about this for a while, could the issue be that I am retraining the tokenizer on my dataset?
If so, is there a way to insert some custom vocabulary into GPT-2’s tokenizer without running into this problem? Or will the pretrained tokenizer already be able to handle the vocabulary I am concerned about?

For context, the uncommon vocabulary consists mostly of acronyms; the rest is just English.
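
To make that concrete, is something along these lines a sensible way to add the acronyms? This is just a sketch using the add_tokens / resize_token_embeddings API from the Hugging Face transformers library; the acronyms shown are placeholders.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Placeholder acronyms standing in for the unusual vocabulary in my dataset.
custom_vocab = ["ACRN", "XYZQ"]

# Add only the tokens the tokenizer doesn't already have as single tokens,
# then grow the model's embedding matrix to cover the enlarged vocabulary.
num_added = tokenizer.add_tokens(custom_vocab)
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```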

TL;DR
Can you use a pretrained GPT-2 for text generation and fine-tune it on a smaller (35 MB) custom dataset while also retraining its tokenizer?

If not:
Is there an effective way to append custom vocabulary to the pretrained model’s tokenizer?

Edit:
I just tried using the pretrained tokenizer as-is, and it works now. Retraining the tokenizer was definitely the source of the issue!