Fine-tuning and retokenizing


I’m trying to fine-tune GPT-2 for text generation on a custom dataset I gathered. The dataset is around 35MB, which I think should be a sufficient size.

I loaded the dataset and GPT-2’s tokenizer, retrained the tokenizer on this dataset (since it contains some slightly unusual vocabulary), then loaded a pretrained GPT-2 model and fine-tuned it. I have tried multiple learning rates, epoch counts, and other hyperparameters, but the results are always the same:

  1. At low epoch counts, the conditional text generation is completely incoherent, which I think is underfitting.
  2. At higher epoch counts, the conditional text generation regurgitates verbatim text from the dataset, which I think is overfitting.

There seems to be no middle ground: the moment the generated text becomes coherent, it is because the model is outputting text directly from the dataset, so it is not generalizing at all.

After thinking about this for a while, I suspect the issue is that I’m retraining the tokenizer on my dataset: the retrained tokenizer assigns completely different IDs to tokens, so the pretrained model’s learned embeddings no longer line up with the input it sees.
If so, is there a way to insert some custom vocabulary into GPT-2’s existing tokenizer without running into this problem? Or will the tokenizer be able to handle the vocabulary I’m concerned about as-is?

For context, the uncommon vocabulary consists mostly of acronyms; the rest of the vocabulary is just English.

Can you use a pretrained GPT-2 for text generation and fine-tune it on a smaller (35MB) custom dataset while also retraining its tokenizer?

If not:
Is there a way to append custom vocabulary to the pretrained model’s tokenizer in an effective way?

Update: I just tried using the pretrained tokenizer as-is, and it works now. Retraining the tokenizer was definitely the source of the issue!