Fine-tuning the GPT-2 Family (Small to XL): How should hyperparameters and generation criteria change?

Hello everyone,

I’m working on a project that involves fine-tuning the entire GPT-2 family (from Small to XL) on ~3,000 prompt-completion pairs of fairly short length (around 150 characters each).

So far, with the same hyperparameters, I am observing much more coherent language generation on my fine-tuning task with GPT-2 Small than with GPT-2 XL. The train and eval loss curves also give no indication of overfitting, so I figure this has to be a function of my training (or sampling) process. Could anybody give me some intuition on why a larger model might perform worse with the same hyperparameters, and how I should vary the hyperparameters, training process, or generation criteria as I increase model size?

My hyperparameters (sketched in code just after the list):

  • epochs: 7 (but with early stopping)
  • early_stopping_patience: 10 (on eval_loss)
  • bf16: true
  • learning_rate: 0.0003
  • adam_epsilon: 1e-7 (AdamW)
  • batch_size: 1 (eval and train)
  • weight_decay: 0.01
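
For concreteness, here is roughly how those hyperparameters map onto the Hugging Face `TrainingArguments`/`Trainer` API. This is a minimal sketch, not my exact script: the output path and the `train_dataset`/`eval_dataset` variables are placeholders for my actual tokenized data.

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # swapped to "gpt2-medium", "gpt2-large", "gpt2-xl" per run

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",   # placeholder path
    num_train_epochs=7,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=3e-4,
    weight_decay=0.01,
    adam_epsilon=1e-7,
    bf16=True,
    eval_strategy="epoch",           # "evaluation_strategy" in older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,     # required for EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,     # placeholder: tokenized prompt-completion pairs
    eval_dataset=eval_dataset,       # placeholder: held-out split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],
)
trainer.train()
```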

My sampling/generation criteria (generation call sketched after the list):

  • max_length: 300
  • temperature: 0.6
  • repetition_penalty: 1.1
  • top_k: 100
  • top_p: 1
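
And the generation side, roughly, using `model.generate`. Again a sketch: the checkpoint path and prompt are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./gpt2-finetuned"  # placeholder: the fine-tuned checkpoint from above
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

prompt = "An example prompt from my task"  # placeholder
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    do_sample=True,              # sampling must be on for temperature/top_k/top_p to apply
    max_length=300,
    temperature=0.6,
    repetition_penalty=1.1,
    top_k=100,
    top_p=1.0,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```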

My intuition here is that the number of epochs plays a major role. Would it be fair to assume that a smaller model needs more instruction (and thus more epochs)? I think my larger model (GPT-2 XL) might only need 2 or 3 epochs, since its language generation out of the box is already better than GPT-2 Small's. That being said, the train/eval curves do not tell that story.
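
If that intuition holds, I imagine varying something like the following per model size. The numbers here are guesses to illustrate the idea, not values I have validated.

```python
from transformers import TrainingArguments

# Hypothetical per-size overrides: smaller models get more epochs and a higher
# learning rate, larger models fewer epochs and a lower one (illustrative values only).
size_overrides = {
    "gpt2":        {"learning_rate": 3e-4, "num_train_epochs": 7},
    "gpt2-medium": {"learning_rate": 2e-4, "num_train_epochs": 5},
    "gpt2-large":  {"learning_rate": 1e-4, "num_train_epochs": 4},
    "gpt2-xl":     {"learning_rate": 5e-5, "num_train_epochs": 3},
}

model_name = "gpt2-xl"
training_args = TrainingArguments(
    output_dir=f"./{model_name}-finetuned",  # placeholder path
    **size_overrides[model_name],
    # remaining arguments as in the first sketch above
)
```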

Thanks in advance!
Andrew