I’m working on a project that involves fine-tuning the entire GPT-2 family (small through XL) on ~3000 prompt-completion pairs, each fairly short (around 150 characters).
So far, with the same hyperparameters, I’m observing much more coherent generation on my fine-tuning task from GPT-2 small than from GPT-2 XL. The train and eval loss curves give no indication of overfitting either, so I figure this has to be a function of my training (or sampling) process. Could anybody give me some intuition on why a larger model might perform worse with the same hyperparameters, and how I should vary the hyperparameters, training process, or generation criteria as I increase model size?
My training hyperparameters:
- epochs: 7 (with early stopping)
- early_stopping_patience: 10 (on eval_loss)
- bf16: true
- learning_rate: 0.0003
- epsilon: 1e-7 (AdamW)
- batch_size: 1 (eval and train)
- weight_decay: 0.01
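For concreteness, here is a minimal sketch of how I'd express those settings with Hugging Face's `Trainer` API (this is an assumed setup, not necessarily my exact script; `output_dir` is a placeholder, and the model/dataset loading is omitted):

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Sketch only: the hyperparameters listed above, mapped onto TrainingArguments.
training_args = TrainingArguments(
    output_dir="gpt2-finetune",        # placeholder path
    num_train_epochs=7,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=3e-4,
    weight_decay=0.01,
    adam_epsilon=1e-7,
    bf16=True,
    evaluation_strategy="epoch",       # eval_loss must be computed regularly
    save_strategy="epoch",             # checkpoints must align with eval steps
    load_best_model_at_end=True,       # required for EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)

# Stops training after 10 evaluations with no eval_loss improvement.
early_stopping = EarlyStoppingCallback(early_stopping_patience=10)
```

(In recent transformers versions `evaluation_strategy` has been renamed `eval_strategy`, so adjust for whichever release you're on.)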
My sampling/generation criteria:
- max_length: 300
- temperature: 0.6
- repetition_penalty: 1.1
- top_k: 100
- top_p: 1
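And the corresponding generation call, roughly (a sketch assuming `model` and `tokenizer` are an already fine-tuned GPT-2 checkpoint and `prompt` is one of my prompts):

```python
# Sketch only: the sampling settings above, as a model.generate() call.
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_length=300,
    do_sample=True,          # temperature/top-k/top-p only apply when sampling
    temperature=0.6,
    repetition_penalty=1.1,
    top_k=100,
    top_p=1.0,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
)
text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
```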
My intuition here is that epochs play a major role. Would it be fair to assume a smaller model needs more instruction (and thus more epochs)? I think my larger model (GPT-2 XL) might only need 2 or 3 epochs, since it starts from better language generation in its pre-trained state than GPT-2 small does. That said, the train/eval curves don't tell that story.
Thanks in advance!