Fine-tuning GPT2 Family (Small to XL): How should hyperparameters and generation criteria change?

andrewbrown · May 30, 2023, 5:25pm

Hello everyone,

I’m working on a project that involves fine-tuning the entire GPT2 family (from small to XL) on ~3000 prompt-completion pairs of short-ish length (around 150 characters each).

So far, with the same hyperparameters, I am observing much more coherent language generation on my fine-tuning task from GPT2 small than from GPT2 XL. The train and eval loss curves give no indication of overfitting, so I figure this has to be a function of my training (or sampling) process. Could anybody give me some intuition on why a larger model might perform worse with the same hyperparameters, and how I should vary the hyperparameters, training process, or generation criteria as I increase model size?

My hyperparameters:

epochs: 7 (but with early stopping)
early_stopping_patience: 10 (on eval_loss)
bf16: true
learning_rate: 0.0003
epsilon: 1e-7 (for AdamW)
batch_size: 1 (eval and train)
weight_decay: 0.01
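
As a rough sanity check on the settings above, here is a minimal sketch of the arithmetic they imply. It assumes things the post does not state: that all ~3000 pairs are used for training, that there is no gradient accumulation, that evaluation runs once per epoch, and that early stopping follows a generic patience rule (the first evaluation sets the baseline, and training stops after `patience` consecutive evaluations without improvement). The function and key names are hypothetical.

```python
import math


def training_schedule(num_examples, batch_size, epochs, patience,
                      evals_per_epoch=1, grad_accum_steps=1):
    """Summarize how many optimizer steps and evaluations a run performs.

    evals_per_epoch and grad_accum_steps are assumptions; the post does
    not say how often evaluation runs or whether gradients are accumulated.
    """
    examples_per_step = batch_size * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / examples_per_step)
    total_evals = epochs * evals_per_epoch
    return {
        "optimizer_steps_per_epoch": steps_per_epoch,
        "total_optimizer_steps": steps_per_epoch * epochs,
        "total_evaluations": total_evals,
        # One evaluation sets the baseline, then `patience` consecutive
        # non-improving evaluations are needed before stopping.
        "early_stopping_can_trigger": total_evals > patience,
    }


# Values taken from the post: ~3000 pairs, batch_size 1, 7 epochs, patience 10.
schedule = training_schedule(num_examples=3000, batch_size=1,
                             epochs=7, patience=10)
for name, value in schedule.items():
    print(f"{name}: {value}")
```

Under these assumptions, each epoch is 3000 single-example updates, and a patience of 10 can never be reached within 7 per-epoch evaluations, so every model would simply train for the full 7 epochs. If evaluation actually runs more often than once per epoch, the last value changes accordingly.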

My sampling/generation criteria:

max_length: 300
temperature: 0.6
repetition_penalty: 1.1
top_k: 100
top_p: 1
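
To make these sampling settings concrete, here is a minimal pure-Python sketch of how temperature, top-k, and top-p are commonly combined to reshape a next-token distribution. It is an illustration under assumptions, not the code of any particular library: the order of operations (temperature, then top-k, then top-p) is the usual convention, the function name and toy logits are hypothetical, and `repetition_penalty` and `max_length` are left out.

```python
import math


def filtered_distribution(logits, temperature=0.6, top_k=100, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k, top-p."""
    # Temperature below 1 sharpens the distribution toward likely tokens.
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring tokens (all, if fewer than k).
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k] if top_k > 0 else order

    # Softmax over the kept tokens, shifted by the max for stability.
    peak = max(scaled[i] for i in kept)
    weights = {i: math.exp(scaled[i] - peak) for i in kept}
    total = sum(weights.values())
    probs = {i: w / total for i, w in weights.items()}

    # Top-p: keep the smallest high-probability prefix whose cumulative
    # mass reaches top_p. With top_p = 1.0 this step is skipped entirely.
    if top_p < 1.0:
        nucleus, cumulative = [], 0.0
        for i in kept:  # `kept` is already sorted by descending score
            nucleus.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        mass = sum(probs[i] for i in nucleus)
        probs = {i: probs[i] / mass for i in nucleus}
    return probs


toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical 4-token vocabulary
as_posted = filtered_distribution(toy_logits, temperature=0.6,
                                  top_k=100, top_p=1.0)
unscaled = filtered_distribution(toy_logits, temperature=1.0,
                                 top_k=100, top_p=1.0)
print(as_posted)
print(unscaled)
```

With `top_p: 1` the nucleus filter does nothing, so in this sketch the only truncation comes from `top_k: 100`, and `temperature: 0.6` does most of the work of concentrating probability on the top tokens.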

My intuition is that the number of epochs plays a major role. Would it be fair to assume that a smaller model needs more instruction (and thus more epochs)? I think my larger model (GPT2 XL) might only need 2 or 3 epochs, since its original pre-trained checkpoint already generates better language than GPT2 small. That said, the train/eval curves do not tell that story.

Thanks in advance!
Andrew
