Learning rate, LR scheduler and optimiser choice for fine-tuning GPT2

I know the best choice depends on the actual dataset you are fine-tuning on, but I am just curious: what combinations of learning rate, LR scheduler and optimiser have you guys found to work well in general? I am currently using AdamW with CosineAnnealingWarmRestarts, with the learning rate going from 0.002 down to 0.0001 and restarting at the end of each epoch.
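For reference, here is a minimal PyTorch sketch of that setup (AdamW plus CosineAnnealingWarmRestarts cycling from 2e-3 down to 1e-4, restarting every epoch). The dummy batch and `steps_per_epoch` value are just placeholders; swap in your own dataloader:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3)

# One cosine cycle per epoch: T_0 = number of optimiser steps in an epoch,
# decaying from the initial 2e-3 down to eta_min=1e-4 before each restart.
steps_per_epoch = 10  # placeholder; use len(train_dataloader) in practice
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=steps_per_epoch, eta_min=1e-4
)

# Dummy batch standing in for real fine-tuning data.
batch = tokenizer("A short example sentence for fine-tuning.", return_tensors="pt")
batch["labels"] = batch["input_ids"].clone()

for epoch in range(2):
    for step in range(steps_per_epoch):
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        scheduler.step()  # stepped per batch so the restart lands on the epoch boundary
        optimizer.zero_grad()
```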

You can refer to TrainingArguments to see the defaults (link). They usually work well.
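If it helps, here is a rough sketch of leaning on those defaults with Trainer. The tiny in-memory dataset is only a stand-in for your own data, and the defaults noted in the comment (5e-5, linear decay, AdamW) are what recent transformers versions ship with, so double-check against your installed version:

```python
from datasets import Dataset
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Tiny in-memory dataset purely as a stand-in for real fine-tuning data.
texts = ["A short example sentence for fine-tuning."] * 16
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True),
    batched=True,
    remove_columns=["text"],
)

# learning_rate, lr_scheduler_type and the optimizer are deliberately left at
# their defaults (5e-5, linear decay, AdamW at the time of writing).
args = TrainingArguments(
    output_dir="gpt2-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```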