I know the best choice depends on the dataset you are fine-tuning on, but I am curious: what combinations of learning rate, LR scheduler, and optimiser have you found to work well in general? I am currently using AdamW with CosineAnnealingWarmRestarts, with the learning rate going from 0.002 down to 0.0001 and restarting at the end of each epoch.
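For reference, here is a minimal sketch of the schedule described above (0.002 down to 0.0001, restarting every epoch), computed with the standard cosine-annealing-with-warm-restarts formula rather than PyTorch's built-in scheduler; the function name and the 100-steps-per-epoch figure are just for illustration:

```python
import math

def cosine_annealing_warm_restarts(step, steps_per_cycle, lr_max=0.002, lr_min=0.0001):
    """Learning rate at `step` for a schedule that decays from lr_max to lr_min
    over each cycle of `steps_per_cycle` steps, then restarts at lr_max."""
    t_cur = step % steps_per_cycle  # position within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / steps_per_cycle))

# Example with a hypothetical 100 optimiser steps per epoch:
print(cosine_annealing_warm_restarts(0, 100))    # start of epoch: 0.002
print(cosine_annealing_warm_restarts(50, 100))   # midpoint: 0.00105
print(cosine_annealing_warm_restarts(100, 100))  # restart: back to 0.002
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts` with `T_0` set to the number of steps (or epochs, depending on how you call `scheduler.step`) in one cycle and `eta_min=0.0001`.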
You can refer to TrainingArguments to look at the defaults. Link. They usually work well.