Finding good batch size and learning rate for fine tuning

I’ve been playing around with fastai and huggingface transformers over the past few months (mostly fine-tuning multi-class classifiers on German texts), but I have always wondered how to find a good batch size and learning rate without running an extensive hyperparameter search.

Well, for experimentation and learning, I’ve run a lot of hyperparameter searches (>1000 trials) on the German gnad10 dataset on Colab to find the best-performing model. But I cannot do the same kind of extensive hyperparameter search on the larger dataset I have at work, mainly because it would take too long and cost too much.

Hence, I’m struggling to find a good approach for identifying a good batch size and learning rate within at most 10–20 trial runs, when working with a real-world dataset and limited resources.
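To make the question concrete, here is the kind of budget-limited search I have in mind, as a minimal sketch: `train_and_evaluate` is a hypothetical placeholder (a real version would fine-tune the model with the given settings and return a validation metric), and the grid values are just common starting points I've seen, not recommendations.

```python
import itertools

# Hypothetical stand-in for a real fine-tuning run; a real version would
# train with these settings and return e.g. validation accuracy.
def train_and_evaluate(batch_size, learning_rate):
    # Dummy score purely for illustration -- peaks at batch_size=32, lr=3e-5.
    return -abs(batch_size - 32) / 32 - abs(learning_rate - 3e-5) / 3e-5

# A small hand-picked grid keeps the budget fixed: here 3 x 4 = 12 trials.
batch_sizes = [16, 32, 64]
learning_rates = [1e-5, 2e-5, 3e-5, 5e-5]

best_score, best_config = float("-inf"), None
for bs, lr in itertools.product(batch_sizes, learning_rates):
    score = train_and_evaluate(bs, lr)
    if score > best_score:
        best_score, best_config = score, (bs, lr)

print(best_config)  # with the dummy objective above: (32, 3e-05)
```

What I'd like to know is whether there is something smarter than this kind of exhaustive small grid, e.g. a way to narrow the grid down beforehand.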

I would appreciate any suggestions for practical solutions and also references to related articles, like this one.

Here is what I have learned so far:

  • the choice of batch size and learning rate has the biggest impact on model performance
  • batch size and learning rate depend on the dataset, the architecture/model used, and the number of epochs
  • batch size and learning rate are interdependent (in my experiments, a bigger batch size needed a higher learning rate, and vice versa)
  • the default train batch size in huggingface’s TrainingArguments is 8, but this did not work well in my experiments. I had to increase it to 32 or 64, using gradient accumulation since such batch sizes did not fit into the Colab GPU memory.
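For reference, the gradient accumulation setup from the last point looks roughly like this. This is only a config sketch; the `output_dir` and the specific numbers are example values, not recommendations.

```python
from transformers import TrainingArguments

# per_device_train_batch_size and gradient_accumulation_steps multiply to give
# the effective batch size (8 * 4 = 32 here), so a batch size of 32 can be
# simulated on a Colab GPU that only fits 8 examples at once.
args = TrainingArguments(
    output_dir="out",                # example output directory
    per_device_train_batch_size=8,   # what actually fits in GPU memory
    gradient_accumulation_steps=4,   # accumulate 4 steps -> effective 32
    learning_rate=3e-5,              # example value, tuned alongside batch size
    num_train_epochs=3,
)
```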

I look forward to hearing about your experiences.