As far as I am aware, the common rule of thumb for selecting the batch size is "as big as your hardware can support". For example, the most recent leaks concerning GPT-4's training suggest that a staggering batch size of 60M was used. This makes me wonder how an engineer should balance the batch size and gradient accumulation steps hyperparameters. At what point do the potential drawbacks of increasing the number of gradient accumulation steps (e.g., longer wall-clock time per effective batch) outweigh the benefits of a large effective batch size? (I guess this particular question boils down to the clear throughput benefits of large physical batch sizes vs. the possible convergence benefits of large effective batch sizes.)
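For reference, the pattern I have in mind is the standard accumulation loop, something like the toy PyTorch sketch below (the model, data, and the `micro_batch_size`/`accumulation_steps` names are just placeholders I made up for illustration):

```python
import torch
import torch.nn as nn

# Toy setup purely to illustrate the mechanics: a tiny linear
# classifier on random data (placeholder model and data).
model = nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

micro_batch_size = 8      # what fits in device memory per forward/backward
accumulation_steps = 4    # effective batch size = 8 * 4 = 32
data = [(torch.randn(micro_batch_size, 10),
         torch.randint(0, 2, (micro_batch_size,)))
        for _ in range(16)]

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y)
    # Scale the loss so the accumulated gradient averages over the
    # effective batch instead of summing across micro-batches.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()      # one optimizer step per effective batch
        optimizer.zero_grad()
```

Gradient-wise this should match a single pass over a batch of 32, but it costs `accumulation_steps` sequential forward/backward passes per optimizer step, which is the trade-off I'm asking about.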