GPT-2 Training Speed Unchanged with Different Batch Size & Grad Accumulation

This might help.