Let’s say that I have the following training loop that uses accumulated gradients (taken from here). Let’s also say that I have a batch size of 4 and want to accumulate gradients for 10 steps, which gives me an “effective” batch size of 40.
```python
model.zero_grad()
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)
    loss = loss_function(predictions, labels)
    loss = loss / accumulation_steps
    loss.backward()
    if (i+1) % accumulation_steps == 0:
        optimizer.step()
        model.zero_grad()
    if (i+1) % evaluation_steps == 0:
        evaluate_model()
```
I was wondering what we do with gradient accumulation when we have leftover batches in our training loop.
For the above code, we might see this if we have (say) 57 batches in an epoch. This would lead to 5 optimizer steps (corresponding to the first 50 batches), but the last 7 batches would never trigger an optimizer step, so their accumulated gradients would effectively be ignored.
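To make the arithmetic concrete, here is a quick check using the numbers from my example (57 batches, 10 accumulation steps):

```python
num_batches = 57          # batches per epoch, from the example above
accumulation_steps = 10   # batches accumulated per optimizer step

# Quotient = full accumulation groups, remainder = leftover batches
full_steps, leftover = divmod(num_batches, accumulation_steps)
print(full_steps, leftover)  # 5 7
```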
I’m guessing the convention is just to ignore the leftover batches (particularly if you are shuffling the batches in each epoch), but perhaps it might be better to do a training step instead?
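For concreteness, here is a rough sketch of the "do a training step on the leftovers" option I have in mind. It is only one possible way to do it, not an established convention. The function name `train_one_epoch` is made up for illustration. I am assuming the batches object supports `len()` (as a list or a sized data loader would), and that the last, smaller group's loss should be divided by its actual size rather than by `accumulation_steps` so the gradient is still an average. I also use `optimizer.zero_grad()` instead of `model.zero_grad()` and leave out the evaluation call to keep it short.

```python
def train_one_epoch(model, optimizer, loss_function, batches, accumulation_steps):
    """Run one epoch, also stepping on a final partial accumulation group.

    Hypothetical helper; returns the number of optimizer steps taken.
    """
    num_batches = len(batches)  # assumption: batches is sized (list, DataLoader, ...)
    steps_taken = 0
    optimizer.zero_grad()
    for i, (inputs, labels) in enumerate(batches):
        # Size of the accumulation group this batch belongs to; only the
        # last group can be smaller than accumulation_steps.
        group_start = (i // accumulation_steps) * accumulation_steps
        group_size = min(accumulation_steps, num_batches - group_start)

        loss = loss_function(model(inputs), labels)
        # Scale by the real group size so the accumulated gradient is a mean.
        (loss / group_size).backward()

        is_group_end = (i + 1) % accumulation_steps == 0
        is_last_batch = (i + 1) == num_batches
        if is_group_end or is_last_batch:
            optimizer.step()
            optimizer.zero_grad()
            steps_taken += 1
    return steps_taken
```

With 57 batches and `accumulation_steps = 10`, this would take 6 optimizer steps: 5 over full groups of 10 batches and 1 over the remaining 7. The catch is that the last step is based on a smaller effective batch (28 samples instead of 40), so its gradient is noisier than the others.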
Thoughts appreciated.