Is there a standard way to handle leftover batches when using gradient accumulation?

Let’s say that I have the following training loop that uses accumulated gradients (taken from here). Let’s also say that I have a batch size of 4 and want to accumulate gradients for 10 steps, which gives me an “effective” batch size of 40.

model.zero_grad()                                   # reset gradients
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # forward pass
    loss = loss_function(predictions, labels)       # compute loss
    loss = loss / accumulation_steps                # normalize loss to account for accumulation
    loss.backward()                                 # backward pass; gradients accumulate
    if (i + 1) % accumulation_steps == 0:           # only step every accumulation_steps batches
        optimizer.step()                            # update weights
        model.zero_grad()                           # reset gradients
        if (i + 1) % evaluation_steps == 0:         # evaluate periodically
            evaluate_model()

I was wondering what to do with gradient accumulation when there are leftover batches in the training loop, i.e. when the number of batches is not a multiple of accumulation_steps.

For the above code, we might see this if we have (say) 57 batches in our training set. This would lead to 5 optimizer steps of gradient accumulation (corresponding to the first 50 batches), but the gradients from the last 7 batches would never be applied.

I’m guessing the convention is just to ignore the leftover batches (particularly if you are shuffling the batches in each epoch), but perhaps it would be better to do a final training step on the partially accumulated gradients instead?
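
For context, the alternative I had in mind looks roughly like this (just a sketch, reusing the names from the snippet above): flush whatever gradients have accumulated when the epoch ends, so the leftover batches still contribute to an update.

num_batches = len(training_set)                     # assumes training_set has a length (e.g. a DataLoader over a map-style dataset)
model.zero_grad()
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)
    loss = loss_function(predictions, labels)
    loss = loss / accumulation_steps                # note: the final partial group is still divided by the full accumulation_steps
    loss.backward()
    is_last_batch = (i + 1) == num_batches
    if (i + 1) % accumulation_steps == 0 or is_last_batch:
        optimizer.step()                            # step on a full group, or on the leftover partial group
        model.zero_grad()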

Thoughts appreciated.

You should use a step counter that runs over the whole training loop (across epochs) instead of the per-epoch counter, so that the leftover accumulation from epoch 0 is finished during epoch 1 (unless your dataset is pretty small, the probability of having the same sample twice in one effective batch is not very high).
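
If it helps, here is a rough sketch of what that could look like, reusing the names from the snippet above and assuming a hypothetical num_epochs. The key point is that global_step keeps counting across epochs, so a partial accumulation at the end of one epoch is simply completed by the first batches of the next:

global_step = 0                                     # counts batches across all epochs
model.zero_grad()
for epoch in range(num_epochs):
    for inputs, labels in training_set:
        predictions = model(inputs)
        loss = loss_function(predictions, labels)
        loss = loss / accumulation_steps
        loss.backward()
        global_step += 1
        if global_step % accumulation_steps == 0:   # a leftover group from epoch 0 is finished early in epoch 1
            optimizer.step()
            model.zero_grad()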