Is there a standard way to handle leftover batches when using gradient accumulation?

tomroth1001 · November 22, 2021, 3:40am

Let’s say that I have the following training loop that uses accumulated gradients (taken from here). Let’s also say that I have a batch size of 4 and want to accumulate gradients for 10 steps, which gives me an “effective” batch size of 40.

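model.zero_grad()                                   # start with clean gradients
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # forward pass
    loss = loss_function(predictions, labels)
    loss = loss / accumulation_steps                # scale so the accumulated gradient is an average
    loss.backward()                                 # gradients add up across batches
    if (i+1) % accumulation_steps == 0:             # a full accumulation window is done
        optimizer.step()
        model.zero_grad()
        if (i+1) % evaluation_steps == 0:
            evaluate_model()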

I was wondering what we do with gradient accumulation when we have leftover batches in our training loop.

For the above code, we might see this if we have (say) 57 batches in an epoch. This would lead to 5 successful optimizer steps of gradient accumulation (corresponding to the first 50 batches), but the gradients from the last 7 batches would never be applied by an optimizer step.

I’m guessing the convention is just to ignore the leftover batches (particularly if you are shuffling the batches in each epoch), but perhaps it might be better to do a training step instead?
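To make the two options concrete, here is a minimal sketch in plain Python that only tracks when an optimizer step would fire. The function name `optimizer_step_points` and the `flush_leftover` flag are made up for illustration, and the comments mark where the real forward/backward and optimizer calls would go.

```python
def optimizer_step_points(num_batches, accumulation_steps, flush_leftover=True):
    """Return the 1-based batch indices after which optimizer.step() would run.

    Hypothetical helper for illustration only; it simulates the counting
    logic of the loop above without a model or optimizer.
    """
    step_points = []
    for i in range(num_batches):
        # The forward pass, loss / accumulation_steps, and loss.backward()
        # would go here.
        is_full_window = (i + 1) % accumulation_steps == 0
        is_last_batch = (i + 1) == num_batches
        if is_full_window or (flush_leftover and is_last_batch):
            # optimizer.step() and model.zero_grad() would go here.
            step_points.append(i + 1)
    return step_points


print(optimizer_step_points(57, 10, flush_leftover=False))  # [10, 20, 30, 40, 50]
print(optimizer_step_points(57, 10, flush_leftover=True))   # [10, 20, 30, 40, 50, 57]
```

One thing to keep in mind with the extra step: each loss is still divided by accumulation_steps (10), so the gradient from the 7 leftover batches is scaled by 7/10 relative to a full window unless the divisor is adjusted for that last step.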

Thoughts appreciated.

sgugger · November 22, 2021, 12:30pm

You should use a step counter that runs over the whole training loop instead of the per-epoch batch counter, so that the accumulation window left unfinished at the end of epoch 0 is completed during epoch 1 (unless your dataset is pretty small, the probability of having the same samples twice in one window is not very high).
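A rough sketch of that idea, again in plain Python so it only tracks the counting: the helper name `global_step_points` is hypothetical, and the comments mark where the real training calls would sit. The key point is that `global_step` is not reset between epochs.

```python
def global_step_points(num_epochs, batches_per_epoch, accumulation_steps):
    """Return (epoch, 1-based batch index) pairs where optimizer.step() would
    run, plus the number of batches still pending at the very end of training.

    Hypothetical helper for illustration only.
    """
    step_points = []
    global_step = 0  # counts batches across all epochs, never reset
    for epoch in range(num_epochs):
        for i in range(batches_per_epoch):
            # The forward pass, loss / accumulation_steps, and loss.backward()
            # would go here.
            global_step += 1
            if global_step % accumulation_steps == 0:
                # optimizer.step() and model.zero_grad() would go here.
                step_points.append((epoch, i + 1))
    leftover = global_step % accumulation_steps
    return step_points, leftover


points, leftover = global_step_points(num_epochs=2, batches_per_epoch=57,
                                      accumulation_steps=10)
print(points[5])   # (1, 3): the window started in epoch 0 closes at batch 3 of epoch 1
print(leftover)    # 4 batches remain unapplied after the final epoch
```

With 57 batches per epoch and 10 accumulation steps, the 7 leftover batches of epoch 0 are combined with the first 3 batches of epoch 1. Only the tail of the very last epoch is left over, which can either be dropped or flushed with one extra optimizer step.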
