Hello,
I am using the transformers Seq2SeqTrainer class to train a seq2seq model.
I am using the following parameters in the Seq2SeqTrainingArguments
class, amongst others:
- gradient_accumulation_steps: 8
- per_device_train_batch_size: 2
- dataloader_drop_last: False
I am also using 2 GPUs for the training.
Therefore the total effective batch size is 8 * 2 * 2 = 32.
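For reference, this is roughly how the relevant arguments are set (output_dir and anything not listed above are placeholders, not my real values):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="out",                  # placeholder
    per_device_train_batch_size=2,     # batch size per GPU
    gradient_accumulation_steps=8,
    dataloader_drop_last=False,
    num_train_epochs=1,
    # ... other arguments omitted
)
```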
I have noticed the following:
- if I train the model on a dataset of 32 samples, the trainer performs 1 total optimization step per epoch (as expected);
- if I train on a dataset of 64 samples, it performs 2 total optimization steps (as expected);
- however, if I train on a dataset of 60 samples (or any number between 33 and 63), it performs only 1 optimization step, whereas I would expect 2, since one optimization step can only process 32 samples and I set dataloader_drop_last=False.
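To spell out the arithmetic behind my expectation (my own calculation, not library code):

```python
import math

num_samples = 60
effective_batch = 2 * 2 * 8  # per-device batch * GPUs * gradient accumulation = 32

expected_steps = math.ceil(num_samples / effective_batch)  # 2 (last partial batch kept, drop_last=False)
reported_steps = num_samples // effective_batch            # 1, which is what the trainer reports
```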
See an extract of the console printout below:
***** Running training *****
Num examples = 60
Num Epochs = 1
Instantaneous batch size per device = 2
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 8
Total optimization steps = 1
Can anyone shed light on why this is happening? Is this expected, or am I doing something wrong?
Happy to provide a short reproducible code example if required.