I am using the transformers library's
Seq2SeqTrainer class to train a seq2seq model.
I am using the following parameters in the
Seq2SeqTrainingArguments class, amongst others:
- gradient_accumulation_steps: 8
- per_device_train_batch_size: 2
- dataloader_drop_last: False
I am also using 2 GPUs for the training.
Therefore the total effective batch size is 8 (accumulation steps) × 2 (per-device batch size) × 2 (GPUs) = 32.
I have noticed the following: if I train with a dataset of 32 samples, the model performs 1 optimization step per epoch (as expected), and with a dataset of 64 samples it performs 2 optimization steps (as expected). However, with a dataset of 60 samples (or any number between 33 and 63), it performs only 1 optimization step, whereas I would expect 2, since one optimization step can only cover 32 samples and I set dataloader_drop_last to False, so the final partial batch should not be discarded.
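The counts I observe match plain floor division of the dataset size by the effective batch size, while with dataloader_drop_last=False I expected rounding up. A small sketch of the arithmetic (the constants mirror my setup):

```python
import math

# Constants mirroring the setup above.
PER_DEVICE_BATCH = 2   # per_device_train_batch_size
NUM_GPUS = 2
GRAD_ACCUM = 8         # gradient_accumulation_steps
EFFECTIVE_BATCH = PER_DEVICE_BATCH * NUM_GPUS * GRAD_ACCUM  # 32

def steps_floor(num_samples: int) -> int:
    # Matches what I observe: the partial final
    # accumulation window is dropped.
    return num_samples // EFFECTIVE_BATCH

def steps_ceil(num_samples: int) -> int:
    # What I expected with dataloader_drop_last=False:
    # the leftover samples still trigger a final optimizer step.
    return math.ceil(num_samples / EFFECTIVE_BATCH)

for n in (32, 33, 60, 63, 64):
    print(n, steps_floor(n), steps_ceil(n))
```

For every dataset size from 33 to 63 the two functions disagree (1 vs. 2 steps), which is exactly the range where the Trainer's reported step count surprised me.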
See the extracted console printout below:
***** Running training *****
  Num examples = 60
  Num Epochs = 1
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 8
  Total optimization steps = 1
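For reference, the relevant arguments are set roughly like this (a sketch, not my full script; output_dir is a placeholder):

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the argument setup described above;
# output_dir is a placeholder path.
args = Seq2SeqTrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    dataloader_drop_last=False,
    num_train_epochs=1,
)
```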
Can anyone shed light on why this is happening? Is this expected behaviour, or am I doing something wrong?
Happy to provide a short reproducible code example if required.