`Trainer` seems to drop the last incomplete batch even if the `DataLoader` is set with `drop_last=False`

I am using the transformers `Seq2SeqTrainer` class to train a seq2seq model.
I am passing the following parameters to the `Seq2SeqTrainingArguments` class, amongst others:

  • `gradient_accumulation_steps`: 8
  • `per_device_train_batch_size`: 2
  • `dataloader_drop_last`: False

I am also using 2 GPUs for the training.
Therefore the total effective batch size is 8 (accumulation steps) × 2 (batch size per device) × 2 (GPUs) = 32.
I have noticed the following. If I train the model on a dataset of 32 samples, it performs 1 optimization step per epoch (as expected), and on a dataset of 64 samples it performs 2 optimization steps (as expected). However, on a dataset of 60 samples (or any number between 33 and 63), it performs only 1 optimization step. I would expect 2 optimization steps here, since one optimization step can only process 32 samples and I set `drop_last=False`.
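For reference, here is the arithmetic behind my expectation versus what the log reports. The floor-division formula is my guess at how the step count is being derived, inferred from the printout rather than from the Trainer source:

```python
import math

num_samples = 60
num_gpus = 2
per_device_batch_size = 2
grad_accum_steps = 8

# Batches yielded by each per-device dataloader; with drop_last=False
# the final partial batch is kept, hence the ceil.
batches_per_device = math.ceil(num_samples / num_gpus / per_device_batch_size)  # 15

# What I expected: a partial accumulation window still counts as one step.
expected_steps = math.ceil(batches_per_device / grad_accum_steps)  # 2

# What the log reports, consistent with flooring the partial window away.
observed_steps = batches_per_device // grad_accum_steps  # 1

print(expected_steps, observed_steps)
```

So the behaviour I see matches the floor-division interpretation, even though `drop_last=False` keeps the tail batches in the dataloader itself.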

See the console printout extract below:

***** Running training *****
  Num examples = 60
  Num Epochs = 1
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 8
  Total optimization steps = 1

Can anyone shed light on why this is happening? Is this expected behaviour, or am I doing something wrong?

Happy to provide a short reproducible code example if required.