Trainer fails to resume training from a checkpoint, claiming there are not enough samples in the dataset

I have successfully trained a Whisper model using Seq2SeqTrainer for 40k steps, and I have the checkpoints that were saved during that run. Now I want to resume training from a checkpoint. It seems this is the way to do so:


To clarify, I created a new Seq2SeqTrainer object and passed it a new Seq2SeqTrainingArguments with max_steps=50000 (previously 40000). All other arguments are unchanged from the first run.

One last thing to mention: I’m using an IterableDataset with 28024 samples. Again, this is exactly the same dataset used for the first run.

When I try to resume training from the checkpoint by executing the above command, it runs for about half an hour without using my GPU or showing a progress bar, then finishes without any errors. It just reports:

There seems to be not a single sample in your epoch_iterator, stopping
training at step 40000! This is expected if you're using an IterableDataset
and set num_steps (50000) higher than the number of available samples.

Can someone please help me understand what’s going on here? Why was the same dataset of 28024 samples good enough to train for 40k steps but not for 50k steps?

I’m not sure whether there’s a better solution to this problem, but at least this lets you start the training without an error.

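To continue training when your dataset is an IterableDataset, use ignore_data_skip=True. With this setting the Trainer does not fast-forward through the batches it has already seen, so training starts quickly but the data order will differ from an uninterrupted run: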

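from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="output",  # placeholder; reuse the arguments from your first run
    ignore_data_skip=True,  # on resume, do not skip already-seen batches
)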

If you ask me, the fact that resuming training fails like this when the dataset is an IterableDataset is a bug.
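
My best guess at why this happens, as a plain-Python sketch rather than the actual Trainer code: I am assuming that on resume the Trainer discards one batch per completed step within a single pass over the data, and that gradient accumulation is 1. The helper names and the batch size of 16 are hypothetical. Under those assumptions, a single pass over 28024 samples runs out long before 40000 batches have been skipped, which would match both the long wait without GPU use and the "not a single sample" message.

```python
from itertools import islice


def batches(num_samples, batch_size):
    """Yield batches from a one-pass stream, like one pass over an IterableDataset."""
    batch = []
    for sample in range(num_samples):
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller batch
        yield batch


def remaining_after_skip(num_samples, batch_size, steps_done, ignore_data_skip=False):
    """Count the batches left in one pass after the resume-time skip."""
    stream = batches(num_samples, batch_size)
    if not ignore_data_skip:
        # Assumed behaviour: consume and discard one batch per completed step.
        stream = islice(stream, steps_done, None)
    return sum(1 for _ in stream)


NUM_SAMPLES = 28024  # dataset size from the question
BATCH_SIZE = 16      # hypothetical; the question does not state it
STEPS_DONE = 40000   # step stored in the checkpoint

print(remaining_after_skip(NUM_SAMPLES, BATCH_SIZE, STEPS_DONE))
print(remaining_after_skip(NUM_SAMPLES, BATCH_SIZE, STEPS_DONE, ignore_data_skip=True))
```

In this model the result does not depend on the batch size I picked: even with a batch size of 1 there are only 28024 batches in one pass, which is fewer than the 40000 to skip.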