I have successfully trained a Whisper model with Seq2SeqTrainer for 40k steps, and I have the checkpoints that were saved during training. Now I want to resume training from the latest checkpoint. It seems this is the way to do so:
```python
trainer.train(resume_from_checkpoint=True)
```
To clarify, I instantiated a new Seq2SeqTrainer object, passing it a new Seq2SeqTrainingArguments with max_steps=50000 (previously it was 40000). Other than that, all the arguments are left untouched compared to the first run.
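For reference, here is roughly how I build the new trainer; model, train_dataset, and data_collator are placeholders for the same objects I used in the first run:

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Identical to the first run except max_steps, which I raised from 40000 to 50000
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetuned",  # same output_dir, so the old checkpoints are found
    max_steps=50000,                   # previously 40000
    # ... every other argument unchanged from the first run
)

trainer = Seq2SeqTrainer(
    model=model,                  # placeholder: the Whisper model, loaded as before
    args=training_args,
    train_dataset=train_dataset,  # placeholder: the same IterableDataset as before
    data_collator=data_collator,  # placeholder: the same collator as before
)
```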
One last thing to mention: I'm using an IterableDataset of 28024 samples. But again, this is the exact same dataset used for the first run.
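In case it matters, the dataset is loaded roughly like this (the dataset name and split are placeholders for my actual data), and the 28024 figure comes from iterating over the stream once:

```python
from datasets import load_dataset

# Streaming load returns an IterableDataset (placeholder dataset name)
train_dataset = load_dataset("my-org/my-asr-corpus", split="train", streaming=True)

# One full pass over the stream yields 28024 samples
print(sum(1 for _ in train_dataset))  # 28024
```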
When I try to resume training from the checkpoint by executing the command above, it runs for about half an hour, makes no use of my GPU, shows no progress bar, and then finishes without any errors. It only reports:
```
There seems to be not a single sample in your epoch_iterator, stopping
training at step 40000! This is expected if you're using an IterableDataset
and set num_steps (50000) higher than the number of available samples.
```
Can someone please help me understand what's going on here? Why was the same dataset of 28024 samples good enough to train for 40k steps but not for 50k? With a batch size of at least 1, a single pass over 28024 samples can cover at most 28024 steps, so the first run must already have cycled through the dataset multiple times; I don't see why resuming can't keep doing that up to 50k steps.