I have successfully trained a Whisper model with Seq2SeqTrainer for 40k steps, and I have the checkpoints that were saved during training. Now I want to resume training from the latest checkpoint. It seems this is the way to do so:
```python
trainer.train(resume_from_checkpoint=True)
```
To clarify, I instantiated a new Seq2SeqTrainer object, passing it a new Seq2SeqTrainingArguments with max_steps=50000 (previously it was 40000). Other than that, all the arguments are left untouched compared to the first run.
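For reference, here is roughly how I build the new trainer; model, train_dataset, and data_collator are placeholders for the same objects I used in the first run:

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Identical to the first run except max_steps, which I raised from 40000 to 50000
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetuned",  # same output_dir, so the old checkpoints are found
    max_steps=50000,                   # previously 40000
    # ... every other argument unchanged from the first run
)

trainer = Seq2SeqTrainer(
    model=model,                  # placeholder: the Whisper model, loaded as before
    args=training_args,
    train_dataset=train_dataset,  # placeholder: the same IterableDataset as before
    data_collator=data_collator,  # placeholder: the same collator as before
)
```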
One last thing to mention: I'm using an IterableDataset of 28024 samples. But again, this is the exact same dataset used for the first run.
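In case it matters, the dataset is loaded roughly like this (the dataset name and split are placeholders for my actual data), and the 28024 figure comes from iterating over the stream once:

```python
from datasets import load_dataset

# Streaming load returns an IterableDataset (placeholder dataset name)
train_dataset = load_dataset("my-org/my-asr-corpus", split="train", streaming=True)

# One full pass over the stream yields 28024 samples
print(sum(1 for _ in train_dataset))  # 28024
```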
When I try to resume training from the checkpoint by executing the command above, it runs for about half an hour, makes no use of my GPU, shows no progress bar, and then finishes without any errors. It only reports:
```
There seems to be not a single sample in your epoch_iterator, stopping
training at step 40000! This is expected if you're using an IterableDataset
and set num_steps (50000) higher than the number of available samples.
```
Can someone please help me understand what's going on here? Why was the same dataset of 28024 samples good enough to train for 40k steps but not for 50k? With a batch size of at least 1, a single pass over 28024 samples can cover at most 28024 steps, so the first run must already have cycled through the dataset multiple times; I don't see why resuming can't keep doing that up to 50k steps.