Setting max_steps with IterableDataset still errors

urjit · April 18, 2023, 7:42am

Hi, I can’t figure out how to train using an IterableDataset. I keep running errors with the train step. If I don’t set the max_steps training arg, I get an error claiming that the dataset doesn’t have a length. On the other hand, with that set to max_steps=150_000 I see this error.

Any pointers?

There seems to be not a single sample in your epoch_iterator, stopping training at step 0! This is expected if you're using an IterableDataset and set num_steps (1500000) higher than the number of available samples.
{'eval_runtime': 3.9334, 'eval_samples_per_second': 0.0, 'eval_steps_per_second': 0.0, 'epoch': 0}
  0%|                                                                                                                                                             | 0/1500000 [00:07<?, ?it/s]Traceback (most recent call last):
  File "/home/ubuntu/app/urjit/python/./run.py", line 33, in <module>
    ModelRunner()
  File "/home/ubuntu/app/urjit/python/./run.py", line 23, in __init__
    getattr(self, f"model_{args.model}")()
  File "/home/ubuntu/app/urjit/python/./run.py", line 29, in model_dialogpt
    dialogpt.main_dialogpt(args)
  File "/home/ubuntu/app/urjit/python/models/dialogpt.py", line 269, in main_dialogpt
    __dry_run__(args.input, args.train, args.batchsize,
  File "/home/ubuntu/app/urjit/python/models/dialogpt.py", line 201, in __dry_run__
    setup_and_train(tokenizer=tokenizer,
  File "/home/ubuntu/app/urjit/python/models/dialogpt.py", line 100, in setup_and_train
    train_loop(model=model, training_args=training_args,
  File "/home/ubuntu/app/urjit/python/models/dialogpt.py", line 89, in train_loop
    trainer.train()
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/transformers/trainer.py", line 2055, in _inner_training_loop
    train_loss = self._total_loss_scalar / self.state.global_step
ZeroDivisionError: float division by zero

tanthinhdt · July 31, 2023, 9:03am

Hi, I am also facing the same problem as yours

    train_loss = self._total_loss_scalar / self.state.global_step
ZeroDivisionError: float division by zero

Have you fix this error yet? If you’ve already fixed it, can you help me out please?

shwetha-97 · August 23, 2023, 11:30pm

I am also facing the exact same error and there aren’t many resources on the internet that I am finding for this, could you please share what you did to solve it? I would really appreciate the help!

Tekrondo · August 28, 2023, 9:49am

I encountered the same issue. Previously, I set per_device_train_batch_size=1 in TrainingArguments, but I found that if I comment out this line of code, the problem is resolved, and the trainer runs normally. Does anyone have any idea why this is happening?

arron666 · November 10, 2023, 3:58am

I also encountered this error. The reason for the error was that I should load data in jsonl format, but I loaded it in json format, which resulted in no data, and then this error occurred.

Topic		Replies	Views
There seems to be not a single sample in your epoch_iterator, stopping training at step 0! This is expected if you're using an IterableDataset and set num_steps (5000000) higher than the number of available samples Beginners	2	1675	April 19, 2023
Using an IterableDataset for 1+ epochs in Trainer Beginners	3	140	January 2, 2025
Error training with iterabledatasets Beginners	1	634	July 22, 2022
TrainingArguments class - max_steps formula when using streaming dataset 🤗Transformers	1	3665	September 14, 2023
Trainer fails to resume training from a checkpoint, claiming there's not enough samples in the dataset Intermediate	1	1612	May 29, 2023

Setting max_steps with IterableDataset still errors

Related topics