Setting max_steps with IterableDataset still errors

Hi, I can’t figure out how to train using an IterableDataset. I keep running into errors at the train step. If I don’t set the max_steps training argument, I get an error saying that the dataset doesn’t have a length. On the other hand, with max_steps=1_500_000 set, I see the error below.
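For context, here is a minimal sketch of the kind of setup I’m running (the model name and the streaming dataset are placeholders, not my actual code):

    import torch
    from torch.utils.data import IterableDataset
    from transformers import (AutoModelForCausalLM, Trainer,
                              TrainingArguments)

    # Placeholder stream standing in for my real data pipeline.
    class TokenStream(IterableDataset):
        def __iter__(self):
            for _ in range(10_000):
                ids = torch.randint(0, 50_257, (128,))
                yield {"input_ids": ids, "labels": ids.clone()}

    model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

    training_args = TrainingArguments(
        output_dir="out",
        max_steps=1_500_000,  # needed because an IterableDataset has no __len__
        per_device_train_batch_size=1,
    )

    trainer = Trainer(model=model, args=training_args,
                      train_dataset=TokenStream())
    trainer.train()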

Any pointers?

There seems to be not a single sample in your epoch_iterator, stopping training at step 0! This is expected if you're using an IterableDataset and set num_steps (1500000) higher than the number of available samples.
{'eval_runtime': 3.9334, 'eval_samples_per_second': 0.0, 'eval_steps_per_second': 0.0, 'epoch': 0}
  0%|          | 0/1500000 [00:07<?, ?it/s]
Traceback (most recent call last):
  File "/home/ubuntu/app/urjit/python/./run.py", line 33, in <module>
    ModelRunner()
  File "/home/ubuntu/app/urjit/python/./run.py", line 23, in __init__
    getattr(self, f"model_{args.model}")()
  File "/home/ubuntu/app/urjit/python/./run.py", line 29, in model_dialogpt
    dialogpt.main_dialogpt(args)
  File "/home/ubuntu/app/urjit/python/models/dialogpt.py", line 269, in main_dialogpt
    __dry_run__(args.input, args.train, args.batchsize,
  File "/home/ubuntu/app/urjit/python/models/dialogpt.py", line 201, in __dry_run__
    setup_and_train(tokenizer=tokenizer,
  File "/home/ubuntu/app/urjit/python/models/dialogpt.py", line 100, in setup_and_train
    train_loop(model=model, training_args=training_args,
  File "/home/ubuntu/app/urjit/python/models/dialogpt.py", line 89, in train_loop
    trainer.train()
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/transformers/trainer.py", line 2055, in _inner_training_loop
    train_loss = self._total_loss_scalar / self.state.global_step
ZeroDivisionError: float division by zero
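
As far as I can tell from the traceback, the ZeroDivisionError is just a symptom: the epoch iterator never yields a batch, so self.state.global_step is still 0 when the Trainer computes the average loss. A quick sanity check, independent of the Trainer, is to pull one sample by hand (train_dataset here is whatever gets passed to the Trainer):

    it = iter(train_dataset)
    try:
        print("first sample:", next(it))
    except StopIteration:
        print("dataset yielded nothing -- this is why global_step stays at 0")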

Hi, I am also facing the same problem:

    train_loss = self._total_loss_scalar / self.state.global_step
ZeroDivisionError: float division by zero

Have you fixed this error yet? If you have, could you help me out, please?

I am also facing the exact same error, and there aren’t many resources online about it. Could you please share what you did to solve it? I would really appreciate the help!

I encountered the same issue. I had set per_device_train_batch_size=1 in TrainingArguments, but I found that if I comment out that line, the problem goes away and the trainer runs normally. Does anyone have any idea why this happens?
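Concretely, the change was just this (a sketch; output_dir and max_steps are placeholder values):

    from transformers import TrainingArguments

    # Before: training stopped at step 0 and crashed with the ZeroDivisionError.
    args = TrainingArguments(output_dir="out", max_steps=1_500_000,
                             per_device_train_batch_size=1)

    # After: with that argument removed, the default batch size of 8 is used
    # and training runs normally.
    args = TrainingArguments(output_dir="out", max_steps=1_500_000)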

I also encountered this error. In my case, the cause was that my data was in jsonl format but I loaded it as json, which produced an empty dataset, and the empty dataset then triggered this error.
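
In case it helps, a minimal sketch of the check (assuming the datasets library; the file name is a placeholder): load the file and confirm it actually yields records before handing it to the Trainer.

    from datasets import load_dataset

    # The "json" builder handles both .json and .jsonl files; what matters is
    # that the file contents match the actual format (one object per line for jsonl).
    ds = load_dataset("json", data_files="train.jsonl", split="train",
                      streaming=True)

    try:
        print(next(iter(ds)))  # prints the first record if parsing worked
    except StopIteration:
        print("no records parsed -- check the file format")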