Hi, I can’t figure out how to train using an IterableDataset. I keep running into errors during the train step. If I don’t set the max_steps
training argument, I get an error saying the dataset has no length. On the other hand, with max_steps=150_000 set,
I see the error below.
Any pointers?
There seems to be not a single sample in your epoch_iterator, stopping training at step 0! This is expected if you're using an IterableDataset and set num_steps (1500000) higher than the number of available samples.
{'eval_runtime': 3.9334, 'eval_samples_per_second': 0.0, 'eval_steps_per_second': 0.0, 'epoch': 0}
0%| | 0/1500000 [00:07<?, ?it/s]Traceback (most recent call last):
File "/home/ubuntu/app/urjit/python/./run.py", line 33, in <module>
ModelRunner()
File "/home/ubuntu/app/urjit/python/./run.py", line 23, in __init__
getattr(self, f"model_{args.model}")()
File "/home/ubuntu/app/urjit/python/./run.py", line 29, in model_dialogpt
dialogpt.main_dialogpt(args)
File "/home/ubuntu/app/urjit/python/models/dialogpt.py", line 269, in main_dialogpt
__dry_run__(args.input, args.train, args.batchsize,
File "/home/ubuntu/app/urjit/python/models/dialogpt.py", line 201, in __dry_run__
setup_and_train(tokenizer=tokenizer,
File "/home/ubuntu/app/urjit/python/models/dialogpt.py", line 100, in setup_and_train
train_loop(model=model, training_args=training_args,
File "/home/ubuntu/app/urjit/python/models/dialogpt.py", line 89, in train_loop
trainer.train()
File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
return inner_training_loop(
File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/transformers/trainer.py", line 2055, in _inner_training_loop
train_loss = self._total_loss_scalar / self.state.global_step
ZeroDivisionError: float division by zero
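For context, the ZeroDivisionError looks like a symptom rather than the root problem: the warning above it says the epoch iterator produced zero samples, so global_step stays at 0 and the average-loss division fails. A quick sanity check I’ve been trying (a minimal sketch, not my real dataset — StreamDataset and the sample records are made up for illustration) is to confirm the IterableDataset actually yields something before handing it to the Trainer:

```python
from torch.utils.data import IterableDataset


class StreamDataset(IterableDataset):
    """Minimal iterable dataset for illustration.

    __iter__ must actually yield samples; if it is exhausted or empty,
    Trainer stops at step 0 with exactly the warning shown above.
    """

    def __init__(self, records):
        self.records = records

    def __iter__(self):
        # A generator that is consumed once (e.g. a file handle read
        # during preprocessing) would yield nothing here on a second pass.
        for rec in self.records:
            yield rec


ds = StreamDataset([{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5, 6]}])

# Sanity check before trainer.train(): the iterator must be non-empty.
first = next(iter(ds), None)
assert first is not None, "dataset yields no samples -- Trainer would stop at step 0"
print(first)
```

In my case I suspect the underlying iterator is being exhausted before training starts, but I’m not sure where.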