Hello, I'm using the Trainer API, and I got this error:
```
***** Running training *****
  Num examples = 80000000
  Num Epochs = 9223372036854775807
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 5000000
  0%|          | 0/5000000 [00:00<?, ?it/s]
There seems to be not a single sample in your epoch_iterator, stopping training at step 0! This is expected if you're using an IterableDataset and set num_steps (5000000) higher than the number of available samples.
```
My TrainingArguments looks like this:
```python
TrainingArguments(
    output_dir="./tmp",
    overwrite_output_dir=True,
    local_rank=args.local_rank,
    learning_rate=0.00025,
    per_device_train_batch_size=8,  # batch size for training
    per_device_eval_batch_size=8,
    save_steps=10000,               # model is saved every 10000 steps
    warmup_steps=2000,              # number of warmup steps for learning rate scheduler
    max_steps=5000000,
    fp16=False,
    fp16_opt_level="O1",            # the Apex opt level is the letter O, not the digit zero
    sharded_ddp="zero_dp_3 auto_wrap",
    dataloader_num_workers=8,
)
```
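For context, this is roughly how everything is wired together (simplified; model and dataset construction are omitted, and the variable names here are just illustrative):

```python
from transformers import Trainer

trainer = Trainer(
    model=model,                  # my model, built elsewhere
    args=training_args,           # the TrainingArguments shown above
    train_dataset=train_dataset,  # the IterableDataset sketched below
)
trainer.train()
```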
The error message suggests that max_steps is higher than the number of available samples, but my train dataset is an IterableDataset and I never defined __len__() on it. (I also notice Num Epochs in the log is 9223372036854775807, i.e. sys.maxsize, which I assume is a placeholder when the dataset length is unknown.) Where does the Trainer get the number of samples from? Does anyone have an idea?
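For reference, my dataset is shaped roughly like the sketch below (illustrative names and a toy file format, not my exact code; the real one streams from much larger files). The point is that it defines __iter__ but no __len__:

```python
from torch.utils.data import IterableDataset

class MyStreamDataset(IterableDataset):
    """Simplified sketch of my setup; note there is no __len__ anywhere."""

    def __init__(self, file_path):
        self.file_path = file_path

    def __iter__(self):
        # Stream examples one at a time instead of loading
        # the whole file into memory.
        with open(self.file_path) as f:
            for line in f:
                yield {"text": line.strip()}
```

Since there is no __len__, I assumed the Trainer would simply keep drawing samples until max_steps, so I don't understand how it decided there were zero samples at step 0. Thanks.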