Hello, I’m using the Trainer API, and I got this error:
***** Running training *****
Num examples = 80000000
Num Epochs = 9223372036854775807
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 5000000
0%| | 0/5000000 [00:00<?, ?it/s]
There seems to be not a single sample in your epoch_iterator, stopping training at step 0!
This is expected if you're using an IterableDataset and set num_steps (5000000) higher than the number of available samples.
My TrainingArguments are set up like this:
training_args = TrainingArguments(
    output_dir="./tmp",
    overwrite_output_dir=True,
    local_rank=args.local_rank,
    learning_rate=0.00025,
    per_device_train_batch_size=8,   # batch size per device for training
    per_device_eval_batch_size=8,    # batch size per device for evaluation
    save_steps=10000,                # save a checkpoint every 10,000 steps
    warmup_steps=2000,               # warmup steps for the learning rate scheduler
    max_steps=5000000,
    fp16=False,
    fp16_opt_level='O1',             # letter O, not the digit zero
    sharded_ddp='zero_dp_3 auto_wrap',
    dataloader_num_workers=8,
)
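The Trainer itself is constructed in the usual way; here is a minimal sketch, where model and train_dataset are placeholders for my actual model and streaming dataset:

from transformers import Trainer

# Sketch only: `model` and `train_dataset` stand in for my real
# objects, which are created elsewhere in the script.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # an IterableDataset, see the sketch below
)
trainer.train()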
The error says that there are too many steps, but my IterableDataset doesn’t define __len__(). I don’t know where the Trainer gets the number of samples from. Does anyone have an idea? Thanks.
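For context, my dataset is roughly shaped like the sketch below. This is a simplified stand-in: the real __iter__ streams tokenized examples from disk, and __len__ is deliberately not defined.

import torch
from torch.utils.data import IterableDataset

class MyStreamingDataset(IterableDataset):
    """Minimal sketch of the kind of dataset I'm using.
    The real __iter__ streams tokenized examples from disk;
    note that __len__ is intentionally not implemented."""

    def __iter__(self):
        # Placeholder loop yielding dummy encoded examples.
        for _ in range(100):
            yield {
                "input_ids": torch.zeros(128, dtype=torch.long),
                "attention_mask": torch.ones(128, dtype=torch.long),
                "labels": torch.zeros(128, dtype=torch.long),
            }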