Hi all, quick question: I'm running into issues when I try to resume from a checkpoint while using an IterableDataset. Training reaches the first accelerator.backward call, then fails with an NCCL timeout like
[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=559, OpType=REDUCE, Timeout(ms)=1800000) ran for 1800116 milliseconds before timing out.
I even added

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Double the default 30-minute NCCL timeout
timeout = InitProcessGroupKwargs(timeout=timedelta(seconds=1800 * 2))
accelerator = Accelerator(
    log_with="wandb",
    kwargs_handlers=[timeout],
)
```
and set `NCCL_ASYNC_ERROR_HANDLING=1`, but I still see the timeout fire after 1800 seconds, i.e. the default 30-minute limit rather than my doubled one.
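One thing I noticed (just a plain-Python sanity check, nothing Accelerate-specific): the watchdog message above reports Timeout(ms)=1800000, which matches the default, not the doubled value I pass through the kwargs handler:

```python
from datetime import timedelta

# What I pass to InitProcessGroupKwargs, converted to milliseconds
configured_ms = int(timedelta(seconds=1800 * 2).total_seconds() * 1000)

# What the NCCL watchdog actually reports in the error above
watchdog_ms = 1_800_000

print(configured_ms)                 # 3600000
print(configured_ms == watchdog_ms)  # False, so the longer timeout isn't in effect
```

So it looks like the longer timeout never reaches the NCCL process group.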
Roughly, this is how I load the checkpoint and then skip the already-consumed batches:
```python
if config["train_args"]["resume_from_checkpoint"]:
    # Load the DeepSpeed checkpoint from the specified path
    accelerator.print(f"Resumed from checkpoint: {config['train_args']['resume_from_checkpoint']}")
    accelerator.load_state(config["train_args"]["resume_from_checkpoint"])
    # Recover the step count from a checkpoint name like "step_<N>"
    path = os.path.basename(config["train_args"]["resume_from_checkpoint"])
    training_difference = os.path.splitext(path)[0]
    resume_step = int(training_difference.replace("step_", ""))
else:
    resume_step = -1

accelerator.wait_for_everyone()
progress_bar = tqdm(range(max_steps), disable=not accelerator.is_local_main_process)

if config["train_args"]["resume_from_checkpoint"] and resume_step is not None:
    # We need to skip steps until we reach the resumed step
    train_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step)
    total_steps += resume_step
    progress_bar.update(resume_step)
    accelerator.print(f"Resuming training from step {resume_step}")

torch.distributed.barrier()
accelerator.print(f"Resumed training on rank {accelerator.state.process_index}")

for batch in train_dataloader:
    loss = model(**batch)
    accelerator.backward(loss)
```
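For context, my mental model of what skip_first_batches has to do with a streaming IterableDataset is below — a plain-Python sketch of my own, not the library's actual implementation. Each rank still has to produce and discard every skipped batch, which is why I suspect the skipping itself can take long enough on some ranks to trip the NCCL timeout:

```python
from itertools import islice

def skip_first_batches_sketch(dataloader, num_batches):
    """My approximation of skipping on a streaming dataset: the first
    num_batches items are still produced by the underlying iterator,
    just thrown away, before the rest are yielded."""
    it = iter(dataloader)
    for _ in islice(it, num_batches):
        pass  # each skipped batch still costs a full iteration
    yield from it

# Toy stream standing in for an IterableDataset's batches
stream = (f"batch_{i}" for i in range(10))
remaining = list(skip_first_batches_sketch(stream, 7))
print(remaining)  # ['batch_7', 'batch_8', 'batch_9']
```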
Do you have any suggestions on the best path forward? I’m scratching my head here and not sure exactly what to do.