NCCL Timeout Accelerate Load From Checkpoint

Hi all, I had a quick question. I’m having issues when I try to resume from a checkpoint while using an IterableDataset. Training gets to the first `accelerator.backward` call, then fails with an NCCL timeout like:

[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=559, OpType=REDUCE, Timeout(ms)=1800000) ran for 1800116 milliseconds before timing out.

I even added

timeout = InitProcessGroupKwargs(timeout=timedelta(seconds=1800 * 2))
accelerator = Accelerator(kwargs_handlers=[timeout])

and set `NCCL_ASYNC_ERROR_HANDLING=1`, but I still see the timeout after 1800 seconds, so the doubled timeout doesn't seem to be taking effect.

Roughly, this is how I load the checkpoint and then skip batches:

if config["train_args"]["resume_from_checkpoint"]:
    # Load the DeepSpeed checkpoint from the specified path
    accelerator.load_state(config["train_args"]["resume_from_checkpoint"])
    accelerator.print(f"Resumed from checkpoint: {config['train_args']['resume_from_checkpoint']}")
    # Recover the step count from a checkpoint directory named like step_<N>
    path = os.path.basename(config["train_args"]["resume_from_checkpoint"])
    training_difference = os.path.splitext(path)[0]
    resume_step = int(training_difference.replace("step_", ""))
else:
    resume_step = None

progress_bar = tqdm(range(max_steps), disable=not accelerator.is_local_main_process)
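As a sanity check, the step parsing itself behaves as I'd expect when run standalone (using a made-up checkpoint path `outputs/step_500` just for illustration):

```python
import os

# Hypothetical checkpoint path, only for illustration
checkpoint = "outputs/step_500"

path = os.path.basename(checkpoint)              # "step_500"
training_difference = os.path.splitext(path)[0]  # still "step_500", no extension to strip
resume_step = int(training_difference.replace("step_", ""))
print(resume_step)  # 500
```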

if config["train_args"]["resume_from_checkpoint"] and resume_step is not None:
    # We need to skip steps until we reach the resumed step
    train_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step)
    total_steps += resume_step
    accelerator.print(f"Resuming training from step {resume_step}")

accelerator.print(f"Resumed training on rank {accelerator.state.process_index}")

for batch in train_dataloader:
    loss = model(**batch)
    accelerator.backward(loss)  # this is where the NCCL timeout hits
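For context, my understanding is that with an IterableDataset there's no index to seek to, so skipping means consuming and discarding batches. Here's a simplified pure-Python stand-in for that idea (not the real Accelerate implementation):

```python
from itertools import islice

def skip_first_batches_sketch(batches, num_batches):
    """Consume and discard the first num_batches items, then yield the rest."""
    it = iter(batches)
    for _ in islice(it, num_batches):
        pass  # each skipped batch is still produced by the underlying iterable
    yield from it

remaining = list(skip_first_batches_sketch(range(10), 3))
print(remaining)  # [3, 4, 5, 6, 7, 8, 9]
```

My worry is that if the ranks disagree on how many batches they skip or produce, one rank could reach a collective op that the others never enter, which would look exactly like this watchdog timeout.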

Do you have any suggestions on the best path forward? I’m scratching my head here and not sure exactly what to do.