I am using accelerate with DeepSpeed stage 1. I get the following error while saving the model:
NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=913, OpType=ALLREDUCE, NumelIn=499133440, NumelOut=499133440, Timeout(ms)=600000) ran for 600489 milliseconds before timing out.
I am not sure why the timeout is 600000 ms (10 minutes). I tried to set the timeout in code using InitProcessGroupKwargs:
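from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Request a 1-hour timeout for collective operations.
kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=3600))
accelerator = Accelerator(
    gradient_accumulation_steps=args.gradient_accumulation_steps,
    mixed_precision=args.mixed_precision,
    log_with=args.report_to,
    project_config=accelerator_project_config,
    kwargs_handlers=[kwargs],
)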
But this did not change the timeout at all. How can I increase the timeout from 600 s to 3600 s?
accelerate version: 0.29.2
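
To make the numbers concrete, here is a small sketch of the unit conversion, assuming the Timeout(ms) field in the log is in milliseconds. The variable names are mine and are for illustration only; nothing here touches accelerate or NCCL.

```python
from datetime import timedelta

# Timeout reported in the NCCL watchdog message (assumed to be milliseconds).
observed_ms = 600_000
# Timeout I passed to InitProcessGroupKwargs.
requested = timedelta(seconds=3600)

# Convert both values so they can be compared in the same units.
observed = timedelta(milliseconds=observed_ms)
requested_ms = int(requested.total_seconds() * 1000)

print(f"observed:  {observed} ({observed_ms} ms)")
print(f"requested: {requested} ({requested_ms} ms)")
```

If the handler had taken effect, I would expect the log to show Timeout(ms)=3600000 rather than 600000.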