Accelerator.save_state errors out due to timeout. Unable to increase timeout through kwargs_handlers

I am using accelerate with DeepSpeed ZeRO stage 1. I get the following error while saving the model:

NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=913, OpType=ALLREDUCE, NumelIn=499133440, NumelOut=499133440, Timeout(ms)=600000) ran for 600489 milliseconds before timing out.

I am not sure why the timeout is 600000 ms (10 minutes). I tried to set the timeout in the code using InitProcessGroupKwargs:

from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

# args and accelerator_project_config are defined elsewhere in the script.
kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=3600))
accelerator = Accelerator(
    gradient_accumulation_steps=args.gradient_accumulation_steps,
    mixed_precision=args.mixed_precision,
    log_with=args.report_to,
    project_config=accelerator_project_config,
    kwargs_handlers=[kwargs],
)

But this did not change the timeout at all. How can I increase the timeout from 600 s to 3600 s?
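
Would initializing the process group manually before constructing the Accelerator be a reasonable workaround? A minimal sketch, assuming Accelerator reuses an already-initialized default process group and that the script is started with accelerate launch (so RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT come from the environment):

from datetime import timedelta

import torch.distributed as dist
from accelerate import Accelerator

# Assumption: if the default process group already exists, Accelerator
# does not re-initialize it, so the timeout set here stays in effect.
if not dist.is_initialized():
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=3600))

accelerator = Accelerator()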

accelerate version: 0.29.2

Hi @rranjan1, could you check the following?

import time
from datetime import timedelta

from torch import tensor
from accelerate import Accelerator, InitProcessGroupKwargs

kwargs = [InitProcessGroupKwargs(timeout=timedelta(seconds=4))]
accelerator = Accelerator(kwargs_handlers=kwargs)

# Touch the GPU on every rank, then let the main process lag behind the
# others so the barrier has to wait longer than the 4 s timeout.
t = tensor(0).to(accelerator.device)
if accelerator.is_main_process:
    time.sleep(8)
accelerator.wait_for_everyone()

print("All called!")

It should fail, since the timeout is 4 seconds while the main process sleeps for 8. If you change the timeout to 10 seconds, it should work. I just want to see whether the timeout is actually being applied.
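
Note that the test needs at least two processes for the barrier to matter, e.g. accelerate launch --num_processes 2 test_timeout.py (the script name is just a placeholder).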

I checked this with timeout=timedelta(seconds=N) for values of N ranging from 1 to 10. The code never fails for any of these values.
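
One possibility I have not verified: in some PyTorch versions the NCCL timeout is only enforced when blocking wait or async error handling is enabled (NCCL_BLOCKING_WAIT / NCCL_ASYNC_ERROR_HANDLING); otherwise the host thread just enqueues the barrier and returns without ever checking the clock. A variant of the test with blocking wait turned on, assuming the environment variable (spelled TORCH_NCCL_BLOCKING_WAIT on newer PyTorch) is honored when set before initialization:

import os
import time
from datetime import timedelta

# Assumption: this must be set before the process group is created;
# newer PyTorch versions spell it TORCH_NCCL_BLOCKING_WAIT.
os.environ["NCCL_BLOCKING_WAIT"] = "1"

from accelerate import Accelerator, InitProcessGroupKwargs

kwargs = [InitProcessGroupKwargs(timeout=timedelta(seconds=4))]
accelerator = Accelerator(kwargs_handlers=kwargs)

if accelerator.is_main_process:
    time.sleep(8)  # longer than the 4 s timeout, so the barrier should now raise
accelerator.wait_for_everyone()
print("All called!")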