`torch.distributed.all_gather_object` raises an error with 1 GPU but works fine with multiple GPUs

I’m currently using HuggingFace Accelerate to run some distributed experiments and have the following code inside my evaluation loop (`dist` is `torch.distributed`):

device = accelerator.device
intermediate_value = {}
output = [None] * accelerator.num_processes

# Some evaluation code.

dist.all_gather_object(output, intermediate_value)

When I’m using multiple GPUs it’s fine, but when I’m using only one I get the following error:

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

I thought that if you wrap your model, optimizer, etc. with HuggingFace Accelerate, you don’t have to call `torch.distributed.init_process_group` yourself. If that’s the case, why doesn’t it work when I only have 1 GPU?

Thanks in advance.

Not sure if this answers my question directly, but I fixed the issue by wrapping the gather call in a try-except block, because I sometimes use one GPU and sometimes many. Not sure if this is optimal, but it works.

It looks like:

try:
    dist.all_gather_object(output, intermediate_value)
except RuntimeError as e:
    # No process group exists in a single-process run, so fall back to the local value.
    logger.error("Caught RuntimeError: %s", e)
    output = [intermediate_value]
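
An alternative to catching the exception would be to check whether a process group exists before gathering. This is only a minimal sketch: `gather_objects` is a hypothetical helper name, and it assumes that a single-process run leaves the default process group uninitialized, which is what the error message above suggests.

```python
import torch.distributed as dist


def gather_objects(obj):
    """Return a list with one entry per process.

    Hypothetical helper: falls back to a one-element list when
    torch.distributed has no initialized process group.
    """
    if dist.is_available() and dist.is_initialized():
        # One slot per participating process.
        output = [None] * dist.get_world_size()
        dist.all_gather_object(output, obj)
        return output
    # Single-process run: nothing to gather from other ranks.
    return [obj]
```

With this, `output = gather_objects(intermediate_value)` could replace the try-except, and nothing gets logged as an error in the expected single-GPU case.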

Open to any other opinions.

`accelerator.gather()` is guaranteed to always work, even on a single node.

Thanks. Seems like I need to get a bit more familiar with the API lol.