Using `torch.distributed.all_gather_object` returns error when using 1 GPU but works fine for multiple GPUs

seanswyi · May 3, 2023, 12:34am

I’m currently using HuggingFace Accelerate to run some distributed experiments and have the following code inside of my evaluation loop:

import torch.distributed as dist

model.eval()
device = accelerator.device
intermediate_value = {}
# One slot per process; all_gather_object fills these in place.
output = [None] * accelerator.num_processes

# Some evaluation code.

dist.all_gather_object(output, intermediate_value)

When I’m using multiple GPUs it’s fine, but when I’m using only one I get the following error:

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

What I’m wondering is this: I thought that if you wrap your model, optimizer, etc. with HuggingFace Accelerate, you don’t have to call torch.distributed.init_process_group yourself. If that’s the case, why doesn’t it work when I only have 1 GPU?

Thanks in advance.

seanswyi · July 5, 2023, 12:34am

Not sure if this answers my question directly, but I fixed the issue by wrapping the gather call in a try-except block, since I sometimes run on one GPU and sometimes on many. Not sure if this is optimal, but it works.

It looks like:

try:
    dist.all_gather_object(output, intermediate_value)
except RuntimeError as e:
    logger.error("Caught RuntimeError: %s", e)
    output = [intermediate_value]

Open to any other opinions.
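
An alternative to catching the exception is to check whether a process group exists before gathering. The error message suggests that no default process group is set up in the single-GPU run, so the sketch below falls back to a one-element list in that case. The helper name gather_objects and the example dictionary are hypothetical, not part of the original code.

```python
import torch.distributed as dist


def gather_objects(obj):
    """Return a list with one entry per process (hypothetical helper)."""
    # is_available() is checked first because some PyTorch builds
    # ship without distributed support.
    if dist.is_available() and dist.is_initialized():
        gathered = [None] * dist.get_world_size()
        dist.all_gather_object(gathered, obj)
        return gathered
    # No process group (e.g. a single-GPU or CPU run): nothing to gather.
    return [obj]


output = gather_objects({"accuracy": 0.5})
```

Unlike the try-except version, this does not swallow unrelated RuntimeErrors raised during the gather itself.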

muellerzr · July 5, 2023, 2:17am

accelerator.gather() is guaranteed to always work, even on a single node.

seanswyi · July 5, 2023, 3:14am

Thanks. Seems like I need to get a bit more familiar with the API lol.
