Bug on multi-GPU trainer with accelerate

I am running the sample object_detection script from Hugging Face on 4 GPUs using accelerate launch.

As soon as the evaluation loop is run at the end of epoch 1, it fails at the metrics = metric.compute() step with the following error:

[rank3]:     work = group.allgather([tensor_list], [tensor])
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: RuntimeError: No backend type associated with device type cpu
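A common cause of this error (an assumption here, since the full trace isn't shown) is that the metric's state tensors live on the CPU while the distributed backend is NCCL, which can only all-gather CUDA tensors. A minimal sketch of keeping state on the backend's device:

```python
import torch

def to_backend_device(t: torch.Tensor) -> torch.Tensor:
    """Move a tensor to the device the collective backend expects.

    NCCL (the default multi-GPU backend) can only all-gather CUDA
    tensors; metric state left on the CPU raises
    "RuntimeError: No backend type associated with device type cpu"
    when compute() triggers the internal all-gather.
    """
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    return t.to(device)

# hypothetical usage with a torchmetrics object: move the whole metric,
# e.g. metric = metric.to(device), before calling metric.compute()
state = to_backend_device(torch.zeros(3))
```

The same idea applies to the preds/targets passed to `metric.update()`: they should be on the same device as the metric's state.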

Possibly unresolved issue…

Thank you @John6666! Yes, I saw that issue. It seems to be a common problem. The evaluation phase within each epoch is quite fiddly, and even when using gather I am not seeing the correct number of records per batch.


I thought there might be an issue with gather or allgather, but that doesn't seem to be the cause of this problem.

Since this only affects multi-GPU setups, the pool of affected users is small, so there may be few reports and many unresolved bugs lurking.
I can't find anything like it on the GitHub repository for the accelerate library either…

I am now using this fix. The script now runs to completion, but I have no way of checking whether torchmetrics.Metric actually used all the records in my validation set. Any ideas @John6666?


It's a primitive and dirty method, but could you override some function and insert a print statement or logger?
Since it's Python, you could also modify the code of the library itself…
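The override idea could look something like this: a small decorator that wraps a metric's `update()` and prints how many records each call receives. The patching target shown in the comment is hypothetical; any torchmetrics class would work the same way:

```python
import functools

def log_calls(method):
    """Wrap a metric's update() so each call reports its batch size."""
    @functools.wraps(method)
    def wrapper(self, preds, targets, *args, **kwargs):
        # crude visibility: count records handed to this process
        print(f"update(): {len(preds)} records on this process")
        return method(self, preds, targets, *args, **kwargs)
    return wrapper

# hypothetical usage, patching a torchmetrics class in place:
# MeanAveragePrecision.update = log_calls(MeanAveragePrecision.update)
```

Summing the printed counts across all ranks and batches should match the validation set size if nothing is dropped or duplicated.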

Found a solution by creating a second torchmetrics metric using SumMetric().
