Evaluate doesn't play nicely with Accelerate in multi-GPU settings

Evaluate, when imported in a distributed setting through accelerate (and probably through torch.distributed.launch as well), causes accelerate to crash with an NCCL error. Here is a minimal repro, which was run on 2 GPUs:

from accelerate import Accelerator
import evaluate  # merely importing evaluate is what triggers the crash


if __name__ == "__main__":
    accelerator = Accelerator()
    print(accelerator.state)
    accelerator.wait_for_everyone()

Save this as repro.py and run it using:

accelerate launch --num_processes 2 --num_machines 1 --multi_gpu repro.py

This crashes with an error such as:

RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:125, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
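As mentioned above, the crash can probably also be triggered through torch.distributed directly. A sketch of the equivalent launch (untested on my end; --use_env is needed so that accelerate can read LOCAL_RANK from the environment):

python -m torch.distributed.launch --use_env --nproc_per_node 2 repro.py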

Now, if you comment out evaluate’s import and launch the script again, it runs smoothly.

On further debugging, I found that the root import causing this issue is TFPreTrainedModel, imported at this line.
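To double-check this, here is a sketch of a more direct repro that skips evaluate and imports the suspect itself (it assumes TensorFlow is installed, since the import pulls it in, and that transformers exposes TFPreTrainedModel at the top level):

from accelerate import Accelerator
# Hypothetical isolation of the suspected culprit: import TFPreTrainedModel
# directly instead of evaluate. Importing it loads TensorFlow, which is
# presumably what touches CUDA before NCCL gets to initialize.
from transformers import TFPreTrainedModel


if __name__ == "__main__":
    accelerator = Accelerator()
    print(accelerator.state)
    accelerator.wait_for_everyone()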

I also noticed that this line is gone after a refactor on the main branch, but I am still keen to figure out the root cause so that it doesn’t resurface anywhere else.
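Until the root cause is fully pinned down, a workaround worth trying is to disable transformers’ TensorFlow backend before anything imports it. This is only a sketch and assumes your transformers version honors the USE_TF environment variable:

import os

# Assumption: transformers checks USE_TF at import time and, when it is
# "0", skips loading TensorFlow - the suspected trigger of the NCCL error.
os.environ["USE_TF"] = "0"

from accelerate import Accelerator
import evaluate


if __name__ == "__main__":
    accelerator = Accelerator()
    print(accelerator.state)
    accelerator.wait_for_everyone()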

Thanks for reporting @aps and sharing a clean repro!

Unfortunately, I am not able to reproduce the error on my end. Here’s what I get from running your repro script:

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--mixed_precision` was set to a value of `'no'`
        `--num_cpu_threads_per_process` was set to `4` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1

Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0

Could you please share what hardware and environment you’re running on? You can get the latter for accelerate by running:

accelerate env

which gives for me:

- `Accelerate` version: 0.12.0
- Platform: Linux-4.19.0-21-cloud-amd64-x86_64-with-glibc2.17
- Python version: 3.8.13
- Numpy version: 1.23.3
- PyTorch version (GPU?): 1.12.1+cu102 (True)
- `Accelerate` default config:
        Not found

I’m also using evaluate==0.2.2 - is it the same for you?


I also cannot reproduce this error. Which NCCL and CUDA versions are you using? They are extremely important in this case :slight_smile:

Sorry, my bad - I should have shared the exact versions of the libraries.

Here you go:

- `Accelerate` version: 0.12.0
- Platform: Linux-4.19.0-21-cloud-amd64-x86_64-with-glibc2.10
- Python version: 3.8.13
- Numpy version: 1.21.5
- PyTorch version (GPU?): 1.11.0+cu102 (True)
- `Accelerate` default config:
	Not found
- `evaluate` version: 0.2.2
- `transformers` version: 4.20.1
- NCCL version is 21.0.3

The machine I am running on is a GCP instance with 2 V100-16GB GPUs.

Might it also be fixed by upgrading the transformers version?
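If someone wants to test that theory, upgrading both libraries and re-running the repro would look like this (I haven’t verified which transformers release drops the import):

pip install -U transformers evaluate
accelerate launch --num_processes 2 --num_machines 1 --multi_gpu repro.py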