Hi,
I have a 6-GPU machine, and I want to run (a) BERT for task 1 on the first 3 GPUs and (b) BERT for task 2 on the last 3 GPUs.
But when I do that, I run into this error: "RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729062494/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8"
Could you please help me fix this?
Is it even possible to run 2 separate models at the same time like this with accelerate?
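
The intended setup can be sketched roughly as follows. This is only an illustration under assumptions, not the exact commands that produced the error: the script names `train_task1.py` and `train_task2.py` and the helpers `build_job` and `describe` are hypothetical, and the port numbers are arbitrary. Each job is restricted to its own GPUs through `CUDA_VISIBLE_DEVICES` and given its own `--main_process_port`, since two distributed jobs on one machine cannot share a rendezvous port. Whether that alone resolves the NCCL error above is not confirmed.

```python
import os
import shlex


def build_job(gpu_ids, port, script):
    """Return (env, cmd) for one `accelerate launch` job limited to gpu_ids.

    `script` is a hypothetical training script name; `port` is an arbitrary
    free port chosen for illustration.
    """
    env = dict(os.environ)
    # Only these GPUs are visible to the job; inside it they appear as 0..n-1.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    cmd = [
        "accelerate", "launch",
        "--multi_gpu",
        "--num_processes", str(len(gpu_ids)),
        # Assumption: each concurrent job needs its own rendezvous port.
        "--main_process_port", str(port),
        script,
    ]
    return env, cmd


# Task 1 on the first three GPUs, task 2 on the last three.
JOBS = [
    build_job([0, 1, 2], 29500, "train_task1.py"),
    build_job([3, 4, 5], 29501, "train_task2.py"),
]


def describe(jobs):
    """Render each job as the equivalent one-line shell command."""
    return [
        f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']} {shlex.join(cmd)}"
        for env, cmd in jobs
    ]


if __name__ == "__main__":
    for line in describe(JOBS):
        print(line)
```

To actually start the jobs, each `(env, cmd)` pair could be passed to `subprocess.Popen(cmd, env=env)`. Setting `NCCL_DEBUG=INFO` in `env` should make NCCL print more detail about the "unhandled system error".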