Multi-GPU Distributed Training using Accelerate on Windows

I am trying to run multi-GPU distributed training on a model using the Accelerate library. I have already set up my configs with `accelerate config` and am launching with `accelerate launch train.py`.
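To rule out anything specific to my model, here is a stripped-down sketch of the kind of script I am launching (the tiny model and random data below are placeholders, not my real code):

```python
# train.py -- stripped-down sketch of my script; the tiny model and
# random data are placeholders for my actual training code
import torch
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(10, 2)            # stand-in for my real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)

model.train()
for step in range(100):
    batch = torch.randn(8, 10, device=accelerator.device)
    loss = model(batch).pow(2).mean()     # dummy loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
```

Every time I launch it, the run crashes with the following errors: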

```
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed

    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError
```

I am running this on a Windows system, and I understand that NCCL is not available on Windows. I would appreciate it if anyone could provide a workaround for Windows :smiley:
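From what I have read, one possible workaround is to force the gloo backend (which PyTorch does support on Windows) instead of NCCL. The sketch below is my guess based on the Accelerate docs; `InitProcessGroupKwargs` and its `backend` argument are my reading of the API, not something I have verified on my setup:

```python
# My guess at forcing the gloo backend -- untested on my machine.
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Ask Accelerate to initialize torch.distributed with gloo (available on
# Windows) instead of the default NCCL backend.
gloo_kwargs = InitProcessGroupKwargs(backend="gloo")
accelerator = Accelerator(kwargs_handlers=[gloo_kwargs])
```

Is something like this the right direction, or is there an option in `accelerate config` that selects the backend?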