Multiple Model training on multiple GPUs

abhi4572 · February 10, 2022, 6:38am

Hi,
I have a 6-GPU machine. I want to run a) BERT for task-1 on the first 3 GPUs and b) BERT for task-2 on the last 3 GPUs.
But when I do that, I run into this error: “RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729062494/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8”

How can this be fixed?
Is it even possible to run 2 separate models like this with Accelerate?

abhi4572 · February 14, 2022, 3:43am

Kindly help.
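
For readers with the same setup, here is a minimal sketch of one common way to keep two training jobs on disjoint GPU sets: start each job as its own `accelerate launch` process, restrict it with `CUDA_VISIBLE_DEVICES`, and give it its own `--main_process_port`. The thread does not establish what caused the NCCL error, so this may or may not resolve it. The script names `train_task1.py` and `train_task2.py` are hypothetical, the port numbers are arbitrary free ports, and `NCCL_DEBUG=INFO` is set only to get more detailed logs.

```python
import os
import subprocess


def build_job(gpu_ids, port, script):
    """Return (command, environment) for one independent training job."""
    env = dict(os.environ)
    # Restrict this job to its own GPUs; inside the job they appear as 0..n-1.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    # Ask NCCL for verbose logs, which can help narrow down an
    # "unhandled system error".
    env["NCCL_DEBUG"] = "INFO"
    cmd = [
        "accelerate", "launch",
        "--multi_gpu",
        "--num_processes", str(len(gpu_ids)),
        # Assumption: each job gets a distinct port so the two jobs do not
        # try to rendezvous on the same address.
        "--main_process_port", str(port),
        script,
    ]
    return cmd, env


# Hypothetical script names; ports are arbitrary.
JOBS = [
    build_job([0, 1, 2], 29500, "train_task1.py"),
    build_job([3, 4, 5], 29501, "train_task2.py"),
]


def launch_all(jobs):
    """Start every job in parallel and return their exit codes."""
    procs = [subprocess.Popen(cmd, env=env) for cmd, env in jobs]
    return [p.wait() for p in procs]

# Call launch_all(JOBS) to start both jobs.
```

Because each job only sees its own three GPUs, the training scripts themselves need no device-specific changes. If the error persists, the `NCCL_DEBUG=INFO` output is usually the next thing to look at.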
