Distributed GPU training not working

I made a config file using ‘accelerate config’ and gave it the parameters below:

In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 2
What is the rank of this machine (from 0 to the number of machines - 1)? [0]: 1
What is the IP address of the machine that will host the main process? 172.31.27.4 (private IP address of the AWS machine)
What is the port you will use to communicate with the main process? 8000
Do you want to use DeepSpeed? [yes/NO]: NO
Do you want to use FullyShardedDataParallel? [yes/NO]: NO
How many processes in total will you use? [1]: 2
Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: NO
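
For reference, those answers end up in a YAML file (written to ~/.cache/huggingface/accelerate/default_config.yaml in my case) that looks roughly like this; the exact keys can differ between Accelerate versions:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 1                  # the rank I entered for this machine
main_process_ip: 172.31.27.4     # private IP of the AWS machine
main_process_port: 8000
main_training_function: main
mixed_precision: 'no'
num_machines: 2
num_processes: 2                 # total processes across all machines
use_cpu: false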

and then I run:

accelerate launch train.py

It shows the above warnings and gets stuck; I waited for more than 30 minutes but got no output. What am I doing wrong? Also, do I need to provide the public IP address or the private IP address of the AWS machine I am using?


Did you fix this issue?

If you are using multi-node training, you need a config file on each node: one with machine rank 0 and one with machine rank 1.
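
Roughly like this, assuming the default config path on each node and that only the rank differs between the two files (a sketch, not the exact file your Accelerate version writes):

# ~/.cache/huggingface/accelerate/default_config.yaml on the node whose
# private IP you entered as main_process_ip (172.31.27.4)
machine_rank: 0
---
# the same file on the second node
machine_rank: 1
# every other key (num_machines: 2, num_processes: 2, main_process_ip,
# main_process_port, mixed_precision, ...) stays identical on both nodes

Then run accelerate launch train.py on both machines. The launcher waits until all num_machines ranks have connected, so starting it on only one node, or with that node set to rank 1 while nothing is running as rank 0 at the main IP, will typically just sit there and look like a hang.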