Distributed GPU training not working

I made a config file using ‘accelerate config’ and gave it the parameters below:

In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 2
What is the rank of this machine (from 0 to the number of machines - 1)? [0]: 1
What is the IP address of the machine that will host the main process? 172.31.27.4 (private IP address of the AWS machine)
What is the port you will use to communicate with the main process? 8000
Do you want to use DeepSpeed? [yes/NO]: NO
Do you want to use FullyShardedDataParallel? [yes/NO]: NO
How many processes in total will you use? [1]: 2
Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: NO
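
For reference, those answers end up in a YAML file (written to ~/.cache/huggingface/accelerate/default_config.yaml in my case) that looks roughly like this; the exact keys can differ between Accelerate versions:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 1                  # the rank I entered for this machine
main_process_ip: 172.31.27.4     # private IP of the AWS machine
main_process_port: 8000
main_training_function: main
mixed_precision: 'no'
num_machines: 2
num_processes: 2                 # total processes across all machines
use_cpu: false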

and then I run:

accelerate launch train.py

It shows the above warnings and gets stuck; I waited for more than 30 minutes but got no output. What am I doing wrong? Also, do I need to provide the public IP address or the private IP address of the AWS machine I am using?


Did you fix this issue?

If you are using multi-node training, you need a config file on each node: one with machine rank 0 and one with machine rank 1.
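
Roughly like this, assuming the default config path on each node and that only the rank differs between the two files (a sketch, not the exact file your Accelerate version writes):

# ~/.cache/huggingface/accelerate/default_config.yaml on the node whose
# private IP you entered as main_process_ip (172.31.27.4)
machine_rank: 0
---
# the same file on the second node
machine_rank: 1
# every other key (num_machines: 2, num_processes: 2, main_process_ip,
# main_process_port, mixed_precision, ...) stays identical on both nodes

Then run accelerate launch train.py on both machines. The launcher waits until all num_machines ranks have connected, so starting it on only one node, or with that node set to rank 1 while nothing is running as rank 0 at the main IP, will typically just sit there and look like a hang.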