I am trying to run multi-node training across two nodes, each with one GPU.
This is my configuration:
compute_environment: LOCAL_MACHINE
deepspeed_config:
deepspeed_multinode_launcher: standard
gradient_accumulation_steps: 1
gradient_clipping: 1.0
offload_optimizer_device: none
offload_param_device: none
zero3_init_flag: false
zero_stage: 2
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0 # 1 in the second node
main_process_ip: 10.3.40.125 # Same on both nodes
main_process_port: 29500
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
use_cpu: false
Running accelerate test --config_file accelerate_config.yaml produces the following logs on the second (worker) node, while the master node just keeps waiting for synchronization:
Running: accelerate-launch --config_file=accelerate_config.yaml /home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: [2022-12-15 19:47:59,266] [INFO] [comm.py:654:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: DEEPSPEED Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout: ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 1, 'zero_optimization': {'stage': 2, 'offload_optimizer': {'device': 'none'}, 'offload_param': {'device': 'none'}, 'stage3_gather_16bit_weights_on_model_save': False}, 'gradient_clipping': 1.0, 'steps_per_print': inf, 'fp16': {'enabled': True, 'auto_cast': True}}
stdout:
stdout:
stdout: **Test random number generator synchronization**
stderr: Traceback (most recent call last):
stderr: File "/home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 359, in <module>
stderr: main()
stderr: File "/home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 336, in main
stderr: rng_sync_check()
stderr: File "/home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 44, in rng_sync_check
stderr: synchronize_rng_states(["torch"])
stderr: File "/home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/accelerate/utils/random.py", line 88, in synchronize_rng_states
stderr: synchronize_rng_state(RNGType(rng_type), generator=generator)
stderr: File "/home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/accelerate/utils/random.py", line 70, in synchronize_rng_state
stderr: torch.distributed.broadcast(rng_state, 0)
stderr: File "/home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1400, in broadcast
stderr: work = default_pg.broadcast([tensor], opts)
stderr: RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1666642975993/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
stderr: ncclInternalError: Internal check failed.
stderr: Last error:
stderr: Proxy Call to rank 0 failed (Connect)
stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6239) of binary: /home4/nouman_tanveer/anaconda3/envs/ldm/bin/python
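Since the traceback ends in a connect failure ("Proxy Call to rank 0 failed (Connect)"), one sanity check I can think of is to confirm the worker node can actually open a TCP connection to the main process port. A minimal stdlib-only sketch (host and port copied from my config above):

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Values from my accelerate_config.yaml; run this from the second node
# while the master node is sitting on main_process_port.
print(can_reach("10.3.40.125", 29500))
```

If this prints False while the master is waiting, the problem is networking (firewall, routing, or wrong interface) rather than DeepSpeed itself.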
Your help would be appreciated.
Also, if anyone can point me to beginner-friendly resources on multi-node training, that would help too. The Accelerate documentation assumes you already know how it works.
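Edit: I plan to retry with NCCL's own debug logging enabled to get more detail. These are standard NCCL environment variables, but the interface name below is a guess for my cluster (check with ip addr) and multiple NICs are a common cause of this exact connect error:

```shell
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
# Pin NCCL to the interface that carries the 10.3.40.x network;
# "eth0" is an assumption -- substitute your actual interface name.
export NCCL_SOCKET_IFNAME=eth0
# then, on each node:
# accelerate test --config_file accelerate_config.yaml
```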