Thanks for the reply. It still does not work. Below are my configs for both nodes again, together with the test command. First, the config on node 0 (default_config_1.yaml):
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: 10.149.4.12
main_process_port: 12345
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 2
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false
And the config on node 1 (default_config_2.yaml):
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 1
main_process_ip: 10.149.4.12
main_process_port: 12345
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 2
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false
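For reference, the same settings can also be passed to the launcher as explicit flags instead of a config file, which helps rule out config-file parsing issues. This is only a sketch of the equivalent invocation (flag names as listed by accelerate launch --help); node 1 uses the identical command with --machine_rank 1:

# node 0 (flag-based equivalent of default_config_1.yaml)
$ accelerate-launch --multi_gpu --mixed_precision fp16 --num_processes 3 --num_machines 2 --machine_rank 0 --main_process_ip 10.149.4.12 --main_process_port 12345 /home/mderaksh/miniconda3/envs/dassl/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py
# node 1 (equivalent of default_config_2.yaml): same command, but with --machine_rank 1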
To check whether the two nodes can find each other, I ran the test script on each node with the following commands:
Command (node 0):
$ accelerate-launch --config_file=/home/username/.cache/huggingface/accelerate/default_config_1.yaml /home/mderaksh/miniconda3/envs/dassl/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py
Output:
**Initialization**
Testing, testing. 1, 2, 3.
Distributed environment: MULTI_GPU Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: fp16
**Test random number generator synchronization**
All rng are properly synched.
**DataLoader integration test**
0 tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62, 63], device='cuda:0') <class 'accelerate.data_loader.DataLoaderShard'>
Non-shuffled dataloader passing.
Shuffled dataloader passing.
Non-shuffled central dataloader passing.
Shuffled central dataloader passing.
**Training integration test**
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
Training yielded the same results on one CPU or distributed setup with no batch split.
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
Training yielded the same results on one CPU or distributes setup with batch split.
FP16 training check.
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
BF16 training check.
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
Command (node 1):
$ accelerate-launch --config_file=/home/username/.cache/huggingface/accelerate/default_config_2.yaml /home/mderaksh/miniconda3/envs/dassl/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py
Output:
**Initialization**
Testing, testing. 1, 2, 3.
Distributed environment: MULTI_GPU Backend: nccl
Num processes: 2
Process index: 1
Local process index: 0
Device: cuda:0
Mixed precision type: fp16
**Test random number generator synchronization**
All rng are properly synched.
**DataLoader integration test**
1 tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62, 63], device='cuda:0') <class 'accelerate.data_loader.DataLoaderShard'>
Shuffled dataloader passing.
Shuffled central dataloader passing.
**Training integration test**
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
Training yielded the same results on one CPU or distributed setup with no batch split.
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
Training yielded the same results on one CPU or distributes setup with batch split.
FP16 training check.
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
BF16 training check.
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
However, I cannot figure out what the problem is.
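To at least rule out a basic networking problem between the nodes, a check along the following lines can confirm that node 1 reaches the main process port on node 0 and make NCCL log its rendezvous attempts. This is only a sketch: nc must be run while the rank-0 launch is still active on node 0, and the exact nc flags may vary by distribution.

# from node 1: verify the main process port on node 0 is reachable
$ nc -zv 10.149.4.12 12345
# re-run the test with verbose NCCL logging
$ NCCL_DEBUG=INFO accelerate-launch --config_file=/home/username/.cache/huggingface/accelerate/default_config_2.yaml /home/mderaksh/miniconda3/envs/dassl/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py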
Update:
I tried the following link to run distributed training on two nodes using torch.distributed.launch, and it seems to work. However, Accelerate still does not.
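For comparison, the per-node torch.distributed.launch invocation from that link looks roughly like the following. This is a sketch only: train.py is a placeholder script name, one process per node is assumed, and only the flag names come from torch.distributed.launch itself.

# node 0
$ python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=10.149.4.12 --master_port=12345 train.py
# node 1
$ python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=10.149.4.12 --master_port=12345 train.py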