Detecting only a single GPU within each node

Hi,

I have two nodes, each containing 3 A6000 GPUs. I am using the Accelerate library to do multi-node training with the following two config files:

1. default_config_1.yaml

command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 4
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: 10.149.4.12
main_process_port: 5000
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 2
num_processes: 3
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false

2. default_config_2.yaml

command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 4
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: null
machine_rank: 1
main_process_ip: 10.149.4.12
main_process_port: 5000
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 2
num_processes: 3
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
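
As a sanity check before launching, it is worth confirming that nothing (for example a restrictive CUDA_VISIBLE_DEVICES) is hiding GPUs on either node. Here is a minimal check that can be run once per node (check_gpus.py is just an illustrative name, not part of my project):

# check_gpus.py -- confirm all local GPUs are visible to PyTorch
import os
import torch

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} ->", torch.cuda.get_device_name(i))

On a node with 3 A6000s and no masking, this should report 3 devices.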

Using these YAML files, I launch the training script on each node separately as follows:

Node1:

accelerate launch --config_file=/home/username/.cache/huggingface/accelerate/default_config_1.yaml --multi_gpu train.py --data ./train.pkl --out_dir ./checkpoint/ --bs 80 --prefix_dim 1024 --prefix cc --prefix_length 5

Node2:

accelerate launch --config_file=/home/username/.cache/huggingface/accelerate/default_config_2.yaml --multi_gpu train.py --data ./train.pkl --out_dir ./checkpoint/ --bs 80 --prefix_dim 1024 --prefix cc --prefix_length 5

However, I can see that Accelerate only uses the first GPU on each node. I have also tried without the DeepSpeed option, but I get the same issue. Could you please help me with this? Thanks.
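
For reference, a small diagnostic can be dropped at the top of train.py to print what each launched process sees (a sketch using Accelerate's public attributes, not my actual training code):

from accelerate import Accelerator
import torch

accelerator = Accelerator()
# With 2 machines x 3 GPUs each, the expectation would be 6 processes in total,
# local process indices 0-2 per node, and devices cuda:0 through cuda:2.
print(
    f"num_processes={accelerator.num_processes} "
    f"process_index={accelerator.process_index} "
    f"local_process_index={accelerator.local_process_index} "
    f"device={accelerator.device} "
    f"device_count={torch.cuda.device_count()}"
)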

Try just with:

accelerate launch --config_file=/home/username/.cache/huggingface/accelerate/default_config_1.yaml train.py --data ./train.pkl --out_dir ./checkpoint/ --bs 80 --prefix_dim 1024 --prefix cc --prefix_length 5

Or, let the config file do all the work.
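
Everything can also be passed as flags with no config file at all (a sketch, with the mode given explicitly via --multi_gpu since no config is used; note that, as far as I know, num_processes in Accelerate is the total number of processes across all machines, so 2 nodes with 3 GPUs each would be 6 rather than 3, and machine_rank becomes 1 on the second node):

accelerate launch --multi_gpu --num_machines 2 --num_processes 6 --machine_rank 0 --main_process_ip 10.149.4.12 --main_process_port 5000 --mixed_precision fp16 train.py --data ./train.pkl --out_dir ./checkpoint/ --bs 80 --prefix_dim 1024 --prefix cc --prefix_length 5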

Thanks for the reply. It still does not work. Below, I provide my configs again along with a test command:

  • node 1 (master):
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: 10.149.4.12
main_process_port: 12345
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 2
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false
  • node 2 (slave):
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 1
main_process_ip: 10.149.4.12
main_process_port: 12345
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 2
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false

I ran the following commands to check whether the two nodes can find each other:

  • node 1 (master)

Command:

accelerate-launch --config_file=/home/username/.cache/huggingface/accelerate/default_config_1.yaml /home/mderaksh/miniconda3/envs/dassl/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py

output:

$ accelerate-launch --config_file=/home/username/.cache/huggingface/accelerate/default_config_1.yaml /home/mderaksh/miniconda3/envs/dassl/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py                                    
**Initialization**
Testing, testing. 1, 2, 3.
Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: fp16


**Test random number generator synchronization**
All rng are properly synched.

**DataLoader integration test**
0 tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
        36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
        54, 55, 56, 57, 58, 59, 60, 61, 62, 63], device='cuda:0') <class 'accelerate.data_loader.DataLoaderShard'>
Non-shuffled dataloader passing.
Shuffled dataloader passing.
Non-shuffled central dataloader passing.
Shuffled central dataloader passing.

**Training integration test**
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
Training yielded the same results on one CPU or distributed setup with no batch split.
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
Training yielded the same results on one CPU or distributes setup with batch split.
FP16 training check.
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
BF16 training check.
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
  • node 2 (slave):

Command:

$ accelerate-launch --config_file=/home/username/.cache/huggingface/accelerate/default_config_2.yaml /home/mderaksh/miniconda3/envs/dassl/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py

Output:

**Initialization**
Testing, testing. 1, 2, 3.
Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 2
Process index: 1
Local process index: 0
Device: cuda:0
Mixed precision type: fp16


**Test random number generator synchronization**
All rng are properly synched.

**DataLoader integration test**
1 tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
        36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
        54, 55, 56, 57, 58, 59, 60, 61, 62, 63], device='cuda:0') <class 'accelerate.data_loader.DataLoaderShard'>
Shuffled dataloader passing.
Shuffled central dataloader passing.

**Training integration test**
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
Training yielded the same results on one CPU or distributed setup with no batch split.
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
Training yielded the same results on one CPU or distributes setup with batch split.
FP16 training check.
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
BF16 training check.
Model dtype: torch.float32, torch.float32. Input dtype: torch.float32

However, I cannot figure out what the problem is: in both outputs above the test reports Num processes: 2, Local process index: 0, and Device: cuda:0, i.e. only a single process per node. :frowning:

Update:

I tried the following link to do distributed training on two nodes using torch.distributed.launch, and it seems to work. However, Accelerate does not.
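
For reference, such a launch for 2 nodes with 3 GPUs each looks roughly like the following (a sketch mirroring the accelerate commands above, not the exact tutorial command; node_rank becomes 1 on the second node, and torch.distributed.launch passes a --local_rank argument to the script unless --use_env is given):

python -m torch.distributed.launch --nnodes=2 --nproc_per_node=3 --node_rank=0 --master_addr=10.149.4.12 --master_port=12345 train.py --data ./train.pkl --out_dir ./checkpoint/ --bs 80 --prefix_dim 1024 --prefix cc --prefix_length 5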