I am trying to fine-tune Mistral 7B across two machines, each with 8 A100s.
Config on machine 1:
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 2
main_process_ip: 10.141.0.12
main_process_port: 1234
num_processes: 16
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Config on machine 2:
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 1
main_training_function: main
mixed_precision: 'bf16'
main_process_ip: 10.141.0.12
main_process_port: 1234
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Then I run the same command on both machines:
accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero3.yaml --gradient_accumulation_steps=1 examples/scripts/sft.py --model_name=mistralai/Mistral-7B-v0.1 --seq_length=2048 --batch_size=1 --gradient_accumulation_steps=1 --use_auth_token=false
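As a sanity check on the launcher side, a few lines like the following near the top of examples/scripts/sft.py would confirm what rendezvous settings each process actually sees (just a sketch; these are the standard torch.distributed variables that accelerate launch populates for a static rendezvous):

import os

# Sketch only: print the rendezvous settings each launched process sees.
# These env vars are populated by `accelerate launch` / torch.distributed.
for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK", "WORLD_SIZE"):
    print(f"{key}={os.environ.get(key)}", flush=True)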
The run successfully loads the checkpoints on both machines, then fails on machine 2 with:
Traceback (most recent call last):
File "examples/scripts/sft.py", line 153, in <module>
trainer.train()
File "/home/setup/dev/trl/trl/trainer/sft_trainer.py", line 290, in train
output = super().train(*args, **kwargs)
File "/home/setup/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
File "/home/setup/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1689, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/setup/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1284, in prepare
result = self._prepare_deepspeed(*args)
File "/home/setup/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1666, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/home/setup/.local/lib/python3.8/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/setup/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
self._configure_distributed_model(model)
File "/home/setup/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1129, in _configure_distributed_model
self._broadcast_model()
File "/home/setup/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1049, in _broadcast_model
dist.broadcast(p, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
File "/home/setup/.local/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
return func(*args, **kwargs)
File "/home/setup/.local/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
File "/home/setup/.local/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 196, in broadcast
return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
File "/home/setup/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
return func(*args, **kwargs)
File "/home/setup/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1570, in broadcast
work = group.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Net : connecting to address with family 7299 is neither AF_INET(2) nor AF_INET6(10)
This looks like the error in the nccl code base here.
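To rule DeepSpeed out of the picture, one thing I still plan to try is a minimal broadcast-only reproducer over plain torch.distributed, launched with torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0|1 --master_addr=10.141.0.12 --master_port=1234 on each machine. A rough sketch (not something I have run yet):

import os
import torch
import torch.distributed as dist

# Sketch only: mirror the broadcast that fails inside DeepSpeed's _broadcast_model(),
# but with no DeepSpeed involved. torchrun supplies MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

t = torch.ones(1024, device="cuda")
dist.broadcast(t, src=0)  # rank 0 -> all 16 ranks
torch.cuda.synchronize()
print(f"rank {dist.get_rank()}: broadcast ok", flush=True)
dist.destroy_process_group()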
I know that nccl can successfully communicate between the two machines. Using NVIDIA's nccl-tests, I run:
$ mpirun -np 2 -H 127.0.0.1:1,10.141.0.22:1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 966823 on lambda-hyperplane12 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 1 Group 0 Pid 966823 on lambda-hyperplane12 device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 2 Group 0 Pid 966823 on lambda-hyperplane12 device 2 [0x44] NVIDIA A100-SXM4-80GB
# Rank 3 Group 0 Pid 966823 on lambda-hyperplane12 device 3 [0x4a] NVIDIA A100-SXM4-80GB
# Rank 4 Group 0 Pid 966823 on lambda-hyperplane12 device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 5 Group 0 Pid 966823 on lambda-hyperplane12 device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 6 Group 0 Pid 966823 on lambda-hyperplane12 device 6 [0xc0] NVIDIA A100-SXM4-80GB
# Rank 7 Group 0 Pid 966823 on lambda-hyperplane12 device 7 [0xc3] NVIDIA A100-SXM4-80GB
# Rank 0 Group 0 Pid 1972306 on xander-gpu-dev device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 1 Group 0 Pid 1972306 on xander-gpu-dev device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 2 Group 0 Pid 1972306 on xander-gpu-dev device 2 [0x45] NVIDIA A100-SXM4-80GB
# Rank 3 Group 0 Pid 1972306 on xander-gpu-dev device 3 [0x4b] NVIDIA A100-SXM4-80GB
# Rank 4 Group 0 Pid 1972306 on xander-gpu-dev device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 5 Group 0 Pid 1972306 on xander-gpu-dev device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 6 Group 0 Pid 1972306 on xander-gpu-dev device 6 [0xc0] NVIDIA A100-SXM4-80GB
# Rank 7 Group 0 Pid 1972306 on xander-gpu-dev device 7 [0xc3] NVIDIA A100-SXM4-80GB
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 31.04 0.00 0.00 0 30.98 0.00 0.00 0
16 4 float sum -1 31.03 0.00 0.00 0 30.78 0.00 0.00 0
32 8 float sum -1 30.61 0.00 0.00 0 31.07 0.00 0.00 0
64 16 float sum -1 30.72 0.00 0.00 0 30.73 0.00 0.00 0
128 32 float sum -1 30.67 0.00 0.01 0 31.04 0.00 0.01 0
256 64 float sum -1 31.39 0.01 0.01 0 30.70 0.01 0.01 0
512 128 float sum -1 31.16 0.02 0.03 0 30.83 0.02 0.03 0
1024 256 float sum -1 30.57 0.03 0.06 0 31.42 0.03 0.06 0
2048 512 float sum -1 30.74 0.07 0.12 0 30.75 0.07 0.12 0
4096 1024 float sum -1 30.80 0.13 0.23 0 31.04 0.13 0.23 0
8192 2048 float sum -1 31.01 0.26 0.46 0 31.46 0.26 0.46 0
16384 4096 float sum -1 31.81 0.52 0.90 0 31.02 0.53 0.92 0
32768 8192 float sum -1 32.87 1.00 1.74 0 31.95 1.03 1.79 0
65536 16384 float sum -1 34.46 1.90 3.33 0 33.11 1.98 3.46 0
131072 32768 float sum -1 36.08 3.63 6.36 0 35.61 3.68 6.44 0
262144 65536 float sum -1 40.70 6.44 11.27 0 39.36 6.66 11.66 0
524288 131072 float sum -1 47.73 10.99 19.22 0 46.90 11.18 19.56 0
1048576 262144 float sum -1 55.22 18.99 33.23 0 55.82 18.78 32.87 0
2097152 524288 float sum -1 78.61 26.68 46.68 0 78.41 26.75 46.80 0
4194304 1048576 float sum -1 102.7 40.83 71.45 0 101.0 41.51 72.64 0
8388608 2097152 float sum -1 159.0 52.76 92.33 0 154.2 54.40 95.20 0
16777216 4194304 float sum -1 259.1 64.76 113.34 0 259.0 64.76 113.34 0
33554432 8388608 float sum -1 414.3 80.98 141.72 0 412.1 81.42 142.49 0
67108864 16777216 float sum -1 639.9 104.87 183.52 0 638.3 105.14 183.99 0
134217728 33554432 float sum -1 1260.9 106.44 186.28 0 1255.1 106.94 187.15 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 36.6311
#
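One difference I can think of between the nccl-tests run and the accelerate run is how NCCL picks the bootstrap/socket interface, so I also intend to retry training with verbose NCCL logging and the interface pinned explicitly. A sketch of what I'd set at the very top of sft.py (or export before accelerate launch); the interface name eno1 is only a placeholder for whatever ip addr shows on my nodes:

import os

# Sketch only: force NCCL onto a known NIC and enable init/net logging.
# "eno1" is a placeholder interface name, not taken from my actual setup.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eno1")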
Versions:
Ubuntu 20.04
transformers 4.35.2
accelerate 0.24.1
deepspeed 0.12.3
torch 2.0.1
$ python3 -c 'import torch; print(torch.version.cuda)'
11.7
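(For completeness, the NCCL version PyTorch itself was built against can be checked with torch.cuda.nccl.version(); it should line up with the 2.14.3 reported in the error above.)
$ python3 -c 'import torch; print(torch.cuda.nccl.version())'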
Any ideas on what may be causing nccl to fail to connect, given that nccl-tests is able to communicate between the two machines? Some missing accelerate / deepspeed configuration seems likely.