Multi Node GPU: `connecting to address with family 7299 is neither AF_INET(2) nor AF_INET6(10)`

I am trying to fine-tune Mistral 7B across two machines, each with 8 A100s.

Config on machine 1:

debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 2
main_process_ip: 10.141.0.12
main_process_port: 1234
num_processes: 16
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Config on machine 2:

debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 1
main_training_function: main
mixed_precision: 'bf16'
main_process_ip: 10.141.0.12
main_process_port: 1234
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Then I run the same command on both machines:

accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero3.yaml --gradient_accumulation_steps=1 examples/scripts/sft.py --model_name=mistralai/Mistral-7B-v0.1 --seq_length=2048 --batch_size=1 --gradient_accumulation_steps=1 --use_auth_token=false
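As an aside for anyone reproducing this: NCCL's own debug variables can help narrow down interface-selection problems before digging into accelerate/deepspeed. `NCCL_DEBUG` and `NCCL_SOCKET_IFNAME` are standard NCCL environment variables; the interface name `eno1` below is a placeholder and should be replaced with whichever NIC carries the 10.141.0.0/24 network.

```shell
# Sketch only: prefix the same launch command with NCCL diagnostics.
# NCCL_DEBUG=INFO prints which interfaces/transports NCCL selects;
# NCCL_SOCKET_IFNAME pins the bootstrap/socket traffic to one NIC.
NCCL_DEBUG=INFO NCCL_SOCKET_IFNAME=eno1 \
  accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
  --gradient_accumulation_steps=1 examples/scripts/sft.py \
  --model_name=mistralai/Mistral-7B-v0.1 --seq_length=2048 \
  --batch_size=1 --gradient_accumulation_steps=1 --use_auth_token=false
```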

It successfully loads the checkpoints on both machines and then on machine 2 it fails with:

Traceback (most recent call last):
  File "examples/scripts/sft.py", line 153, in <module>
    trainer.train()
  File "/home/setup/dev/trl/trl/trainer/sft_trainer.py", line 290, in train
    output = super().train(*args, **kwargs)
  File "/home/setup/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/home/setup/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1689, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/home/setup/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1284, in prepare
    result = self._prepare_deepspeed(*args)
  File "/home/setup/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1666, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/home/setup/.local/lib/python3.8/site-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/setup/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
    self._configure_distributed_model(model)
  File "/home/setup/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1129, in _configure_distributed_model
    self._broadcast_model()
  File "/home/setup/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1049, in _broadcast_model
    dist.broadcast(p, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
  File "/home/setup/.local/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/home/setup/.local/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/home/setup/.local/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 196, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/home/setup/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
  File "/home/setup/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1570, in broadcast
    work = group.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Net : connecting to address  with family 7299 is neither AF_INET(2) nor AF_INET6(10)

This looks like the error raised here in the NCCL code base.
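For context on what the error message is actually checking: the numbers 2 and 10 in `AF_INET(2)`/`AF_INET6(10)` are the Linux socket address-family constants, which can be confirmed from Python's `socket` module. A family of 7299 matches neither, which suggests NCCL read an uninitialized or corrupted sockaddr, typically a symptom of it picking an unusable network interface rather than a plain routing failure.

```python
import socket

# NCCL's socket transport only accepts IPv4 or IPv6 sockaddrs.
# On Linux these address-family constants are:
print(int(socket.AF_INET))   # 2  (IPv4)
print(int(socket.AF_INET6))  # 10 (IPv6, on Linux)
# The bogus family 7299 in the traceback matches neither, pointing at a
# garbage sockaddr from a bad interface choice rather than a reachability issue.
```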

I know that NCCL can successfully communicate between the two machines. Using NVIDIA's nccl-tests, I run:

$ mpirun -np 2 -H 127.0.0.1:1,10.141.0.22:1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 966823 on lambda-hyperplane12 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank  1 Group  0 Pid 966823 on lambda-hyperplane12 device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank  2 Group  0 Pid 966823 on lambda-hyperplane12 device  2 [0x44] NVIDIA A100-SXM4-80GB
#  Rank  3 Group  0 Pid 966823 on lambda-hyperplane12 device  3 [0x4a] NVIDIA A100-SXM4-80GB
#  Rank  4 Group  0 Pid 966823 on lambda-hyperplane12 device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank  5 Group  0 Pid 966823 on lambda-hyperplane12 device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank  6 Group  0 Pid 966823 on lambda-hyperplane12 device  6 [0xc0] NVIDIA A100-SXM4-80GB
#  Rank  7 Group  0 Pid 966823 on lambda-hyperplane12 device  7 [0xc3] NVIDIA A100-SXM4-80GB
#  Rank  0 Group  0 Pid 1972306 on xander-gpu-dev device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank  1 Group  0 Pid 1972306 on xander-gpu-dev device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank  2 Group  0 Pid 1972306 on xander-gpu-dev device  2 [0x45] NVIDIA A100-SXM4-80GB
#  Rank  3 Group  0 Pid 1972306 on xander-gpu-dev device  3 [0x4b] NVIDIA A100-SXM4-80GB
#  Rank  4 Group  0 Pid 1972306 on xander-gpu-dev device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank  5 Group  0 Pid 1972306 on xander-gpu-dev device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank  6 Group  0 Pid 1972306 on xander-gpu-dev device  6 [0xc0] NVIDIA A100-SXM4-80GB
#  Rank  7 Group  0 Pid 1972306 on xander-gpu-dev device  7 [0xc3] NVIDIA A100-SXM4-80GB
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    31.04    0.00    0.00      0    30.98    0.00    0.00      0
          16             4     float     sum      -1    31.03    0.00    0.00      0    30.78    0.00    0.00      0
          32             8     float     sum      -1    30.61    0.00    0.00      0    31.07    0.00    0.00      0
          64            16     float     sum      -1    30.72    0.00    0.00      0    30.73    0.00    0.00      0
         128            32     float     sum      -1    30.67    0.00    0.01      0    31.04    0.00    0.01      0
         256            64     float     sum      -1    31.39    0.01    0.01      0    30.70    0.01    0.01      0
         512           128     float     sum      -1    31.16    0.02    0.03      0    30.83    0.02    0.03      0
        1024           256     float     sum      -1    30.57    0.03    0.06      0    31.42    0.03    0.06      0
        2048           512     float     sum      -1    30.74    0.07    0.12      0    30.75    0.07    0.12      0
        4096          1024     float     sum      -1    30.80    0.13    0.23      0    31.04    0.13    0.23      0
        8192          2048     float     sum      -1    31.01    0.26    0.46      0    31.46    0.26    0.46      0
       16384          4096     float     sum      -1    31.81    0.52    0.90      0    31.02    0.53    0.92      0
       32768          8192     float     sum      -1    32.87    1.00    1.74      0    31.95    1.03    1.79      0
       65536         16384     float     sum      -1    34.46    1.90    3.33      0    33.11    1.98    3.46      0
      131072         32768     float     sum      -1    36.08    3.63    6.36      0    35.61    3.68    6.44      0
      262144         65536     float     sum      -1    40.70    6.44   11.27      0    39.36    6.66   11.66      0
      524288        131072     float     sum      -1    47.73   10.99   19.22      0    46.90   11.18   19.56      0
     1048576        262144     float     sum      -1    55.22   18.99   33.23      0    55.82   18.78   32.87      0
     2097152        524288     float     sum      -1    78.61   26.68   46.68      0    78.41   26.75   46.80      0
     4194304       1048576     float     sum      -1    102.7   40.83   71.45      0    101.0   41.51   72.64      0
     8388608       2097152     float     sum      -1    159.0   52.76   92.33      0    154.2   54.40   95.20      0
    16777216       4194304     float     sum      -1    259.1   64.76  113.34      0    259.0   64.76  113.34      0
    33554432       8388608     float     sum      -1    414.3   80.98  141.72      0    412.1   81.42  142.49      0
    67108864      16777216     float     sum      -1    639.9  104.87  183.52      0    638.3  105.14  183.99      0
   134217728      33554432     float     sum      -1   1260.9  106.44  186.28      0   1255.1  106.94  187.15      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 36.6311
#

Versions:

Ubuntu 20.04
transformers             4.35.2
accelerate               0.24.1
deepspeed                0.12.3
torch                    2.0.1
$ python3 -c 'import torch; print(torch.version.cuda)'
11.7

Any ideas on what may be causing the NCCL connection failure, given that nccl-tests can communicate between the two machines? Some missing accelerate/deepspeed configuration seems likely.

Update: solved. Two things were wrong:

  1. I misinterpreted the nccl-tests results above. The run was not testing cross-node communication, only cross-GPU communication within each node.

  2. I realized that I needed to use sudo to get the InfiniBand benchmark tests, such as ib_write_bw, to work. The root cause was the locked-memory limit (ulimit -l); see here and here
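For anyone hitting the same wall: the locked-memory limit in point 2 can be checked programmatically with Python's standard `resource` module (a quick sketch; the limits.conf lines in the comment are the conventional fix, not something specific to this setup):

```python
import resource

# InfiniBand/RDMA transports pin (lock) memory for registration; a low
# RLIMIT_MEMLOCK (what `ulimit -l` reports) makes that fail in opaque ways.
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

def fmt(v: int) -> str:
    return "unlimited" if v == resource.RLIM_INFINITY else str(v)

print(f"RLIMIT_MEMLOCK soft={fmt(soft)} hard={fmt(hard)}")
# For RDMA workloads both are typically raised to unlimited, e.g. in
# /etc/security/limits.conf:
#   * soft memlock unlimited
#   * hard memlock unlimited
```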

With that fixed, I can successfully train cross-machine.
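For reference, the duplicated "Rank 0..7" listings with different PIDs in the output above are the tell-tale that the two processes built independent single-node communicators. A run that actually exercises the inter-node path would use one rank per GPU across both hosts, assuming nccl-tests was built with MPI support (`make MPI=1`); the hostnames below mirror the original command:

```shell
# 16 ranks, 8 slots per host, one GPU per rank (-g 1), so a single NCCL
# communicator spans both machines.  The earlier "-np 2 ... -g 8" run
# instead produced two independent 8-GPU communicators, one per node.
mpirun -np 16 -H 127.0.0.1:8,10.141.0.22:8 \
  ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```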
