### System Info
```Shell
output of `accelerate env` (note: as shown below, this prints my DEFAULT accelerate config, not the exact config being used for this job -- the actual config file is further below):
- `Accelerate` version: 0.27.2
- Platform: Linux-5.15.0-1037-aws-x86_64-with-glibc2.17
- Python version: 3.8.18
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.2.0+cu121 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 123.82 GB
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: FSDP
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 8
- machine_rank: 0
- num_machines: 2
- main_process_ip:
- main_process_port: 1234
- rdzv_backend: static
- same_network: True
- main_training_function: main
- fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': False, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 'FULL_SHARD', 'fsdp_state_dict_type': 'FULL_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_transformer_layer_cls_to_wrap': 'LlamaDecoderLayer', 'fsdp_use_orig_params': False}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
```
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
### Reproduction
I'm attempting to train a model with multi-node training under the SLURM scheduler, launching the job on 2 nodes with 8 GPUs each. My training script runs fine in a single-node environment with FSDP, and it *starts* fine in the multi-node setting -- until actual communication between the nodes is required.
Once the script reaches the point where multi-node training is actually initialized, the processes appear to have trouble communicating across nodes. I can see the logging output from all 16 processes, the data is loaded, etc., but the script fails at `accelerator.prepare()`.
Specifically, the stack trace contains these lines (the complete stack trace is below):
```
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
```
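The `Last error` line in the full trace below shows NCCL trying to connect to a link-local IPv6 address on a `veth*` interface, so one commonly suggested knob is pinning NCCL to a specific interface. A minimal sketch of that -- the `ens` prefix below is a placeholder, not an interface name I have verified on these nodes (the real name would come from `ip addr`):
```
# Placeholder interface prefix -- substitute whatever `ip addr` actually reports on the nodes.
export NCCL_SOCKET_IFNAME=ens
# Extra detail about transport/interface selection during NCCL init.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
```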
Note that it is possible I have misconfigured the accelerate config or the SLURM settings (task/node counts, etc.), but based on the example [here](https://gist.github.com/pacman100/1cb1f17b2f1b3139a63b764263e70b25) with the corresponding FSDP config [here](https://github.com/pacman100/DHS-LLM-Workshop/blob/6093a2320543c2ac903a1fbb9b034ea714db43c9/personal_copilot/training/configs/fsdp_config.yaml#L4), the setup looks correct to me.
Any thoughts would be appreciated. I've tried many different configurations and have tinkered with the environment to make sure the versions of PyTorch/NCCL/accelerate are all compatible.
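(Roughly what I mean by checking versions is something like the following on each node:)
```
# Print the PyTorch, Accelerate, and bundled-NCCL versions visible in the environment.
python -c "import torch, accelerate; print(torch.__version__, accelerate.__version__, torch.cuda.nccl.version())"
```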
## Contents of fsdp_config_base.yaml I am using:
```
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
mixed_precision: bf16
rdzv_backend: c10d
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
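(If I understand the CLI correctly, `accelerate env` can also be pointed at this file to report it instead of my default config; I'm stating that flag as an assumption rather than something I have verified:)
```
accelerate env --config_file /admin/home-jpgard/rtfm/fsdp_config_base.yaml
```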
## Relevant chunks of the sbatch script I am launching the job with:
```
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=a40x
#SBATCH --gpus-per-node=8
#SBATCH --exclusive
#SBATCH --time=04-00:00:00
#SBATCH --account=nextgends
#SBATCH --chdir=/admin/home-jpgard/rtfm
#SBATCH --output=/admin/home-jpgard/rtfm/slurm-out/%j.out
#SBATCH --err=/admin/home-jpgard/rtfm/slurm-out/%j.err
#SBATCH --exclude=ip-10-0-201-106,ip-10-0-202-154
################# code block adapted from https://gist.github.com/pacman100/1cb1f17b2f1b3139a63b764263e70b25
set -x -e
# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_DEBUG=INFO
GPUS_PER_NODE=8
NNODES=$SLURM_NNODES
NUM_PROCESSES=$(expr $NNODES \* $GPUS_PER_NODE)
# Parse and expand SLURM's 'bracketed' node notation in SLURM_JOB_NODELIST
expand_nodes() {
  # The input is something like "ip-10-0-231-[1,86]"
  local nodelist=$1
  # Replace '[' and ']' with spaces and split the string
  local base=$(echo $nodelist | sed -E 's/\[([0-9]+),([0-9]+)\]/ \1 \2 /')
  # Read into array
  read -a parts <<< "$base"
  # Check if we have three parts: prefix, start, end
  if [ ${#parts[@]} -eq 3 ]; then
    local prefix=${parts[0]}
    local start=${parts[1]}
    local end=${parts[2]}
    # Generate the sequence
    for i in $(seq $start $end); do
      echo "${prefix}${i}"
      return # Return after the first hostname to mimic head-node behavior
    done
  else
    # If the format does not include a range, just echo the input
    echo $nodelist
  fi
}
# Extract the first node name from SLURM_JOB_NODELIST
# This assumes the format "node-list: ip-10-0-209-157,ip-10-0-231-1" and extracts the first node name
echo "SLURM_JOB_NODELIST is $SLURM_JOB_NODELIST"
node_name=$(echo $SLURM_JOB_NODELIST | sed 's/node-list: //' | cut -d, -f1)
# Now, resolve this node name to an IP address
# Using getent ahosts (You can also use nslookup if getent does not work as expected)
MASTER_ADDR=$(getent ahosts $node_name | head -n 1 | awk '{print $1}')
# Check if we got an IP
if [ ! -z "$MASTER_ADDR" ]; then
  echo "Head node IP: $MASTER_ADDR"
else
  echo "Failed to resolve head node IP address"
  # Extract the first node name from SLURM_JOB_NODELIST and expand if needed
  node_name=$(expand_nodes $SLURM_JOB_NODELIST)
  # Now, resolve this node name to an IP address using getent ahosts
  MASTER_ADDR=$(getent ahosts $node_name | head -n 1 | awk '{print $1}')
  echo "Head node IP after parsing: $MASTER_ADDR"
fi
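# (Side note, not what this job actually ran with: `scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1`
# should give the head node name directly, without the manual bracket parsing above.)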
MASTER_PORT=6999
# OTHER LAUNCHERS CAN BE USED HERE
export LAUNCHER="/admin/home-jpgard/miniconda3/envs/rtfm/bin/accelerate launch \
--config_file /admin/home-jpgard/rtfm/fsdp_config_base.yaml \
--num_processes $NUM_PROCESSES \
--main_process_ip $MASTER_ADDR \
--num_machines $NNODES \
--main_process_port $MASTER_PORT \
--machine_rank \$SLURM_PROCID \
"
echo "SLURM_JOB_ID is ${SLURM_JOB_ID}"
echo 'activating conda environment'
source /admin/home-jpgard/.bashrc
source /admin/home-jpgard/miniconda3/etc/profile.d/conda.sh
which conda
conda activate rtfm
which python
export PROGRAM="\
scripts/train.py \
--more-args-here \
--bf16 True \
--use_amp \
"
export CMD="$LAUNCHER $PROGRAM"
echo "about to run ${CMD}"
/opt/slurm/bin/srun --jobid $SLURM_JOBID /usr/bin/bash -c "$CMD"
```
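For isolation, a bare NCCL `all_reduce` could also be launched across the same two nodes outside of Accelerate/FSDP. A minimal sketch, where `nccl_allreduce_check.py` is a placeholder name for a tiny script that only calls `init_process_group` and `all_reduce` (not a script in this repo), and `MASTER_ADDR`/`MASTER_PORT` are set as in the sbatch script above:
```
# Placeholder isolation test: same node/GPU layout, but plain torchrun with a
# c10d rendezvous instead of the accelerate launcher.
srun --nodes=2 --ntasks-per-node=1 bash -c "torchrun \
  --nnodes=$SLURM_NNODES --nproc_per_node=$GPUS_PER_NODE \
  --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
  nccl_allreduce_check.py"
```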
## Full stack trace:
```
Traceback (most recent call last):
File "scripts/train.py", line 582, in <module>
main(
File "scripts/train.py", line 206, in main
model = accelerator.prepare(model)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in prepare
result = tuple(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1229, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1105, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1387, in prepare_model
model = FSDP(model, **kwargs)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 477, in __init__
_auto_wrap(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
_recursive_wrap(**recursive_wrap_kwargs, **root_kwargs) # type: ignore[arg-type]
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
return wrapper_cls(module, **kwargs)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 503, in __init__
_init_param_handle_from_module(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 587, in _init_param_handle_from_module
_sync_module_params_and_buffers(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 1068, in _sync_module_params_and_buffers
_sync_params_and_buffers(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/utils.py", line 303, in _sync_params_and_buffers
dist._broadcast_coalesced(
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
socketStartConnect: Connect to fe80::a849:bbff:fe73:19bd%veth8299cd8<54785> failed : Software caused connection abort
Traceback (most recent call last):
File "scripts/train.py", line 582, in <module>
main(
File "scripts/train.py", line 206, in main
model = accelerator.prepare(model)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in prepare
result = tuple(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1229, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1105, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1387, in prepare_model
model = FSDP(model, **kwargs)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 477, in __init__
_auto_wrap(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
_recursive_wrap(**recursive_wrap_kwargs, **root_kwargs) # type: ignore[arg-type]
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
return wrapper_cls(module, **kwargs)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 503, in __init__
_init_param_handle_from_module(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 587, in _init_param_handle_from_module
_sync_module_params_and_buffers(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 1068, in _sync_module_params_and_buffers
_sync_params_and_buffers(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/utils.py", line 303, in _sync_params_and_buffers
dist._broadcast_coalesced(
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/utils/launch.py:192: FutureWarning: `fsdp_backward_prefetch_policy` is deprecated and will be removed in version 0.27.0 of 🤗 Accelerate. Use `fsdp_backward_prefetch` instead
warnings.warn(
[2024-02-16 02:42:00,874] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025500 closing signal SIGTERM
[2024-02-16 02:42:00,875] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025502 closing signal SIGTERM
[2024-02-16 02:42:00,875] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025503 closing signal SIGTERM
[2024-02-16 02:42:00,875] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025504 closing signal SIGTERM
[2024-02-16 02:42:00,876] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025505 closing signal SIGTERM
[2024-02-16 02:42:00,876] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025506 closing signal SIGTERM
[2024-02-16 02:42:00,876] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025507 closing signal SIGTERM
/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/utils/launch.py:192: FutureWarning: `fsdp_backward_prefetch_policy` is deprecated and will be removed in version 0.27.0 of 🤗 Accelerate. Use `fsdp_backward_prefetch` instead
warnings.warn(
[2024-02-16 02:42:00,885] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267868 closing signal SIGTERM
[2024-02-16 02:42:00,885] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267870 closing signal SIGTERM
[2024-02-16 02:42:00,886] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267872 closing signal SIGTERM
[2024-02-16 02:42:00,886] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267873 closing signal SIGTERM
[2024-02-16 02:42:00,886] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267874 closing signal SIGTERM
[2024-02-16 02:42:00,886] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267875 closing signal SIGTERM
[2024-02-16 02:42:00,886] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267876 closing signal SIGTERM
[2024-02-16 02:42:03,298] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 1267869) of binary: /admin/home-jpgard/miniconda3/envs/rtfm/bin/python
Traceback (most recent call last):
File "/admin/home-jpgard/miniconda3/envs/rtfm/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1010, in launch_command
multi_gpu_launcher(args)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 672, in multi_gpu_launcher
distrib_run.run(args)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-02-16_02:42:00
host : ip-10-0-209-157.us-west-2.compute.internal
rank : 9 (local_rank: 1)
exitcode : 1 (pid: 1267869)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
### Expected behavior
I expect training to work in the multi-node setting just as it does in the single-node setting.