I am on a Slurm cluster, and this Slurm script, which does not use accelerate, works:
#!/bin/bash
#Submit this script with: sbatch filename
#SBATCH --time=0:20:00 # walltime
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks-per-node=4 # number of tasks per node
#SBATCH --qos=standard # qos name
#SBATCH --mem=0
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "NODELIST="${SLURM_NODELIST}
export MASTER_ADDR=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)
export MASTER_PORT=7000
srun python echo.py
where echo.py is:
#!/usr/bin/env python3
import os
import torch.distributed as dist

# Rank and world size come from Slurm's per-task environment.
world_size = int(os.environ["WORLD_SIZE"])
global_rank = int(os.environ["SLURM_PROCID"])
local_rank = int(os.environ["SLURM_LOCALID"])

# Rendezvous over TCP at the address/port exported in the batch script.
init_str = f'tcp://{os.environ["MASTER_ADDR"]}:{os.environ["MASTER_PORT"]}'
dist.init_process_group(backend="nccl", init_method=init_str,
                        world_size=world_size, rank=global_rank)
print(f'I am locally {local_rank} and globally {global_rank} out of {world_size}!')
producing:
I am locally 3 and globally 7 out of 8!
I am locally 0 and globally 4 out of 8!
I am locally 1 and globally 5 out of 8!
I am locally 2 and globally 6 out of 8!
I am locally 0 and globally 0 out of 8!
I am locally 1 and globally 1 out of 8!
I am locally 2 and globally 2 out of 8!
I am locally 3 and globally 3 out of 8!
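For reference, the same wiring also supports a quick collective smoke test, something like the sketch below (one GPU per task assumed; sanity_check.py is just a name I made up):

#!/usr/bin/env python3
# sanity_check.py -- all_reduce smoke test (assumes one GPU per Slurm task)
import os
import torch
import torch.distributed as dist

world_size = int(os.environ["WORLD_SIZE"])
global_rank = int(os.environ["SLURM_PROCID"])
local_rank = int(os.environ["SLURM_LOCALID"])

init_str = f'tcp://{os.environ["MASTER_ADDR"]}:{os.environ["MASTER_PORT"]}'
dist.init_process_group(backend="nccl", init_method=init_str,
                        world_size=world_size, rank=global_rank)

torch.cuda.set_device(local_rank)   # pin this task to its own GPU
t = torch.ones(1, device="cuda")    # every rank contributes 1
dist.all_reduce(t)                  # default reduce op is SUM
print(f"rank {global_rank}: all_reduce sum = {t.item()} (expected {world_size})")
dist.destroy_process_group()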
However, when I try to run with accelerate, it fails, even though I have tried a dozen examples I found online. The latest script I tried is:
#!/bin/bash
#Submit this script with: sbatch filename
#SBATCH --time=0:20:00 # walltime
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks-per-node=4 # number of tasks per node
#SBATCH --job-name=gpt2 # job name
#SBATCH --mem=0
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "NODELIST="${SLURM_NODELIST}
export MASTER_ADDR=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)
export MASTER_PORT=7000
srun accelerate launch --multi_gpu \
    --num_machines $SLURM_NNODES \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --num_processes $WORLD_SIZE \
    --machine_rank $SLURM_PROCID \
    --role $SLURMD_NODENAME \
    --rdzv_conf rdzv_backend=c10d \
    --max_restarts 0 \
    echo.py
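For comparison, one of the variants I tried (without success) launches accelerate only once per node and lets it spawn the local workers itself; my adaptation looked roughly like this (sketch only; mapping $SLURM_NODEID to --machine_rank is my guess from those examples, and the single quotes are there so the variable expands per task rather than in the batch shell):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1 # one launcher per node; it spawns the 4 local workers itself
...
srun bash -c 'accelerate launch --multi_gpu \
    --num_machines $SLURM_NNODES \
    --num_processes $WORLD_SIZE \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --machine_rank $SLURM_NODEID \
    echo.py'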
With the first accelerate script above, I get errors like this from each process (one representative copy shown):
[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:7000 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to 0.0.0.0:7000 (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/users/jsmidt/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/users/jsmidt/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/users/jsmidt/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1066, in launch_command
    multi_gpu_launcher(args)
  File "/users/jsmidt/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 711, in multi_gpu_launcher
    distrib_run.run(args)
  ...
  File "/users/jsmidt/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 705, in _initialize_workers
    self._rendezvous(worker_group)
  File "/users/jsmidt/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/users/jsmidt/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 548, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/users/jsmidt/.local/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:7000 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:7000 (errno: 98 - Address already in use).
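In case it is relevant, I can check per node whether something else already holds the port with an ad-hoc one-liner like this (my own improvisation, not from any guide):

srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 bash -c \
    'ss -tln | grep -q ":7000 " && echo "$(hostname): port 7000 in use" || echo "$(hostname): port 7000 free"'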
I have tried this with and without dist.init_process_group in echo.py, and I am now at a loss for what to try next. Any suggestions would be appreciated. Thanks!