Slurm Issues running accelerate

I am on a Slurm cluster, and this Slurm script, which does not use accelerate, works:

#!/bin/bash
#Submit this script with: sbatch filename
#SBATCH --time=0:20:00   # walltime
#SBATCH --nodes=2   # number of nodes
#SBATCH --ntasks-per-node=4   # number of tasks per node
#SBATCH --qos=standard   # qos name
#SBATCH --mem=0

export MASTER_PORT=7000
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "NODELIST="${SLURM_NODELIST}
export MASTER_ADDR=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)

srun python echo.py

running this echo.py:

#!/usr/bin/env python3
import os

import torch.distributed as dist

# Ranks and world size come from the environment set up by Slurm and the launcher script.
world_size = int(os.environ["WORLD_SIZE"])
global_rank = int(os.environ["SLURM_PROCID"])
local_rank = int(os.environ["SLURM_LOCALID"])

# Rendezvous over TCP at the master node's address and port.
init_str = f'tcp://{os.environ["MASTER_ADDR"]}:{os.environ["MASTER_PORT"]}'
dist.init_process_group(backend="nccl", init_method=init_str, world_size=world_size, rank=global_rank)

print(f'I am locally  {local_rank} and globally {global_rank} out of {world_size}!')

producing

I am locally  3 and globally 7 out of 8!
I am locally  0 and globally 4 out of 8!
I am locally  1 and globally 5 out of 8!
I am locally  2 and globally 6 out of 8!
I am locally  0 and globally 0 out of 8!
I am locally  1 and globally 1 out of 8!
I am locally  2 and globally 2 out of 8!
I am locally  3 and globally 3 out of 8!

However, when I try to run with accelerate I get failures, despite having tried a dozen examples I have seen online. The latest script I have tried is:

#!/bin/bash
#Submit this script with: sbatch filename
#SBATCH --time=0:20:00   # walltime
#SBATCH --nodes=2   # number of nodes
#SBATCH --ntasks-per-node=4   # number of tasks per node
#SBATCH --job-name=gpt2   # job name
#SBATCH --mem=0


export MASTER_PORT=7000
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "NODELIST="${SLURM_NODELIST}
export MASTER_ADDR=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)

srun accelerate launch --multi_gpu --num_machines $SLURM_NNODES  --main_process_ip $MASTER_ADDR --main_process_port $MASTER_PORT --num_processes $WORLD_SIZE --machine_rank $SLURM_PROCID --role $SLURMD_NODENAME --rdzv_conf rdzv_backend=c10d --max_restarts 0 echo.py

and I get errors like this:

[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:7000 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to 0.0.0.0:7000 (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:7000 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:7000 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to 0.0.0.0:7000 (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
[W socket.cpp:464] [c10d] The server socket has failed to bind to 0.0.0.0:7000 (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/users/jsmidt/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/users/jsmidt/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/users/jsmidt/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1066, in launch_command
    multi_gpu_launcher(args)
  File "/users/jsmidt/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 711, in multi_gpu_launcher
    distrib_run.run(args)
...
  File "/users/jsmidt/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 705, in _initialize_workers
    self._rendezvous(worker_group)
  File "/users/jsmidt/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/users/jsmidt/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 548, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/users/jsmidt/.local/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:7000 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:7000 (errno: 98 - Address already in use).

I have tried this with and without dist.init_process_group defined in echo.py. I am now at a loss for what to try next. Any suggestions would be appreciated. Thanks!


Even though this post is old, I'd like to share how I solved a similar issue in the past when using torchrun.

I added these lines to my launcher while trying to understand what was going on:

# OPTIONAL: set to true if you want more details on NCCL communications
DEBUG=true
if [ "$DEBUG" == "true" ]; then
    export LOGLEVEL=INFO
    export NCCL_DEBUG=TRACE
    export TORCH_CPP_LOG_LEVEL=INFO
    export NCCL_ASYNC_ERROR_HANDLING=1
else
    echo "Debug mode is off."
fi

The trap is that one must distinguish between Slurm's --ntasks-per-node, which should be 1, and the --nproc_per_node flag given to torchrun, which should be --nproc_per_node ${NGPUS_PER_NODE}.

If you don't do this and instead set Slurm's --ntasks-per-node to ${NGPUS_PER_NODE}, communication problems may pop up, because Slurm will duplicate the launcher and the resulting processes will step on each other (see the sketch below).
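To make that concrete, here is a minimal sketch of the torchrun pattern. It is not my exact production script: the node count, GPU count, and port are placeholder assumptions to adapt to your cluster.

#!/bin/bash
#SBATCH --nodes=2               # assumed node count
#SBATCH --ntasks-per-node=1     # ONE launcher task per node; torchrun spawns the workers
#SBATCH --time=0:20:00

NGPUS_PER_NODE=4                # assumed GPUs per node
export MASTER_ADDR=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)
export MASTER_PORT=7000         # any free port

srun torchrun \
    --nnodes $SLURM_NNODES \
    --nproc_per_node $NGPUS_PER_NODE \
    --rdzv_id $SLURM_JOB_ID \
    --rdzv_backend c10d \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    echo.py

With this layout torchrun itself sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker, so the script should read those instead of the SLURM_* variables, and dist.init_process_group(backend="nccl") works without an explicit init_method.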

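The same separation should apply to the accelerate launch command from the original question, since accelerate, like torchrun, spawns the per-GPU workers itself. A minimal sketch, assuming 4 GPUs per node and reusing the MASTER_ADDR/MASTER_PORT exports from above; note the escaped \$SLURM_NODEID, which must expand on each node at run time rather than once at submission time:

NGPUS_PER_NODE=4                # assumed GPUs per node
WORLD_SIZE=$(($SLURM_NNODES * NGPUS_PER_NODE))

# One srun task per node (--ntasks-per-node=1); accelerate forks the per-GPU workers.
srun bash -c "accelerate launch \
    --multi_gpu \
    --num_machines $SLURM_NNODES \
    --num_processes $WORLD_SIZE \
    --machine_rank \$SLURM_NODEID \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    echo.py"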