Hi,
I am trying to perform multi-GPU training using accelerate, but I get a SIGSEGV on my second GPU.
More specifically: I am able to run training normally when configuring accelerate to use only a single GPU; however, if I attempt to use more than one, then when running accelerate launch train_semantic_model.py I obtain the following:
Instantiating trainer...
Instantiating trainer...
Starting training.
Starting training.
0: loss: 6.401798248291016 lr: 0
0: valid loss 5.869513988494873
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3024297 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 1 (pid: 3024298) of binary: /medias/tools/miniconda/envs/audiolm/bin/python
Traceback (most recent call last):
File "/medias/tools/miniconda/envs/audiolm/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/medias/tools/miniconda/envs/audiolm/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/medias/tools/miniconda/envs/audiolm/lib/python3.9/site-packages/accelerate/commands/launch.py", line 970, in launch_command
multi_gpu_launcher(args)
File "/medias/tools/miniconda/envs/audiolm/lib/python3.9/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/medias/tools/miniconda/envs/audiolm/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/medias/tools/miniconda/envs/audiolm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/medias/tools/miniconda/envs/audiolm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=========================================================
train_semantic_model.py FAILED
---------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
---------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-08-01_20:58:02
host : 192.168.1.13
rank : 1 (local_rank: 1)
exitcode : -11 (pid: 3024298)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 3024298
=========================================================
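In case it helps with diagnosis, I can also enable Python's faulthandler at the top of train_semantic_model.py to get a Python-level traceback when the SIGSEGV fires (this only helps if the crash happens while the interpreter is in Python-visible code; a crash deep inside a CUDA/NCCL kernel may still print nothing useful):

```python
import sys
import faulthandler

# Dump the Python traceback of every thread to stderr if a fatal
# signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL) is received.
faulthandler.enable(file=sys.stderr, all_threads=True)
```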
My configuration file looks like this:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: '0,1'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
I am using accelerate 0.21.0 and torch 2.0.1+cu117. The CUDA version currently installed on my system, according to nvidia-smi, is 12.0.
Does anyone have any idea what could cause this segmentation fault? It's probably not an out-of-memory issue, since training on a single GPU works fine with the same hyperparameters.
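If it would help, I can also relaunch with NCCL debug logging enabled and share the output; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables, and the log often points at the collective or device where the crash originates:

```shell
# Enable verbose NCCL logging before relaunching the job.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL

# Then relaunch as before:
# accelerate launch train_semantic_model.py
```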