I trained my model on an HPC cluster with a single node and 8 GPUs. After the training and testing process completed, the program crashed with a fatal error that appears to have nothing to do with my code.
In practice the error does not hurt me much, since it only occurs after training and testing have finished, but I am curious why it happens. The full output is below.
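For reference, I launch the job through accelerate. Judging from the subprocess command in the traceback at the bottom, the launch was equivalent to something like the following (the accelerate-side flags are my reconstruction, not copied from my actual config):

accelerate launch --multi_gpu --num_processes 8 main.py --train --test --model_type swin --num_workers 8 --bsz 256 --epochs 50 --data_dir /mnt/cache/share/images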
Fatal Python error: Segmentation fault

Current thread 0x00007f1efb215740 (most recent call first):
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 31 in get_all
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 53 in synchronize
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 67 in barrier
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 906 in _exit_barrier
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 863 in _invoke_run
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 709 in run
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 125 in wrapper
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 236 in launch_agent
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131 in __call__
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/run.py", line 715 in run
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/run.py", line 724 in main
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/bin/torchrun", line 33 in <module>
Extension modules: torch._C, torch._C._fft, torch._C._linalg, torch._C._nn, torch._C._sparse, torch._C._special, mkl._mklinit, mkl._py_mkl_service, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator (total: 21)
Traceback (most recent call last):
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/bin/accelerate", line 10, in <module>
    sys.exit(main())
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 678, in launch_command
    multi_gpu_launcher(args)
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 354, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '8', 'main.py', '--train', '--test', '--model_type', 'swin', '--num_workers', '8', '--bsz', '256', '--epochs', '50', '--data_dir', '/mnt/cache/share/images']' died with <Signals.SIGSEGV: 11>.
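If it is relevant: reading the "most recent call first" stack above, the segmentation fault happens inside torchrun's elastic agent, in the exit barrier it runs after all workers have returned (get_all in store.py, reached via _exit_barrier), so it is indeed outside my own code. My script currently relies on interpreter shutdown for all distributed cleanup. Below is a minimal sketch (hypothetical, not taken from my actual script) of the explicit teardown I could add at the end of main.py in case the missing cleanup is related; torch.distributed.destroy_process_group() is a real API, but whether it interacts with the agent's exit barrier is only my assumption:

import torch.distributed as dist

def main():
    ...  # training and testing as before

    # Make sure every rank reaches this point before any process exits,
    # then tear down the default process group explicitly instead of
    # leaving it to interpreter shutdown.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
        dist.destroy_process_group()

if __name__ == "__main__":
    main()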