Command died with <Signals.SIGSEGV: 11>

I trained my model on an HPC cluster using one node with 8 GPUs. After the training and testing process completed, the program crashed with a fatal error that seems to have nothing to do with my code.

The error did not really affect me, since it only occurred after training and testing had finished, but I am curious why it happens.

Fatal Python error: Segmentation fault

Current thread 0x00007f1efb215740 (most recent call first):  
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 31 in get_all  
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 53 in synchronize  
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 67 in barrier  
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 906 in _exit_barrier  
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 863 in _invoke_run  
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 709 in run  
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 125 in wrapper  
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 236 in launch_agent  
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131 in __call__  
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/run.py", line 715 in run  
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/run.py", line 724 in main  
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper  
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/bin/torchrun", line 33 in <module>

Extension modules: torch._C, torch._C._fft, torch._C._linalg, torch._C._nn, torch._C._sparse, torch._C._special, mkl._mklinit, mkl._py_mkl_service, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator (total: 21)
Traceback (most recent call last):
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/bin/accelerate", line 10, in <module>
    sys.exit(main())
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 678, in launch_command
    multi_gpu_launcher(args)
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 354, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '8', 'main.py', '--train', '--test', '--model_type', 'swin', '--num_workers', '8', '--bsz', '256', '--epochs', '50', '--data_dir', '/mnt/cache/share/images']' died with <Signals.SIGSEGV: 11>.
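
A note on reading the last line of the traceback: Python's `subprocess` module reports a child process killed by a signal as a negative return code, and `CalledProcessError` renders that as "died with <Signals.SIGSEGV: 11>". The first traceback also shows that the segfault happened inside the `torchrun` launcher process (in `_exit_barrier`), not in `main.py`, which fits the crash appearing only after training and testing finished. The snippet below is a minimal sketch of how the return code maps to the signal name; `describe_returncode` is a hypothetical helper and the shortened command is only for illustration.

```python
import signal
import subprocess


def describe_returncode(returncode: int) -> str:
    """Hypothetical helper: explain a subprocess return code in words."""
    if returncode < 0:
        # On POSIX, subprocess uses -N to mean "killed by signal N".
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


# Build the same kind of exception that accelerate raised, using a
# shortened, illustrative command instead of the full one from the log.
err = subprocess.CalledProcessError(
    returncode=-signal.SIGSEGV,
    cmd=["torchrun", "main.py"],
)

print(describe_returncode(err.returncode))
print(err)
```

So the `accelerate` traceback only relays that the `torchrun` process it launched was terminated by SIGSEGV; it does not by itself say where the segfault originated.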

I get the same error. Does anyone have a solution?
