I trained my model on an HPC cluster with a single node and 8 GPUs. After the training and testing process completed, the program crashed with a fatal error that appears to have nothing to do with my code.
In practice the error does not hurt me much, since it only occurs after training and testing have finished, but I am curious why it happens. The full output is below.
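For reference, I launch the job through accelerate. Judging from the subprocess command in the traceback at the bottom, the launch was equivalent to something like the following (the accelerate-side flags are my reconstruction, not copied from my actual config):

accelerate launch --multi_gpu --num_processes 8 main.py --train --test --model_type swin --num_workers 8 --bsz 256 --epochs 50 --data_dir /mnt/cache/share/images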
Fatal Python error: Segmentation fault

Current thread 0x00007f1efb215740 (most recent call first):
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 31 in get_all
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 53 in synchronize
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 67 in barrier
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 906 in _exit_barrier
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 863 in _invoke_run
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 709 in run
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 125 in wrapper
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 236 in launch_agent
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131 in __call__
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/run.py", line 715 in run
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/run.py", line 724 in main
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/bin/torchrun", line 33 in <module>
Extension modules: torch._C, torch._C._fft, torch._C._linalg, torch._C._nn, torch._C._sparse, torch._C._special, mkl._mklinit, mkl._py_mkl_service, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator (total: 21)
Traceback (most recent call last):
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/bin/accelerate", line 10, in <module>
    sys.exit(main())
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 678, in launch_command
    multi_gpu_launcher(args)
  File "/mnt/cache/zhengchjun.vendor/anaconda3/envs/torch1.11_cuda11.3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 354, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '8', 'main.py', '--train', '--test', '--model_type', 'swin', '--num_workers', '8', '--bsz', '256', '--epochs', '50', '--data_dir', '/mnt/cache/share/images']' died with <Signals.SIGSEGV: 11>.
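If it is relevant: reading the "most recent call first" stack above, the segmentation fault happens inside torchrun's elastic agent, in the exit barrier it runs after all workers have returned (get_all in store.py, reached via _exit_barrier), so it is indeed outside my own code. My script currently relies on interpreter shutdown for all distributed cleanup. Below is a minimal sketch (hypothetical, not taken from my actual script) of the explicit teardown I could add at the end of main.py in case the missing cleanup is related; torch.distributed.destroy_process_group() is a real API, but whether it interacts with the agent's exit barrier is only my assumption:

import torch.distributed as dist

def main():
    ...  # training and testing as before

    # Make sure every rank reaches this point before any process exits,
    # then tear down the default process group explicitly instead of
    # leaving it to interpreter shutdown.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
        dist.destroy_process_group()

if __name__ == "__main__":
    main()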