ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error

I am getting this error while pretraining Llama 2 on an A100 GPU, with NCCL version 2.19.3. The run is on a single VM with a single A100, and the model is trained with the HF Trainer (DeepSpeed enabled, launched via torchrun).
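To take the Trainer/DeepSpeed stack out of the picture, I think the failing path can be exercised on its own: DeepSpeed's _broadcast_model() ends up in torch.distributed.broadcast, and the first collective is what triggers the NCCL communicator init that dies on the topology XML. Below is a minimal sketch of such a standalone check (the script name, and the use of LOCAL_RANK as the device index, are my own assumptions, not part of run_clm_with_peft.py):

```python
# minimal_nccl_check.py -- hypothetical standalone repro, not part of the training code.
# It only exercises the same path as the traceback below: torchrun -> torch.distributed
# with the NCCL backend -> first collective (broadcast) -> NCCL communicator init.
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT for us.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

    t = torch.ones(1, device="cuda")
    # The first collective lazily creates the NCCL communicator; with the same
    # NCCL_* environment this should hit the same init that fails in the log below.
    dist.broadcast(t, src=0)
    print("NCCL init OK:", t.item())

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched the same way as the training run, e.g. torchrun --nproc_per_node=1 minimal_nccl_check.py, with the same NCCL_TOPO_FILE, NCCL_GRAPH_FILE and NCCL_IB_DISABLE environment variables in place.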

Spotllm:73025:73025 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.4<0>
Spotllm:73025:73025 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
Spotllm:73025:73025 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda12.3
Spotllm:73025:73112 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
Spotllm:73025:73112 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.4<0>
Spotllm:73025:73112 [0] NCCL INFO Using non-device net plugin version 0
Spotllm:73025:73112 [0] NCCL INFO Using network Socket
Spotllm:73025:73112 [0] NCCL INFO comm 0x2012b5e0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 commId 0x151ada46fe52b960 - Init START
Spotllm:73025:73112 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/microsoft/ncv4/topo.xml
Spotllm:73025:73112 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
Spotllm:73025:73112 [0] NCCL INFO NCCL_GRAPH_FILE set by environment to /opt/microsoft/ncv4/graph.xml

Spotllm:73025:73112 [0] graph/search.cc:719 NCCL WARN XML Import Channel : dev 1 not found.

Spotllm:73025:73112 [0] NCCL INFO graph/search.cc:749 → 2
Spotllm:73025:73112 [0] NCCL INFO graph/search.cc:756 → 2
Spotllm:73025:73112 [0] NCCL INFO graph/search.cc:873 → 2
Spotllm:73025:73112 [0] NCCL INFO init.cc:921 → 2
Spotllm:73025:73112 [0] NCCL INFO init.cc:1396 → 2
Spotllm:73025:73112 [0] NCCL INFO group.cc:64 → 2 [Async thread]
Spotllm:73025:73025 [0] NCCL INFO group.cc:418 → 2
Spotllm:73025:73025 [0] NCCL INFO group.cc:95 → 2
Traceback (most recent call last):
  File "run_clm_with_peft.py", line 937, in <module>
    main()
  File "run_clm_with_peft.py", line 899, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/azureuser/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/home/azureuser/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1933, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/home/azureuser/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1255, in prepare
    result = self._prepare_deepspeed(*args)
  File "/home/azureuser/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1640, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/__init__.py", line 176, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
    self._configure_distributed_model(model)
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1157, in _configure_distributed_model
    self._broadcast_model()
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1077, in _broadcast_model
    dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 205, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work = group.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
XML Import Channel : dev 1 not found.
Spotllm:73025:73025 [0] NCCL INFO comm 0x2012b5e0 rank 0 nranks 1 cudaDev 0 busId 100000 - Abort COMPLETE
[2024-03-29 17:38:19,073] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 73025) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/azureuser/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run_clm_with_peft.py FAILED

Failures:

<NO_OTHER_FAILURES>

Root Cause (first observed failure):

[0]:
time : 2024-03-29_17:38:19
host : spotllm.internal.cloudapp.net
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 73025)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html