I'm getting the error below while pretraining Llama 2 on an A100 GPU, with NCCL version 2.19.3. Everything runs on a single Azure VM with a single A100 GPU, and I'm using the HF Trainer (with DeepSpeed, launched via torchrun) to train the model.
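To make the failing step concrete, here is a minimal sketch of the code path as I understand it, not my actual training script: DeepSpeed's engine init broadcasts the model parameters from rank 0, and that first NCCL collective is where things die. The sketch assumes a single-process torchrun launch and the same NCCL environment variables that appear in the log (NCCL_TOPO_FILE / NCCL_GRAPH_FILE pointing at /opt/microsoft/ncv4/).

```python
# Minimal sketch, not the real training script: my understanding of the call that fails.
# Assumes launch via `torchrun --nproc_per_node=1 repro.py` so RANK / WORLD_SIZE /
# LOCAL_RANK are set, and that the same NCCL env vars from the log are exported.
import os

import torch
import torch.distributed as dist


def main():
    # HF Trainer / DeepSpeed set up the default process group with the NCCL backend.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

    # DeepSpeed's _broadcast_model() broadcasts every parameter from the source rank.
    # Even with a single rank, the first collective lazily creates the NCCL communicator,
    # which reads NCCL_TOPO_FILE / NCCL_GRAPH_FILE; the "XML Import Channel : dev 1 not
    # found" warning in the log appears to come from that step.
    dummy = torch.ones(1, device="cuda")
    dist.broadcast(dummy, src=0)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The full NCCL debug output and traceback are below: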
Spotllm:73025:73025 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.4<0>
Spotllm:73025:73025 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
Spotllm:73025:73025 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda12.3
Spotllm:73025:73112 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
Spotllm:73025:73112 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.4<0>
Spotllm:73025:73112 [0] NCCL INFO Using non-device net plugin version 0
Spotllm:73025:73112 [0] NCCL INFO Using network Socket
Spotllm:73025:73112 [0] NCCL INFO comm 0x2012b5e0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 commId 0x151ada46fe52b960 - Init START
Spotllm:73025:73112 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/microsoft/ncv4/topo.xml
Spotllm:73025:73112 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
Spotllm:73025:73112 [0] NCCL INFO NCCL_GRAPH_FILE set by environment to /opt/microsoft/ncv4/graph.xml
Spotllm:73025:73112 [0] graph/search.cc:719 NCCL WARN XML Import Channel : dev 1 not found.
Spotllm:73025:73112 [0] NCCL INFO graph/search.cc:749 → 2
Spotllm:73025:73112 [0] NCCL INFO graph/search.cc:756 → 2
Spotllm:73025:73112 [0] NCCL INFO graph/search.cc:873 → 2
Spotllm:73025:73112 [0] NCCL INFO init.cc:921 → 2
Spotllm:73025:73112 [0] NCCL INFO init.cc:1396 → 2
Spotllm:73025:73112 [0] NCCL INFO group.cc:64 → 2 [Async thread]
Spotllm:73025:73025 [0] NCCL INFO group.cc:418 → 2
Spotllm:73025:73025 [0] NCCL INFO group.cc:95 → 2
Traceback (most recent call last):
  File "run_clm_with_peft.py", line 937, in <module>
    main()
  File "run_clm_with_peft.py", line 899, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/azureuser/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/home/azureuser/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1933, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/home/azureuser/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1255, in prepare
    result = self._prepare_deepspeed(*args)
  File "/home/azureuser/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1640, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/__init__.py", line 176, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
    self._configure_distributed_model(model)
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1157, in _configure_distributed_model
    self._broadcast_model()
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1077, in _broadcast_model
    dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 205, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work = group.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
XML Import Channel : dev 1 not found.
Spotllm:73025:73025 [0] NCCL INFO comm 0x2012b5e0 rank 0 nranks 1 cudaDev 0 busId 100000 - Abort COMPLETE
[2024-03-29 17:38:19,073] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 73025) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/azureuser/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
run_clm_with_peft.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-03-29_17:38:19
host : spotllm.internal.cloudapp.net
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 73025)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html