Code RuntimeError

When I run the `accelerate config` command, I set the parameters as follows:
In which compute environment are you running?
This machine
------------------------------------------------------------------------------------
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/no]: yes
Do you wish to optimize your script with torch dynamo?[yes/no]:no
Do you want to use DeepSpeed? [yes/no]: no
Do you want to use FullyShardedDataParallel? [yes/no]: no
Do you want to use Megatron-LM ? [yes/no]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
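For reference, a quick sanity check that the GPU ids chosen above (0 and 1) are actually visible to the process; a small sketch assuming a CUDA-enabled PyTorch install:

```python
import torch

# The config above selects GPU ids 0 and 1, so we expect at least
# two visible CUDA devices here.
print(torch.cuda.device_count())  # expect 2
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```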

Then I run `accelerate launch main.py --temperature 0.2 --n_samples 1`.

The program then gets stuck. Output:
Selected Tasks: ['humaneval']
Loading model in fp32
Loading model via these GPUs & max memories: {0: '40GB', 1: '40GB'}
/root/anaconda/envs/bigcode/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py:479: FutureWarning: The use_auth_token argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Loading checkpoint shards: 100%|██████████████████████| 3/3 [00:28<00:00, 9.36s/it]
Loading checkpoint shards: 100%|██████████████████████| 3/3 [00:28<00:00, 9.39s/it]
/root/anaconda/envs/bigcode/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:640: FutureWarning: The use_auth_token argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
/root/anaconda/envs/bigcode/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:640: FutureWarning: The use_auth_token argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
number of problems for this task is 164
0%| | 0/82 [00:00<?, ?it/s]

RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801145 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 63148) of binary: /root/anaconda/envs/bigcode/bin/python

Can you share your script? It looks like a timeout was hit before the all_gather could actually occur.
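If the first collective is simply slow (for example, one rank is still loading checkpoint shards while the other is already waiting in the all_gather), a common workaround is to raise the NCCL timeout when creating the Accelerator. A minimal sketch, assuming your script constructs the Accelerator itself:

```python
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

# Raise the collective timeout from the default 30 minutes so a slow rank
# does not trip the NCCL watchdog during the first all_gather.
ipg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=3))
accelerator = Accelerator(kwargs_handlers=[ipg_kwargs])
```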

Update: the original timeout was fixed by setting `export NCCL_P2P_DISABLE=1`. However, even with `export NCCL_P2P_DISABLE=1` set, the run now crashes after processing a single batch of data, so I have a new problem:
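Note that NCCL_P2P_DISABLE must be set before the process group is initialized. Besides exporting it in the shell, it can also be set at the very top of the script; a minimal sketch:

```python
import os

# Equivalent to `export NCCL_P2P_DISABLE=1` in the shell. It must run before
# torch/accelerate initialize NCCL, so keep it at the top of the script.
os.environ["NCCL_P2P_DISABLE"] = "1"
```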
warnings.warn(
number of problems for this task is 164
1%|▋ | 1/82 [00:54<1:13:20, 54.33s/it]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 61144 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 61145) of binary: /root/anaconda/envs/bigcode1/bin/python
Traceback (most recent call last):
  File "/root/anaconda/envs/bigcode1/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/accelerate/commands/launch.py", line 977, in launch_command
    multi_gpu_launcher(args)
  File "/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda/envs/bigcode1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

01.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-22_12:57:04
  host      : rt-res-public9-6f8f8bd4fc-92zc9
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 61145)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 61145
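Exit code -6 (SIGABRT) on rank 1 usually hides the underlying CUDA/NCCL error. One way to get more detail (a generic debugging sketch, not specific to this harness) is to enable NCCL logging and synchronous CUDA launches before the run:

```python
import os

# Diagnostic settings; set before NCCL/CUDA are initialized.
# NCCL_DEBUG=INFO prints NCCL's internal log, and CUDA_LAUNCH_BLOCKING=1
# makes kernel launches synchronous so the failing op is reported directly.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")
```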