When I use the accelerate config command, I set the parameters as follows:
In which compute environment are you running?
This machine
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/no]: yes
Do you wish to optimize your script with torch dynamo?[yes/no]:no
Do you want to use DeepSpeed? [yes/no]: no
Do you want to use FullyShardedDataParallel? [yes/no]: no
Do you want to use Megatron-LM ? [yes/no]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
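For reference, my understanding is that with this configuration a minimal Accelerate script should be able to all-gather a tensor across both GPUs. Something like the sketch below (my own sanity check, not part of main.py; check_nccl.py is just a name I made up) should finish immediately if NCCL communication between GPU 0 and GPU 1 is healthy:

# check_nccl.py - minimal sketch of an NCCL sanity check for the 2-GPU setup above
# run with: accelerate launch check_nccl.py
import torch
from accelerate import Accelerator

accelerator = Accelerator()
# each rank contributes its own index; gather() issues the same kind of
# all-gather collective that times out in the real run
local = torch.tensor([accelerator.process_index], device=accelerator.device)
gathered = accelerator.gather(local)
accelerator.print(f"gathered ranks: {gathered.tolist()}")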
When I run accelerate launch main.py --temperature 0.2 --n_samples 1, the program gets stuck. The output is:
Selected Tasks: ['humaneval']
Loading model in fp32
Loading model via these GPUs & max memories: {0: '40GB', 1: '40GB'}
/root/anaconda/envs/bigcode/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py:479: FutureWarning: The use_auth_token argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Loading checkpoint shards: 100%|██████████████████████| 3/3 [00:28<00:00, 9.36s/it]
Loading checkpoint shards: 100%|██████████████████████| 3/3 [00:28<00:00, 9.39s/it]
/root/anaconda/envs/bigcode/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:640: FutureWarning: The use_auth_token argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
/root/anaconda/envs/bigcode/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:640: FutureWarning: The use_auth_token argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
number of problems for this task is 164
0%| | 0/82 [00:00<?, ?it/s]
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801145 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 63148) of binary: /root/anaconda/envs/bigcode/bin/python
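From the traceback, the ALLGATHER collective on rank 0 waited the default 1800000 ms (30 minutes) and was then aborted by the NCCL watchdog. I assume the timeout itself could be raised by passing InitProcessGroupKwargs to the Accelerator, roughly as in the sketch below (an assumption on my side; main.py builds its own Accelerator, so the change would have to go there), but I suspect this would only postpone the error rather than fix whatever keeps the other rank from reaching the collective:

# sketch of raising the NCCL collective timeout (my assumption, not the harness's actual code)
from datetime import timedelta
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# default timeout is 30 minutes; extend it so a slow rank does not trip the watchdog
ipg = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(kwargs_handlers=[ipg])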