Environment:
PyTorch 1.7, CUDA 11.0, NCCL 2.7.8
8 V100 GPUs
Ubuntu
Commands:
pip install transformers
pip install accelerate
Then I set it up with accelerate config:
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]: No
How many processes in total will you use? [1]: 4
Do you wish to use FP16 (mixed precision)? [yes/NO]: no
The error appears when I check the setup with accelerate test. I got:
Running: accelerate-launch --config_file=None /home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py
stdout: *****************************************
stdout: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
stdout: *****************************************
stdout: Distributed environment: MULTI_GPU Backend: nccl
stdout: Num processes: 4
stdout: Process index: 2
stdout: Local process index: 2
stdout: Device: cuda:2
stdout: Use FP16 precision: False
stdout:
stdout: Distributed environment: MULTI_GPU Backend: nccl
stdout: Num processes: 4
stdout: Process index: 3
stdout: Local process index: 3
stdout: Device: cuda:3
stdout: Use FP16 precision: False
stdout:
stdout: Distributed environment: MULTI_GPU Backend: nccl
stdout: Num processes: 4
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout: Use FP16 precision: False
stdout:
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: MULTI_GPU Backend: nccl
stdout: Num processes: 4
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout: Use FP16 precision: False
stdout:
stdout:
stdout: **Test random number generator synchronization**
stdout: All rng are properly synched.
stdout:
stdout: **DataLoader integration test**
stdout: Non-shuffled dataloader passing.
stdout: Shuffled dataloader passing.
stderr: Traceback (most recent call last):
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 291, in <module>
stderr: Traceback (most recent call last):
stderr: main()
stderr: Traceback (most recent call last):
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 283, in main
stderr: Traceback (most recent call last):
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 291, in <module>
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 291, in <module>
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 291, in <module>
stderr: central_dl_preparation_check()
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 120, in central_dl_preparation_check
stderr: for batch in dl:
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/data_loader.py", line 362, in __iter__
stderr: broadcast_object_list(batch_info)
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/utils.py", line 403, in broadcast_object_list
stderr: torch.distributed.broadcast_object_list(object_list, src=from_process)
stderr: AttributeError: module 'torch.distributed' has no attribute 'broadcast_object_list'
stderr: main()
stderr: main()
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 283, in main
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 283, in main
stderr: main()
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 283, in main
stderr: central_dl_preparation_check()
stderr: central_dl_preparation_check()
stderr: central_dl_preparation_check()
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 120, in central_dl_preparation_check
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 120, in central_dl_preparation_check
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 120, in central_dl_preparation_check
stderr: for batch in dl:
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/data_loader.py", line 362, in __iter__
stderr: for batch in dl:
stderr: for batch in dl:
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/data_loader.py", line 362, in __iter__
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/data_loader.py", line 362, in __iter__
stderr: broadcast_object_list(batch_info)
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/utils.py", line 403, in broadcast_object_list
stderr: broadcast_object_list(batch_info)
stderr: broadcast_object_list(batch_info)
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/utils.py", line 403, in broadcast_object_list
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/utils.py", line 403, in broadcast_object_list
stderr: torch.distributed.broadcast_object_list(object_list, src=from_process)
stderr: torch.distributed.broadcast_object_list(object_list, src=from_process)
stderr: torch.distributed.broadcast_object_list(object_list, src=from_process)
stderr: AttributeError: module 'torch.distributed' has no attribute 'broadcast_object_list'
stderr: AttributeError: module 'torch.distributed' has no attribute 'broadcast_object_list'
stderr: AttributeError: module 'torch.distributed' has no attribute 'broadcast_object_list'
stderr: Traceback (most recent call last):
stderr: File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
stderr: "__main__", mod_spec)
stderr: File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
stderr: exec(code, run_globals)
stderr: File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 261, in <module>
stderr: main()
stderr: File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 257, in main
stderr: cmd=cmd)
stderr: subprocess.CalledProcessError: Command '['/opt/conda/bin/python3', '-u', '/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py']' returned non-zero exit status 1.
stderr: Traceback (most recent call last):
stderr: File "/home/me/.local/bin/accelerate-launch", line 8, in <module>
stderr: sys.exit(main())
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/commands/launch.py", line 390, in main
stderr: launch_command(args)
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/commands/launch.py", line 378, in launch_command
stderr: multi_gpu_launcher(args)
stderr: File "/home/me/.local/lib/python3.6/site-packages/accelerate/commands/launch.py", line 176, in multi_gpu_launcher
stderr: raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
stderr: subprocess.CalledProcessError: Command '['/opt/conda/bin/python3', '-m', 'torch.distributed.launch', '--use_env', '--nproc_per_node', '4', '/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py']' returned non-zero exit status 1.
Traceback (most recent call last):
File "/home/me/.local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/me/.local/lib/python3.6/site-packages/accelerate/commands/accelerate_cli.py", line 41, in main
args.func(args)
File "/home/me/.local/lib/python3.6/site-packages/accelerate/commands/test.py", line 52, in test_command
result = execute_subprocess_async(cmd, env=os.environ.copy())
File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/testing.py", line 135, in execute_subprocess_async
f"'{cmd_str}' failed with returncode {result.returncode}\n\n"
RuntimeError: 'accelerate-launch --config_file=None /home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py' failed with returncode 1
The combined stderr from workers follows:
Traceback (most recent call last):
File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 291, in <module>
Traceback (most recent call last):
main()
Traceback (most recent call last):
File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 283, in main
Traceback (most recent call last):
File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 291, in <module>
File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 291, in <module>
File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 291, in <module>
central_dl_preparation_check()
File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 120, in central_dl_preparation_check
for batch in dl:
File "/home/me/.local/lib/python3.6/site-packages/accelerate/data_loader.py", line 362, in __iter__
broadcast_object_list(batch_info)
File "/home/me/.local/lib/python3.6/site-packages/accelerate/utils.py", line 403, in broadcast_object_list
torch.distributed.broadcast_object_list(object_list, src=from_process)
AttributeError: module 'torch.distributed' has no attribute 'broadcast_object_list'
main()
main()
File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 283, in main
File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 283, in main
main()
File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 283, in main
central_dl_preparation_check()
central_dl_preparation_check()
central_dl_preparation_check()
File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 120, in central_dl_preparation_check
File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 120, in central_dl_preparation_check
File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 120, in central_dl_preparation_check
for batch in dl:
File "/home/me/.local/lib/python3.6/site-packages/accelerate/data_loader.py", line 362, in __iter__
for batch in dl:
for batch in dl:
File "/home/me/.local/lib/python3.6/site-packages/accelerate/data_loader.py", line 362, in __iter__
File "/home/me/.local/lib/python3.6/site-packages/accelerate/data_loader.py", line 362, in __iter__
broadcast_object_list(batch_info)
File "/home/me/.local/lib/python3.6/site-packages/accelerate/utils.py", line 403, in broadcast_object_list
broadcast_object_list(batch_info)
broadcast_object_list(batch_info)
File "/home/me/.local/lib/python3.6/site-packages/accelerate/utils.py", line 403, in broadcast_object_list
File "/home/me/.local/lib/python3.6/site-packages/accelerate/utils.py", line 403, in broadcast_object_list
torch.distributed.broadcast_object_list(object_list, src=from_process)
torch.distributed.broadcast_object_list(object_list, src=from_process)
torch.distributed.broadcast_object_list(object_list, src=from_process)
AttributeError: module 'torch.distributed' has no attribute 'broadcast_object_list'
AttributeError: module 'torch.distributed' has no attribute 'broadcast_object_list'
AttributeError: module 'torch.distributed' has no attribute 'broadcast_object_list'
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 261, in <module>
main()
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 257, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python3', '-u', '/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py']' returned non-zero exit status 1.
Traceback (most recent call last):
File "/home/me/.local/bin/accelerate-launch", line 8, in <module>
sys.exit(main())
File "/home/me/.local/lib/python3.6/site-packages/accelerate/commands/launch.py", line 390, in main
launch_command(args)
File "/home/me/.local/lib/python3.6/site-packages/accelerate/commands/launch.py", line 378, in launch_command
multi_gpu_launcher(args)
File "/home/me/.local/lib/python3.6/site-packages/accelerate/commands/launch.py", line 176, in multi_gpu_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python3', '-m', 'torch.distributed.launch', '--use_env', '--nproc_per_node', '4', '/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py']' returned non-zero exit status 1.
What do all these messages mean?
I have googled the message AttributeError: module 'torch.distributed' has no attribute 'broadcast_object_list' hundreds of times, but I still don't know how to fix it.
Could anyone help me figure this out? Thanks a lot!
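For anyone trying to reproduce this, here is a minimal sketch of a check for whether the installed torch.distributed module exposes the function named in the traceback. The helper name module_has_attr is hypothetical (it is not part of accelerate or torch), and the check only reports whether the attribute exists. It does not explain why it is missing.

```python
import importlib


def module_has_attr(module_name, attr_name):
    """Return True/False if the module imports, or None if it cannot be imported.

    Hypothetical helper for illustration only.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module (for example torch) is not installed in this interpreter.
        return None
    return hasattr(module, attr_name)


if __name__ == "__main__":
    # Run this with the same interpreter that accelerate uses
    # (/opt/conda/bin/python3 in the traceback above).
    found = module_has_attr("torch.distributed", "broadcast_object_list")
    print("torch.distributed.broadcast_object_list available:", found)
```

If this prints False under the interpreter shown in the traceback, the installed torch build lacks the function that accelerate is calling, which matches the AttributeError above.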