Why doesn't my Accelerate setup work?

env:
torch 1.7, CUDA 11.0, NCCL 2.7.8
8 V100 GPUs
ubuntu

cmds:

pip install transformers
pip install accelerate

Then I set it up with accelerate config:

Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]: No
How many processes in total will you use? [1]: 4
Do you wish to use FP16 (mixed precision)? [yes/NO]: no

Now the error appears. When I check with accelerate test, I get:


Running:  accelerate-launch --config_file=None /home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py
stdout: *****************************************
stdout: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
stdout: *****************************************
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 4
stdout: Process index: 2
stdout: Local process index: 2
stdout: Device: cuda:2
stdout: Use FP16 precision: False
stdout: 
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 4
stdout: Process index: 3
stdout: Local process index: 3
stdout: Device: cuda:3
stdout: Use FP16 precision: False
stdout: 
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 4
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout: Use FP16 precision: False
stdout: 
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 4
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout: Use FP16 precision: False
stdout: 
stdout: 
stdout: **Test random number generator synchronization**
stdout: All rng are properly synched.
stdout: 
stdout: **DataLoader integration test**
stdout: Non-shuffled dataloader passing.
stdout: Shuffled dataloader passing.
stderr: Traceback (most recent call last):
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 291, in <module>
stderr: Traceback (most recent call last):
stderr:     main()
stderr: Traceback (most recent call last):
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 283, in main
stderr: Traceback (most recent call last):
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 291, in <module>
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 291, in <module>
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 291, in <module>
stderr:     central_dl_preparation_check()
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 120, in central_dl_preparation_check
stderr:     for batch in dl:
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/data_loader.py", line 362, in __iter__
stderr:     broadcast_object_list(batch_info)
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/utils.py", line 403, in broadcast_object_list
stderr:     torch.distributed.broadcast_object_list(object_list, src=from_process)
stderr: AttributeError: module 'torch.distributed' has no attribute 'broadcast_object_list'
stderr:     main()
stderr:     main()
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 283, in main
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 283, in main
stderr:     main()
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 283, in main
stderr:     central_dl_preparation_check()
stderr:     central_dl_preparation_check()
stderr:     central_dl_preparation_check()
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 120, in central_dl_preparation_check
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 120, in central_dl_preparation_check
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 120, in central_dl_preparation_check
stderr:     for batch in dl:
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/data_loader.py", line 362, in __iter__
stderr:     for batch in dl:
stderr:     for batch in dl:
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/data_loader.py", line 362, in __iter__
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/data_loader.py", line 362, in __iter__
stderr:     broadcast_object_list(batch_info)
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/utils.py", line 403, in broadcast_object_list
stderr:     broadcast_object_list(batch_info)
stderr:     broadcast_object_list(batch_info)
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/utils.py", line 403, in broadcast_object_list
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/utils.py", line 403, in broadcast_object_list
stderr:     torch.distributed.broadcast_object_list(object_list, src=from_process)
stderr:     torch.distributed.broadcast_object_list(object_list, src=from_process)
stderr:     torch.distributed.broadcast_object_list(object_list, src=from_process)
stderr: AttributeError: module 'torch.distributed' has no attribute 'broadcast_object_list'
stderr: AttributeError: module 'torch.distributed' has no attribute 'broadcast_object_list'
stderr: AttributeError: module 'torch.distributed' has no attribute 'broadcast_object_list'
stderr: Traceback (most recent call last):
stderr:   File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
stderr:     "__main__", mod_spec)
stderr:   File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
stderr:     exec(code, run_globals)
stderr:   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 261, in <module>
stderr:     main()
stderr:   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 257, in main
stderr:     cmd=cmd)
stderr: subprocess.CalledProcessError: Command '['/opt/conda/bin/python3', '-u', '/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py']' returned non-zero exit status 1.
stderr: Traceback (most recent call last):
stderr:   File "/home/me/.local/bin/accelerate-launch", line 8, in <module>
stderr:     sys.exit(main())
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/commands/launch.py", line 390, in main
stderr:     launch_command(args)
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/commands/launch.py", line 378, in launch_command
stderr:     multi_gpu_launcher(args)
stderr:   File "/home/me/.local/lib/python3.6/site-packages/accelerate/commands/launch.py", line 176, in multi_gpu_launcher
stderr:     raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
stderr: subprocess.CalledProcessError: Command '['/opt/conda/bin/python3', '-m', 'torch.distributed.launch', '--use_env', '--nproc_per_node', '4', '/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/home/me/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/commands/accelerate_cli.py", line 41, in main
    args.func(args)
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/commands/test.py", line 52, in test_command
    result = execute_subprocess_async(cmd, env=os.environ.copy())
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/testing.py", line 135, in execute_subprocess_async
    f"'{cmd_str}' failed with returncode {result.returncode}\n\n"
RuntimeError: 'accelerate-launch --config_file=None /home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py' failed with returncode 1

The combined stderr from workers follows:
Traceback (most recent call last):
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 291, in <module>
Traceback (most recent call last):
    main()
Traceback (most recent call last):
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 283, in main
Traceback (most recent call last):
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 291, in <module>
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 291, in <module>
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 291, in <module>
    central_dl_preparation_check()
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 120, in central_dl_preparation_check
    for batch in dl:
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/data_loader.py", line 362, in __iter__
    broadcast_object_list(batch_info)
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/utils.py", line 403, in broadcast_object_list
    torch.distributed.broadcast_object_list(object_list, src=from_process)
AttributeError: module 'torch.distributed' has no attribute 'broadcast_object_list'
    main()
    main()
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 283, in main
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 283, in main
    main()
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 283, in main
    central_dl_preparation_check()
    central_dl_preparation_check()
    central_dl_preparation_check()
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 120, in central_dl_preparation_check
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 120, in central_dl_preparation_check
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py", line 120, in central_dl_preparation_check
    for batch in dl:
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/data_loader.py", line 362, in __iter__
    for batch in dl:
    for batch in dl:
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/data_loader.py", line 362, in __iter__
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/data_loader.py", line 362, in __iter__
    broadcast_object_list(batch_info)
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/utils.py", line 403, in broadcast_object_list
    broadcast_object_list(batch_info)
    broadcast_object_list(batch_info)
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/utils.py", line 403, in broadcast_object_list
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/utils.py", line 403, in broadcast_object_list
    torch.distributed.broadcast_object_list(object_list, src=from_process)
    torch.distributed.broadcast_object_list(object_list, src=from_process)
    torch.distributed.broadcast_object_list(object_list, src=from_process)
AttributeError: module 'torch.distributed' has no attribute 'broadcast_object_list'
AttributeError: module 'torch.distributed' has no attribute 'broadcast_object_list'
AttributeError: module 'torch.distributed' has no attribute 'broadcast_object_list'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python3', '-u', '/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/home/me/.local/bin/accelerate-launch", line 8, in <module>
    sys.exit(main())
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/commands/launch.py", line 390, in main
    launch_command(args)
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/commands/launch.py", line 378, in launch_command
    multi_gpu_launcher(args)
  File "/home/me/.local/lib/python3.6/site-packages/accelerate/commands/launch.py", line 176, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python3', '-m', 'torch.distributed.launch', '--use_env', '--nproc_per_node', '4', '/home/me/.local/lib/python3.6/site-packages/accelerate/test_utils/test_script.py']' returned non-zero exit status 1.

What does all of this mean?
I have searched for the message AttributeError: module 'torch.distributed' has no attribute 'broadcast_object_list' many times, but I still don't know how to fix it.

Could anyone help me figure this out? Thanks a lot!


@sgugger could you please help me?

I'll have a look later today. I think it's a problem with older versions of PyTorch, so upgrading should solve the issue if you can, but I will fix it for older versions too :)
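To confirm that the installed PyTorch is the cause, a quick check along these lines may help. It is only a sketch: the helper names `has_collective` and `check_installed_torch` are made up for illustration and are not part of Accelerate or PyTorch. It reports the installed torch version and whether `torch.distributed` exposes `broadcast_object_list`, the attribute named in the traceback.

```python
import importlib


def has_collective(dist_module, name="broadcast_object_list"):
    """Return True if `dist_module` exposes a callable attribute `name`.

    Hypothetical helper: it only inspects the module object it is given.
    """
    return callable(getattr(dist_module, name, None))


def check_installed_torch():
    """Return (torch version, attribute present?) or None if torch is missing."""
    try:
        torch = importlib.import_module("torch")
        dist = importlib.import_module("torch.distributed")
    except ImportError:
        return None  # torch is not installed in this environment
    return torch.__version__, has_collective(dist)


if __name__ == "__main__":
    # With the environment from the question (torch 1.7), the traceback
    # suggests the second element of the tuple would be False.
    print(check_installed_torch())
```

If the second value is False, the installed PyTorch predates that function, and upgrading torch (as far as I know, to 1.8 or later) to a build that matches your CUDA version is the likely way forward.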
