Multi-GPU training sometimes works with 2 GPUs, but never with more than 2

mario-dg · July 14, 2023, 9:49am

Hey everybody,
for my master’s thesis I’m currently trying to run class-conditional diffusion on microscopy images.
For this I need images with a resolution of 512x512, so I’m relying on a compute cluster provided by my university. Training on 1 GPU results in an epoch time of 32-45 min, which is not feasible for me. But I can’t seem to get multi-GPU training working correctly. These are my specs:

- `Accelerate` version: 0.21.0.dev0
- Platform: Linux-3.10.0-1160.83.1.el7.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.16
- Numpy version: 1.24.3
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- PyTorch XPU available: False
- System RAM: 355.40 GB
- GPU type: Quadro P6000
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: fp16
        - use_cpu: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: 0,1,2,3
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Docker version 23.0.1, build a5ee5b1
GPUs: 10x Quadro P6000 24GB

And this is the error message I get when I run `accelerate test`:
Running:  accelerate-launch /opt/conda/envs/accelerate/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: [2023-07-13 20:19:50,932] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: [2023-07-13 20:19:55,740] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: [2023-07-13 20:19:55,745] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: [2023-07-13 20:19:55,758] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: [2023-07-13 20:19:55,763] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 4
stdout: Process index: 3
stdout: Local process index: 3
stdout: Device: cuda:3
stdout: 
stdout: Mixed precision type: fp16
stdout: 
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 4
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout: 
stdout: Mixed precision type: fp16
stdout: 
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 4
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout: 
stdout: Mixed precision type: fp16
stdout: 
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 4
stdout: Process index: 2
stdout: Local process index: 2
stdout: Device: cuda:2
stdout: 
stdout: Mixed precision type: fp16
stdout: 
stdout: [20:20:03] ERROR    failed (exitcode: -7) local_rank: 0 (pid: 927) of binary: /opt/conda/envs/accelerate/bin/python3                                                             api.py:672
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/conda/envs/accelerate/bin/accelerate:8 in <module>                                          │
│                                                                                                  │
│   5 from accelerate.commands.accelerate_cli import main                                          │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(main())                                                                         │
│   9                                                                                              │
│                                                                                                  │
│ /opt/conda/envs/accelerate/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py:45  │
│ in main                                                                                          │
│                                                                                                  │
│   42 │   │   exit(1)                                                                             │
│   43 │                                                                                           │
│   44 │   # Run                                                                                   │
│ ❱ 45 │   args.func(args)                                                                         │
│   46                                                                                             │
│   47                                                                                             │
│   48 if __name__ == "__main__":                                                                  │
│                                                                                                  │
│ /opt/conda/envs/accelerate/lib/python3.8/site-packages/accelerate/commands/test.py:54 in         │
│ test_command                                                                                     │
│                                                                                                  │
│   51 │   │   test_args = f"--config_file={args.config_file} {script_name}"                       │
│   52 │                                                                                           │
│   53 │   cmd = ["accelerate-launch"] + test_args.split()                                         │
│ ❱ 54 │   result = execute_subprocess_async(cmd, env=os.environ.copy())                           │
│   55 │   if result.returncode == 0:                                                              │
│   56 │   │   print("Test is a success! You are ready for your distributed training!")            │
│   57                                                                                             │
│                                                                                                  │
│ /opt/conda/envs/accelerate/lib/python3.8/site-packages/accelerate/test_utils/testing.py:383 in   │
│ execute_subprocess_async                                                                         │
│                                                                                                  │
│   380 │   cmd_str = " ".join(cmd)                                                                │
│   381 │   if result.returncode > 0:                                                              │
│   382 │   │   stderr = "\n".join(result.stderr)                                                  │
│ ❱ 383 │   │   raise RuntimeError(                                                                │
│   384 │   │   │   f"'{cmd_str}' failed with returncode {result.returncode}\n\n"                  │
│   385 │   │   │   f"The combined stderr from workers follows:\n{stderr}"                         │
│   386 │   │   )                                                                                  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: 'accelerate-launch /opt/conda/envs/accelerate/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py' failed with returncode 1

The combined stderr from workers follows:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/conda/envs/accelerate/bin/accelerate-launch:8 in <module>                                   │
│                                                                                                  │
│   5 from accelerate.commands.launch import main                                                  │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(main())                                                                         │
│   9                                                                                              │
│                                                                                                  │
│ /opt/conda/envs/accelerate/lib/python3.8/site-packages/accelerate/commands/launch.py:975 in main │
│                                                                                                  │
│   972 def main():                                                                                │
│   973 │   parser = launch_command_parser()                                                       │
│   974 │   args = parser.parse_args()                                                             │
│ ❱ 975 │   launch_command(args)                                                                   │
│   976                                                                                            │
│   977                                                                                            │
│   978 if __name__ == "__main__":                                                                 │
│                                                                                                  │
│ /opt/conda/envs/accelerate/lib/python3.8/site-packages/accelerate/commands/launch.py:960 in      │
│ launch_command                                                                                   │
│                                                                                                  │
│   957 │   elif args.use_megatron_lm and not args.cpu:                                            │
│   958 │   │   multi_gpu_launcher(args)                                                           │
│   959 │   elif args.multi_gpu and not args.cpu:                                                  │
│ ❱ 960 │   │   multi_gpu_launcher(args)                                                           │
│   961 │   elif args.tpu and not args.cpu:                                                        │
│   962 │   │   if args.tpu_use_cluster:                                                           │
│   963 │   │   │   tpu_pod_launcher(args)                                                         │
│                                                                                                  │
│ /opt/conda/envs/accelerate/lib/python3.8/site-packages/accelerate/commands/launch.py:649 in      │
│ multi_gpu_launcher                                                                               │
│                                                                                                  │
│   646 │   )                                                                                      │
│   647 │   with patch_environment(**current_env):                                                 │
│   648 │   │   try:                                                                               │
│ ❱ 649 │   │   │   distrib_run.run(args)                                                          │
│   650 │   │   except Exception:                                                                  │
│   651 │   │   │   if is_rich_available() and debug:                                              │
│   652 │   │   │   │   console = get_console()                                                    │
│                                                                                                  │
│ /opt/conda/envs/accelerate/lib/python3.8/site-packages/torch/distributed/run.py:785 in run       │
│                                                                                                  │
│   782 │   │   )                                                                                  │
│   783 │                                                                                          │
│   784 │   config, cmd, cmd_args = config_from_args(args)                                         │
│ ❱ 785 │   elastic_launch(                                                                        │
│   786 │   │   config=config,                                                                     │
│   787 │   │   entrypoint=cmd,                                                                    │
│   788 │   )(*cmd_args)                                                                           │
│                                                                                                  │
│ /opt/conda/envs/accelerate/lib/python3.8/site-packages/torch/distributed/launcher/api.py:134 in  │
│ __call__                                                                                         │
│                                                                                                  │
│   131 │   │   self._entrypoint = entrypoint                                                      │
│   132 │                                                                                          │
│   133 │   def __call__(self, *args):                                                             │
│ ❱ 134 │   │   return launch_agent(self._config, self._entrypoint, list(args))                    │
│   135                                                                                            │
│   136                                                                                            │
│   137 def _get_entrypoint_name(                                                                  │
│                                                                                                  │
│ /opt/conda/envs/accelerate/lib/python3.8/site-packages/torch/distributed/launcher/api.py:250 in  │
│ launch_agent                                                                                     │
│                                                                                                  │
│   247 │   │   │   # if the error files for the failed children exist                             │
│   248 │   │   │   # @record will copy the first error (root cause)                               │
│   249 │   │   │   # to the error file of the launcher process.                                   │
│ ❱ 250 │   │   │   raise ChildFailedError(                                                        │
│   251 │   │   │   │   name=entrypoint_name,                                                      │
│   252 │   │   │   │   failures=result.failures,                                                  │
│   253 │   │   │   )                                                                              │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ChildFailedError:
============================================================
/opt/conda/envs/accelerate/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-07-13_20:20:03
  host      : d412dfc663fe
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 928)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 928
[2]:
  time      : 2023-07-13_20:20:03
  host      : d412dfc663fe
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 929)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 929
[3]:
  time      : 2023-07-13_20:20:03
  host      : d412dfc663fe
  rank      : 3 (local_rank: 3)
  exitcode  : -7 (pid: 930)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 930
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-13_20:20:03
  host      : d412dfc663fe
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 927)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 927
============================================================
ERROR conda.cli.main_run:execute(47): `conda run accelerate test` failed. (See above for error)
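
For reading the failure table above: the `exitcode: -7` entries follow the usual Python convention that a negative exit code is the negated number of the signal that killed the process, which is why the report says `Signal 7 (SIGBUS)`. A minimal sketch for decoding such codes, where `describe_exitcode` is a hypothetical helper name and not part of Accelerate or PyTorch:

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Turn a worker exit code into a short readable description.

    Hypothetical helper for illustration only.
    """
    if exitcode < 0:
        # A negative exit code is the negated number of the signal
        # that terminated the process (e.g. -7 -> signal 7).
        try:
            return f"killed by {signal.Signals(-exitcode).name}"
        except ValueError:
            # The number does not map to a known signal on this platform.
            return f"killed by unknown signal {-exitcode}"
    return f"exited with status {exitcode}"


if __name__ == "__main__":
    # The value reported for every rank in the log above.
    print(describe_exitcode(-7))
```

Signal numbers are platform dependent, so the name printed for `-7` is only meaningful on the same kind of system that produced the log (here x86-64 Linux).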

Any help would be greatly appreciated!
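
One thing that may be worth ruling out, as an assumption that nothing in this thread confirms: NCCL can use shared memory for communication between processes on the same machine, and a Docker container gets only a small `/dev/shm` by default unless it is started with a larger `--shm-size`. Running out of that space is a known way to end up with SIGBUS. A rough sketch for checking how much is available from inside the container (`shm_usage_mib` is a hypothetical helper name):

```python
import os
import shutil


def shm_usage_mib(path: str = "/dev/shm") -> dict:
    """Return total/used/free space of the filesystem at `path`, in MiB.

    Hypothetical helper for illustration; `/dev/shm` is the usual
    shared-memory mount point on Linux.
    """
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return {
        "total_mib": usage.total // mib,
        "used_mib": usage.used // mib,
        "free_mib": usage.free // mib,
    }


if __name__ == "__main__":
    # Only query /dev/shm if it actually exists on this system.
    if os.path.isdir("/dev/shm"):
        print(shm_usage_mib())
    else:
        print("/dev/shm not found on this system")
```

If the total turns out to be very small, restarting the container with a larger `--shm-size` (or with `--ipc=host`) would be the thing to try; whether that resolves this particular failure is not established here.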

muellerzr · July 14, 2023, 10:29am

Try pulling down the latest main; I believe this was fixed yesterday.

mario-dg · July 14, 2023, 12:31pm

Thank you for the tip, but sadly this didn’t work. I received the same error, but with additional info:

Running:  accelerate-launch /opt/conda/envs/accelerate/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: 
stdout: ===================================BUG REPORT===================================
stdout: Welcome to bitsandbytes. For bug reports, please run
stdout: 
stdout: python -m bitsandbytes
stdout: 
stdout:  and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
stdout: ================================================================================
stdout: bin /opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so
stdout: CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
stdout: CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
stdout: CUDA SETUP: Highest compute capability among GPUs detected: 6.1
stdout: CUDA SETUP: Detected CUDA version 112
stdout: CUDA SETUP: Loading binary /opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so...
stdout: [2023-07-14 12:27:40,205] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: 
stdout: ===================================BUG REPORT===================================
stdout: Welcome to bitsandbytes. For bug reports, please run
stdout: 
stdout: python -m bitsandbytes
stdout: 
stdout:  and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
stdout: ================================================================================
stdout: bin /opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so
stdout: CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
stdout: CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
stdout: CUDA SETUP: Highest compute capability among GPUs detected: 6.1
stdout: CUDA SETUP: Detected CUDA version 112
stdout: CUDA SETUP: Loading binary /opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so...
stdout: 
stdout: ===================================BUG REPORT===================================
stdout: Welcome to bitsandbytes. For bug reports, please run
stdout: 
stdout: python -m bitsandbytes
stdout: 
stdout:  and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
stdout: ================================================================================
stdout: 
stdout: ===================================BUG REPORT===================================
stdout: Welcome to bitsandbytes. For bug reports, please run
stdout: 
stdout: python -m bitsandbytes
stdout: 
stdout:  and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
stdout: ================================================================================
stdout: bin /opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so
stdout: CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
stdout: CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
stdout: CUDA SETUP: Highest compute capability among GPUs detected: 6.1
stdout: CUDA SETUP: Detected CUDA version 112
stdout: CUDA SETUP: Loading binary /opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so...
stdout: bin /opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so
stdout: CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
stdout: CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
stdout: CUDA SETUP: Highest compute capability among GPUs detected: 6.1
stdout: CUDA SETUP: Detected CUDA version 112
stdout: CUDA SETUP: Loading binary /opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so...
stdout: 
stdout: ===================================BUG REPORT===================================
stdout: Welcome to bitsandbytes. For bug reports, please run
stdout: 
stdout: python -m bitsandbytes
stdout: 
stdout:  and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
stdout: ================================================================================
stdout: bin /opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so
stdout: CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
stdout: CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
stdout: CUDA SETUP: Highest compute capability among GPUs detected: 6.1
stdout: CUDA SETUP: Detected CUDA version 112
stdout: CUDA SETUP: Loading binary /opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so...
stdout: [2023-07-14 12:27:45,766] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: [2023-07-14 12:27:46,053] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: [2023-07-14 12:27:46,056] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: [2023-07-14 12:27:46,056] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 4
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout: 
stdout: Mixed precision type: fp16
stdout: 
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 4
stdout: Process index: 3
stdout: Local process index: 3
stdout: Device: cuda:3
stdout: 
stdout: Mixed precision type: fp16
stdout: 
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 4
stdout: Process index: 2
stdout: Local process index: 2
stdout: Device: cuda:2
stdout: 
stdout: Mixed precision type: fp16
stdout: 
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 4
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout: 
stdout: Mixed precision type: fp16
stdout: 
Traceback (most recent call last):
  File "/opt/conda/envs/accelerate/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/accelerate/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/opt/conda/envs/accelerate/lib/python3.8/site-packages/accelerate/commands/test.py", line 54, in test_command
    result = execute_subprocess_async(cmd, env=os.environ.copy())
  File "/opt/conda/envs/accelerate/lib/python3.8/site-packages/accelerate/test_utils/testing.py", line 391, in execute_subprocess_async
    raise RuntimeError(
RuntimeError: 'accelerate-launch /opt/conda/envs/accelerate/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py' failed with returncode 1

The combined stderr from workers follows:
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/conda/envs/accelerate did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')}
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/tmp/torchelastic_oarpjszp/none_ylpto1p4/attempt_0/2/error.json')}
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/conda/envs/accelerate did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')}
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/tmp/torchelastic_oarpjszp/none_ylpto1p4/attempt_0/3/error.json')}
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/conda/envs/accelerate did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')}
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/tmp/torchelastic_oarpjszp/none_ylpto1p4/attempt_0/0/error.json')}
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/conda/envs/accelerate did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')}
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/tmp/torchelastic_oarpjszp/none_ylpto1p4/attempt_0/1/error.json')}
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/conda/envs/accelerate did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')}
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
/opt/conda/envs/accelerate/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 2218) of binary: /opt/conda/envs/accelerate/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/accelerate/bin/accelerate-launch", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/accelerate/lib/python3.8/site-packages/accelerate/commands/launch.py", line 985, in main
    launch_command(args)
  File "/opt/conda/envs/accelerate/lib/python3.8/site-packages/accelerate/commands/launch.py", line 970, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/envs/accelerate/lib/python3.8/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/envs/accelerate/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/envs/accelerate/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/accelerate/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/conda/envs/accelerate/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-07-14_12:27:53
  host      : f950bbcdba9b
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 2219)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 2219
[2]:
  time      : 2023-07-14_12:27:53
  host      : f950bbcdba9b
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 2220)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 2220
[3]:
  time      : 2023-07-14_12:27:53
  host      : f950bbcdba9b
  rank      : 3 (local_rank: 3)
  exitcode  : -7 (pid: 2221)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 2221
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-14_12:27:53
  host      : f950bbcdba9b
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 2218)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 2218
============================================================
ERROR conda.cli.main_run:execute(47): `conda run accelerate test` failed. (See above for error)
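
For readers decoding the report above: a negative `exitcode` from a child process conventionally means it was killed by the signal with that number, so `-7` corresponds to signal 7, which the traceback lines name as SIGBUS. A minimal sketch of that mapping, assuming a Linux host (the helper name `describe_exitcode` is hypothetical, not part of Accelerate or PyTorch):

```python
import signal


def describe_exitcode(code: int) -> str:
    """Turn a subprocess-style exit code into a readable description.

    Hypothetical helper for illustration only. Negative codes follow the
    convention that -N means "terminated by signal N".
    """
    if code < 0:
        try:
            # signal.Signals maps a signal number to its symbolic name.
            return f"killed by {signal.Signals(-code).name}"
        except ValueError:
            return f"killed by unknown signal {-code}"
    return f"exited with status {code}"


# Signal numbering is platform-specific; on Linux, 7 is SIGBUS.
print(describe_exitcode(-7))
```

This only names the signal. It does not say why all four ranks received it.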

Nadav-Timor · July 15, 2023, 7:08pm

What was the fix/bug? Can you please send a link to the commit/PR? It seems like there have been multiple changes during the last few days.

mario-dg · July 17, 2023, 12:06pm

Any ideas or updates about this?

mohitattarde · August 8, 2024, 5:34pm

Your `nvcc --version` (CUDA toolkit) reports 11.2, but your PyTorch build uses CUDA 11.7.
They should be the same.
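
One way to check the two versions mentioned here side by side is sketched below. It assumes `nvcc` is on `PATH` and that PyTorch is installed; the function names are hypothetical. Whether a mismatch actually explains the SIGBUS is not established in this thread, so treat the comparison as a diagnostic hint only.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(text: str) -> Optional[str]:
    """Extract the 'release X.Y' version from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", text)
    return match.group(1) if match else None


def toolkit_cuda_version() -> Optional[str]:
    """CUDA toolkit version reported by nvcc, or None if nvcc is missing."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True
    )
    return parse_nvcc_release(result.stdout)


def torch_cuda_version() -> Optional[str]:
    """CUDA version PyTorch was built against, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    # torch.version.cuda is None for CPU-only builds.
    return torch.version.cuda


toolkit = toolkit_cuda_version()
built = torch_cuda_version()
print(f"nvcc toolkit: {toolkit}, torch build: {built}")
if toolkit and built and toolkit != built:
    print("Versions differ between the CUDA toolkit and the PyTorch build.")
```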
