No GPUs found in a machine definitely with GPUs

Trusure · March 1, 2023, 4:33am

The error is below:

stderr: ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
stderr: │ /home/chenzhixuan/anaconda3/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_scrip │
stderr: │ t.py:336 in <module>                                                                             │
stderr: │                                                                                                  │
stderr: │   333                                                                                            │
stderr: │   334                                                                                            │
stderr: │   335 if __name__ == "__main__":                                                                 │
stderr: │ ❱ 336 │   main()                                                                                 │
stderr: │   337                                                                                            │
stderr: │                                                                                                  │
stderr: │ /home/chenzhixuan/anaconda3/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_scrip │
stderr: │ t.py:305 in main                                                                                 │
stderr: │                                                                                                  │
stderr: │   302                                                                                            │
stderr: │   303                                                                                            │
stderr: │   304 def main():                                                                                │
stderr: │ ❱ 305 │   accelerator = Accelerator()                                                            │
stderr: │   306 │   state = accelerator.state                                                              │
stderr: │   307 │   if state.local_process_index == 0:                                                     │
stderr: │   308 │   │   print("**Initialization**")                                                        │
stderr: │                                                                                                  │
stderr: │ /home/chenzhixuan/anaconda3/lib/python3.9/site-packages/accelerate/accelerator.py:323 in         │
stderr: │ __init__                                                                                         │
stderr: │                                                                                                  │
stderr: │    320 │   │   │   │   │   │   self.init_handler = handler                                       │
stderr: │    321 │   │                                                                                     │
stderr: │    322 │   │   kwargs = self.init_handler.to_kwargs() if self.init_handler is not None else {}   │
stderr: │ ❱  323 │   │   self.state = AcceleratorState(                                                    │
stderr: │    324 │   │   │   mixed_precision=mixed_precision,                                              │
stderr: │    325 │   │   │   cpu=cpu,                                                                      │
stderr: │    326 │   │   │   dynamo_backend=dynamo_backend,                                                │
stderr: │                                                                                                  │
stderr: │ /home/chenzhixuan/anaconda3/lib/python3.9/site-packages/accelerate/state.py:162 in __init__      │
stderr: │                                                                                                  │
stderr: │   159 │   │   │   elif int(os.environ.get("LOCAL_RANK", -1)) != -1 and not cpu:                  │
stderr: │   160 │   │   │   │   self.distributed_type = DistributedType.MULTI_GPU                          │
stderr: │   161 │   │   │   │   if not torch.distributed.is_initialized():                                 │
stderr: │ ❱ 162 │   │   │   │   │   torch.distributed.init_process_group(backend="nccl", **kwargs)         │
stderr: │   163 │   │   │   │   │   self.backend = "nccl"                                                  │
stderr: │   164 │   │   │   │   self.num_processes = torch.distributed.get_world_size()                    │
stderr: │   165 │   │   │   │   self.process_index = torch.distributed.get_rank()                          │
stderr: │                                                                                                  │
stderr: │ /home/chenzhixuan/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:76 │
stderr: │ 1 in init_process_group                                                                          │
stderr: │                                                                                                  │
stderr: │    758 │   │   │   # different systems (e.g. RPC) in case the store is multi-tenant.             │
stderr: │    759 │   │   │   store = PrefixStore("default_pg", store)                                      │
stderr: │    760 │   │                                                                                     │
stderr: │ ❱  761 │   │   default_pg = _new_process_group_helper(                                           │
stderr: │    762 │   │   │   world_size,                                                                   │
stderr: │    763 │   │   │   rank,                                                                         │
stderr: │    764 │   │   │   [],                                                                           │
stderr: │                                                                                                  │
stderr: │ /home/chenzhixuan/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:89 │
stderr: │ 7 in _new_process_group_helper                                                                   │
stderr: │                                                                                                  │
stderr: │    894 │   │   │   │   pg_options.is_high_priority_stream = False                                │
stderr: │    895 │   │   │   │   pg_options._timeout = timeout                                             │
stderr: │    896 │   │   │                                                                                 │
stderr: │ ❱  897 │   │   │   pg = ProcessGroupNCCL(prefix_store, group_rank, group_size, pg_options)       │
stderr: │    898 │   │   │   # In debug mode and if GLOO is available, wrap in a wrapper PG that           │
stderr: │    899 │   │   │   # enables enhanced collective checking for debugability.                      │
stderr: │    900 │   │   │   if get_debug_level() == DebugLevel.DETAIL:                                    │
stderr: ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
stderr: RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

But my config is set to multi-GPU distributed mode:

- `Accelerate` version: 0.16.0
- Platform: Linux-5.15.0-58-generic-x86_64-with-glibc2.31
- Python version: 3.9.13
- Numpy version: 1.21.5
- PyTorch version (GPU?): 1.13.0+cu116 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: fp16
        - use_cpu: False
        - dynamo_backend: NO
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: [0,1]
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}
        - megatron_lm_config: {}
        - downcast_bf16: no
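
The traceback shows `init_process_group(backend="nccl")` being reached as soon as `LOCAL_RANK` is set and `cpu` is false, so the NCCL error is the first place a missing GPU becomes visible. A minimal sketch of that branch with an explicit check added is below. The helper name `pick_backend` is hypothetical and not part of Accelerate; it only mirrors the `LOCAL_RANK` test from `state.py` as quoted in the traceback.

```python
import os


def pick_backend(env, cuda_device_count):
    """Mirror the LOCAL_RANK check from the traceback, then guard on visible GPUs."""
    distributed = int(env.get("LOCAL_RANK", -1)) != -1
    if not distributed:
        return None  # single-process run, no process group needed
    if cuda_device_count == 0:
        # NCCL would fail at this point with "no GPUs found"; report it earlier
        raise RuntimeError(
            "LOCAL_RANK is set but no CUDA device is visible; "
            f"CUDA_VISIBLE_DEVICES={env.get('CUDA_VISIBLE_DEVICES')!r}"
        )
    return "nccl"


if __name__ == "__main__":
    try:
        import torch
        count = torch.cuda.device_count()
    except ImportError:
        count = 0  # assumption: treat a missing torch install as "no GPUs"
    try:
        print(pick_backend(os.environ, count))
    except RuntimeError as err:
        print(err)
```

Running this inside the launched process would show whether the child sees any device at all and what `CUDA_VISIBLE_DEVICES` it inherited.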

violasara · March 23, 2023, 7:47am

I’m having the same problem. I use the following test script:

from accelerate import Accelerator
import torch


if __name__ == "__main__":
    print("Cuda support:", torch.cuda.is_available(),":", torch.cuda.device_count(), "devices")
    accelerator = Accelerator()
    print(accelerator.state)

When started without accelerate, the output is:

Cuda support: True : 8 devices
Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: no

When started with accelerate launch test.py, torch.cuda.is_available() returns False:

Cuda support: False : 0 devices
Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cpu
Mixed precision type: no

The reason seems to be that after the launcher creates a subprocess (e.g. in line 397 of accelerate/commands/launch.py), torch.cuda.is_available() called from that subprocess returns False. Am I missing something?
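
One way to narrow this down is to print, from inside the script, the environment the process actually inherited and compare the two runs. The sketch below is only a diagnostic aid; the function names `gpu_env_report` and `torch_report` are hypothetical, and the choice of variables to inspect is an assumption about what is likely to differ between a direct run and a launched one.

```python
import os


def gpu_env_report(env=None):
    """Collect environment variables that commonly affect GPU visibility."""
    env = os.environ if env is None else env
    names = ("CUDA_VISIBLE_DEVICES", "LOCAL_RANK", "RANK", "WORLD_SIZE")
    report = {name: env.get(name) for name in names}
    # An empty CUDA_VISIBLE_DEVICES hides every GPU from the process
    report["cuda_visible_devices_is_empty"] = env.get("CUDA_VISIBLE_DEVICES") == ""
    return report


def torch_report():
    """Summarize what PyTorch sees; returns {'torch': None} if it is not installed."""
    try:
        import torch
    except ImportError:
        return {"torch": None}
    return {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    print(gpu_env_report())
    print(torch_report())
```

Running the file once with plain `python` and once with `accelerate launch` and diffing the two outputs should show whether the launcher changed the visible devices.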

Trusure · March 23, 2023, 9:50am

I haven’t figured it out.

muellerzr · March 23, 2023, 11:42am

What is accelerate env’s output currently?

violasara · March 23, 2023, 11:56am

- `Accelerate` version: 0.17.1
- Platform: Linux-5.4.0-65-generic-x86_64-with-glibc2.31
- Python version: 3.9.16
- Numpy version: 1.23.5
- PyTorch version (GPU?): 1.12.1 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: NO
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: [3,4,5]
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}
        - megatron_lm_config: {}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - dynamo_config: {}

Sorry, I updated accelerate in the meantime. It is now line 574 in launch.py.

violasara · March 24, 2023, 12:38pm

Hi, do you have any suggestions? Any help would be appreciated.

muellerzr · March 24, 2023, 1:52pm

Can you open an issue on our GitHub for this so I can take a look? Please make sure to follow all the bits we ask for on there.

violasara · March 28, 2023, 5:08pm

I did my best: https://github.com/huggingface/accelerate/issues/1260

1TuanPham · December 27, 2023, 7:26pm

Apparently, if there’s a mismatch in the CUDA version, accelerate env will output False even for a setup with GPUs, so just a heads-up for everyone.
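
A rough way to check for such a mismatch is to compare the CUDA version PyTorch was built with against the one shown in the `nvidia-smi` header. The sketch below is a heuristic under stated assumptions: it only flags the case where the PyTorch build targets a newer CUDA than the driver reports, it assumes `nvidia-smi` is on the PATH, and the helper names are hypothetical.

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Turn '11.6' into (11, 6) so versions compare numerically."""
    return tuple(int(part) for part in text.split(".")[:2])


def driver_cuda_version():
    """Return the CUDA version from the nvidia-smi header, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        out = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=30
        ).stdout
    except (OSError, subprocess.SubprocessError):
        return None
    match = re.search(r"CUDA Version:\s*(\d+(?:\.\d+)?)", out)
    return match.group(1) if match else None


def build_newer_than_driver(build_cuda, driver_cuda):
    """True when the PyTorch build targets a newer CUDA than the driver reports."""
    if build_cuda is None or driver_cuda is None:
        return False  # not enough information to call it a mismatch
    return parse_version(build_cuda) > parse_version(driver_cuda)


if __name__ == "__main__":
    try:
        import torch
        build = torch.version.cuda
    except ImportError:
        build = None
    driver = driver_cuda_version()
    print("PyTorch built with CUDA:", build)
    print("Driver reports CUDA:", driver)
    print("Possible mismatch:", build_newer_than_driver(build, driver))
```

A `True` on the last line is only a hint to look at the driver or reinstall a matching PyTorch build; a `False` does not rule out other causes.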
