Hi,
I am trying to run the textual_inversion.py script.
Before running the script, I followed the instructions and set the default accelerate configuration with `accelerate config default`.
The following is the result of `accelerate env`:
- `Accelerate` version: 0.30.1
- Platform: Linux-5.15.0-113-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /usr/local/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.1.2+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 1007.53 GB
- GPU type: NVIDIA A100 80GB PCIe
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- debug: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: False
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: False
- tpu_use_cluster: False
- tpu_use_sudo: False
When I run the script, it prints that the code runs only on the device xla:0, but my local machine only has 8 A100 GPUs (no TPU/XLA hardware).
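To make the symptom concrete, a minimal check like this (check_devices.py is a hypothetical throwaway script, not part of textual_inversion.py) is enough to show which device Accelerate resolves to:

# check_devices.py - hypothetical throwaway check, run with plain "python check_devices.py"
import torch
from accelerate import Accelerator

# What plain PyTorch sees
print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.cuda.device_count():", torch.cuda.device_count())

# What Accelerate resolves to; in my runs the training script ends up on xla:0
accelerator = Accelerator()
print("accelerator.device:", accelerator.device)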
I am running the script in a Docker container that has the following torch libraries (see the quick check after this list):
torch 2.1.2+cu118
torch-xla 2.1.0
torchmetrics 1.0.3
torchvision 0.16.2+cu118
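Since torch-xla is installed alongside the CUDA build of torch, I suspect Accelerate is detecting XLA at import time. A minimal way to confirm that (assuming `accelerate.utils.is_torch_xla_available` is the relevant check) would be:

import importlib.util
import os

from accelerate.utils import is_torch_xla_available

# Is the torch_xla package importable inside this container?
print("torch_xla installed:", importlib.util.find_spec("torch_xla") is not None)
# Does Accelerate consider XLA available (which would explain the xla backend)?
print("accelerate sees XLA:", is_torch_xla_available())
# PJRT_DEVICE is the env var mentioned in the "Defaulting to PJRT_DEVICE=CPU" warnings below
print("PJRT_DEVICE:", os.environ.get("PJRT_DEVICE"))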
Outside of the container, the NVIDIA CUDA driver version is 11.4:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.256.02 Driver Version: 470.256.02 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... Off | 00000000:1B:00.0 Off | 0 |
| N/A 37C P0 41W / 300W | 0MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... Off | 00000000:1C:00.0 Off | 0 |
| N/A 37C P0 44W / 300W | 0MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100 80G... Off | 00000000:1D:00.0 Off | 0 |
| N/A 37C P0 43W / 300W | 0MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100 80G... Off | 00000000:1E:00.0 Off | 0 |
| N/A 36C P0 44W / 300W | 0MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100 80G... Off | 00000000:3D:00.0 Off | 0 |
| N/A 34C P0 40W / 300W | 0MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100 80G... Off | 00000000:3F:00.0 Off | 0 |
| N/A 36C P0 41W / 300W | 0MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100 80G... Off | 00000000:40:00.0 Off | 0 |
| N/A 36C P0 45W / 300W | 0MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100 80G... Off | 00000000:41:00.0 Off | 0 |
| N/A 36C P0 44W / 300W | 0MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
To inspect what’s going on, the following is the result when I ran `accelerate test`:
Running: accelerate-launch /usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py
stderr: WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
stderr: WARNING:root:Defaulting to PJRT_DEVICE=CPU
stderr: WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
stderr: WARNING:root:Defaulting to PJRT_DEVICE=CPU
stderr: WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
stderr: WARNING:root:Defaulting to PJRT_DEVICE=CPU
stderr: WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
stderr: WARNING:root:Defaulting to PJRT_DEVICE=CPU
stdout: Distributed environment: XLA Backend: xla
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 5
stdout: Device: xla:0
stdout:
stdout: Mixed precision type: no
stdout:
stderr: WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
stderr: WARNING:root:Defaulting to PJRT_DEVICE=CPU
stderr: WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
stderr: WARNING:root:Defaulting to PJRT_DEVICE=CPU
stdout: Distributed environment: XLA Backend: xla
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 7
stdout: Device: xla:0
stdout:
stdout: Mixed precision type: no
stdout:
stderr: WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
stderr: WARNING:root:Defaulting to PJRT_DEVICE=CPU
stdout: Distributed environment: XLA Backend: xla
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 6
stdout: Device: xla:0
stdout:
stdout: Mixed precision type: no
stdout:
stdout:
stdout: **Test process execution**
stdout:
stdout: **Test process execution**
stdout: Distributed environment: XLA Backend: xla
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 2
stdout: Device: xla:0
stdout:
stdout: Mixed precision type: no
stdout:
stdout: Distributed environment: XLA Backend: xla
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 3
stdout: Device: xla:0
stdout:
stdout: Mixed precision type: no
stdout:
stdout: Distributed environment: XLA Backend: xla
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 1
stdout: Device: xla:0
stdout:
stdout: Mixed precision type: no
stdout:
stdout:
stdout: **Test process execution**
stdout: Distributed environment: XLA Backend: xla
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 4
stdout: Device: xla:0
stdout:
stdout: Mixed precision type: no
stdout:
stdout:
stdout: **Test process execution**
stderr: WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
stderr: WARNING:root:Defaulting to PJRT_DEVICE=CPU
stdout:
stdout: **Test process execution**
stdout:
stdout: **Test process execution**
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: XLA Backend: xla
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: xla:0
stdout:
stdout: Mixed precision type: no
stdout:
stdout:
stdout: **Test process execution**
stdout:
stdout: **Test process execution**
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
stderr: main()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
stderr: main()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
stderr: process_execution_check()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
stderr: process_execution_check()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 102, in process_execution_check
stderr: assert f.getvalue().rstrip() == ""
stderr: with open(path) as f:
stderr: AssertionError
stderr: FileNotFoundError: [Errno 2] No such file or directory: 'check_main_process_first.txt'
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
stderr: main()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
stderr: process_execution_check()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
stderr: assert f.getvalue().rstrip() == ""
stderr: AssertionError
stderr: main()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
stderr: process_execution_check()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 102, in process_execution_check
stderr: with open(path) as f:
stderr: FileNotFoundError: [Errno 2] No such file or directory: 'check_main_process_first.txt'
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
stderr: main()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
stderr: process_execution_check()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
stderr: assert f.getvalue().rstrip() == ""
stderr: AssertionError
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
stderr: main()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
stderr: process_execution_check()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
stderr: assert f.getvalue().rstrip() == ""
stderr: AssertionError
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
stderr: main()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
stderr: process_execution_check()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
stderr: assert f.getvalue().rstrip() == ""
stderr: AssertionError
stdout:
stdout: **Test split between processes as a list**
stdout:
stdout: **Test split between processes as a dict**
stdout:
stdout: **Test split between processes as a tensor**
stdout:
stdout: **Test split between processes as a datasets.Dataset**
stdout:
stdout: **Test random number generator synchronization**
stdout: All rng are properly synched.
stdout:
stdout: **DataLoader integration test**
stdout: 0 tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
stdout: device='xla:0') <class 'accelerate.data_loader.MpDeviceLoaderWrapper'>
stdout: Non-shuffled dataloader passing.
stdout: Shuffled dataloader passing.
stdout:
stdout: **Training integration test**
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributed setup with no batch split.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributes setup with batch split.
stdout: FP16 training check.
stderr: /usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py:315: UserWarning: In XLA autocast, but the target dtype is not supported. Disabling autocast.
stderr: XLA Autocast only supports dtype of torch.bfloat16 currently.
stderr: warnings.warn(error_message)
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Keep fp32 wrapper check.
stdout: BF16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributed setup with no batch split.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributes setup with batch split.
stdout: FP16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Keep fp32 wrapper check.
stdout: BF16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout:
stdout: **Breakpoint trigger test**
stderr: [2024-07-02 12:14:36,020] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 129704 closing signal SIGTERM
stderr: [2024-07-02 12:14:36,437] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 129705) of binary: /usr/bin/python3.10
stderr: Traceback (most recent call last):
stderr: File "/usr/local/bin/accelerate-launch", line 8, in <module>
stderr: sys.exit(main())
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1088, in main
stderr: launch_command(args)
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1073, in launch_command
stderr: multi_gpu_launcher(args)
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
stderr: distrib_run.run(args)
stderr: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
stderr: elastic_launch(
stderr: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
stderr: return launch_agent(self._config, self._entrypoint, list(args))
stderr: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
stderr: raise ChildFailedError(
stderr: torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
stderr: ============================================================
stderr: /usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py FAILED
stderr: ------------------------------------------------------------
stderr: Failures:
stderr: [1]:
stderr: time : 2024-07-02_12:14:36
stderr: host : 4f95f1ff1378
stderr: rank : 2 (local_rank: 2)
stderr: exitcode : 1 (pid: 129706)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: [2]:
stderr: time : 2024-07-02_12:14:36
stderr: host : 4f95f1ff1378
stderr: rank : 3 (local_rank: 3)
stderr: exitcode : 1 (pid: 129707)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: [3]:
stderr: time : 2024-07-02_12:14:36
stderr: host : 4f95f1ff1378
stderr: rank : 4 (local_rank: 4)
stderr: exitcode : 1 (pid: 129708)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: [4]:
stderr: time : 2024-07-02_12:14:36
stderr: host : 4f95f1ff1378
stderr: rank : 5 (local_rank: 5)
stderr: exitcode : 1 (pid: 129709)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: [5]:
stderr: time : 2024-07-02_12:14:36
stderr: host : 4f95f1ff1378
stderr: rank : 6 (local_rank: 6)
stderr: exitcode : 1 (pid: 129710)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: [6]:
stderr: time : 2024-07-02_12:14:36
stderr: host : 4f95f1ff1378
stderr: rank : 7 (local_rank: 7)
stderr: exitcode : 1 (pid: 129711)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ------------------------------------------------------------
stderr: Root Cause (first observed failure):
stderr: [0]:
stderr: time : 2024-07-02_12:14:36
stderr: host : 4f95f1ff1378
stderr: rank : 1 (local_rank: 1)
stderr: exitcode : 1 (pid: 129705)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ============================================================
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/test.py", line 53, in test_command
result = execute_subprocess_async(cmd)
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/testing.py", line 555, in execute_subprocess_async
raise RuntimeError(
RuntimeError: 'accelerate-launch /usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py' failed with returncode 1
The combined stderr from workers follows:
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:Defaulting to PJRT_DEVICE=CPU
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:Defaulting to PJRT_DEVICE=CPU
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:Defaulting to PJRT_DEVICE=CPU
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:Defaulting to PJRT_DEVICE=CPU
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:Defaulting to PJRT_DEVICE=CPU
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:Defaulting to PJRT_DEVICE=CPU
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:Defaulting to PJRT_DEVICE=CPU
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:Defaulting to PJRT_DEVICE=CPU
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
main()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
process_execution_check()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
process_execution_check()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 102, in process_execution_check
assert f.getvalue().rstrip() == ""
with open(path) as f:
AssertionError
FileNotFoundError: [Errno 2] No such file or directory: 'check_main_process_first.txt'
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
process_execution_check()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
assert f.getvalue().rstrip() == ""
AssertionError
main()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
process_execution_check()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 102, in process_execution_check
with open(path) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'check_main_process_first.txt'
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
process_execution_check()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
assert f.getvalue().rstrip() == ""
AssertionError
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
process_execution_check()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
assert f.getvalue().rstrip() == ""
AssertionError
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
process_execution_check()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
assert f.getvalue().rstrip() == ""
AssertionError
/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py:315: UserWarning: In XLA autocast, but the target dtype is not supported. Disabling autocast.
XLA Autocast only supports dtype of torch.bfloat16 currently.
warnings.warn(error_message)
[2024-07-02 12:14:36,020] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 129704 closing signal SIGTERM
[2024-07-02 12:14:36,437] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 129705) of binary: /usr/bin/python3.10
Traceback (most recent call last):
File "/usr/local/bin/accelerate-launch", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1088, in main
launch_command(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1073, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-07-02_12:14:36
host : 4f95f1ff1378
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 129706)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-07-02_12:14:36
host : 4f95f1ff1378
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 129707)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-07-02_12:14:36
host : 4f95f1ff1378
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 129708)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2024-07-02_12:14:36
host : 4f95f1ff1378
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 129709)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2024-07-02_12:14:36
host : 4f95f1ff1378
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 129710)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
time : 2024-07-02_12:14:36
host : 4f95f1ff1378
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 129711)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-02_12:14:36
host : 4f95f1ff1378
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 129705)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
How can I resolve this issue?
I am looking for a way to run the script on GPUs.
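For reference, what I am ultimately hoping for is that a minimal launched check like the one below (check_multi_gpu.py is a hypothetical throwaway script) would report MULTI_GPU and cuda devices instead of xla:0:

# check_multi_gpu.py - hypothetical throwaway script, launched with:
#   accelerate launch check_multi_gpu.py
from accelerate import Accelerator

accelerator = Accelerator()
# On a working 8x A100 setup I would expect DistributedType.MULTI_GPU,
# num_processes == 8 and devices cuda:0 ... cuda:7 here, not xla:0.
print(accelerator.distributed_type, accelerator.num_processes,
      accelerator.process_index, accelerator.device)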