Hi,
I am trying to run the textual_inversion.py script.
Before running the script, I followed the instructions and set the default accelerate configuration with `accelerate config default`.
The following is the result of `accelerate env`:
- `Accelerate` version: 0.30.1
- Platform: Linux-5.15.0-113-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /usr/local/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.1.2+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 1007.53 GB
- GPU type: NVIDIA A100 80GB PCIe
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- debug: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: False
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: False
- tpu_use_cluster: False
- tpu_use_sudo: False
When I run the script, it prints that the code runs only on the device xla:0, but my local machine only has 8 A100 GPUs (no TPU/XLA hardware).
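To make the symptom concrete, a minimal check like this (check_devices.py is a hypothetical throwaway script, not part of textual_inversion.py) is enough to show which device Accelerate resolves to:

# check_devices.py - hypothetical throwaway check, run with plain "python check_devices.py"
import torch
from accelerate import Accelerator

# What plain PyTorch sees
print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.cuda.device_count():", torch.cuda.device_count())

# What Accelerate resolves to; in my runs the training script ends up on xla:0
accelerator = Accelerator()
print("accelerator.device:", accelerator.device)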
I am running the script in a Docker container that has the following torch libraries (see the quick check after this list):
torch 2.1.2+cu118
torch-xla 2.1.0
torchmetrics 1.0.3
torchvision 0.16.2+cu118
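Since torch-xla is installed alongside the CUDA build of torch, I suspect Accelerate is detecting XLA at import time. A minimal way to confirm that (assuming `accelerate.utils.is_torch_xla_available` is the relevant check) would be:

import importlib.util
import os

from accelerate.utils import is_torch_xla_available

# Is the torch_xla package importable inside this container?
print("torch_xla installed:", importlib.util.find_spec("torch_xla") is not None)
# Does Accelerate consider XLA available (which would explain the xla backend)?
print("accelerate sees XLA:", is_torch_xla_available())
# PJRT_DEVICE is the env var mentioned in the "Defaulting to PJRT_DEVICE=CPU" warnings below
print("PJRT_DEVICE:", os.environ.get("PJRT_DEVICE"))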
Outside of the container, the NVIDIA CUDA driver version is 11.4:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.256.02 Driver Version: 470.256.02 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... Off | 00000000:1B:00.0 Off | 0 |
| N/A 37C P0 41W / 300W | 0MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... Off | 00000000:1C:00.0 Off | 0 |
| N/A 37C P0 44W / 300W | 0MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100 80G... Off | 00000000:1D:00.0 Off | 0 |
| N/A 37C P0 43W / 300W | 0MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100 80G... Off | 00000000:1E:00.0 Off | 0 |
| N/A 36C P0 44W / 300W | 0MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100 80G... Off | 00000000:3D:00.0 Off | 0 |
| N/A 34C P0 40W / 300W | 0MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100 80G... Off | 00000000:3F:00.0 Off | 0 |
| N/A 36C P0 41W / 300W | 0MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100 80G... Off | 00000000:40:00.0 Off | 0 |
| N/A 36C P0 45W / 300W | 0MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100 80G... Off | 00000000:41:00.0 Off | 0 |
| N/A 36C P0 44W / 300W | 0MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
To inspect what’s going on, the following is the result when I ran `accelerate test`:
Running: accelerate-launch /usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py
stderr: WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
stderr: WARNING:root:Defaulting to PJRT_DEVICE=CPU
stderr: WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
stderr: WARNING:root:Defaulting to PJRT_DEVICE=CPU
stderr: WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
stderr: WARNING:root:Defaulting to PJRT_DEVICE=CPU
stderr: WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
stderr: WARNING:root:Defaulting to PJRT_DEVICE=CPU
stdout: Distributed environment: XLA Backend: xla
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 5
stdout: Device: xla:0
stdout:
stdout: Mixed precision type: no
stdout:
stderr: WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
stderr: WARNING:root:Defaulting to PJRT_DEVICE=CPU
stderr: WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
stderr: WARNING:root:Defaulting to PJRT_DEVICE=CPU
stdout: Distributed environment: XLA Backend: xla
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 7
stdout: Device: xla:0
stdout:
stdout: Mixed precision type: no
stdout:
stderr: WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
stderr: WARNING:root:Defaulting to PJRT_DEVICE=CPU
stdout: Distributed environment: XLA Backend: xla
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 6
stdout: Device: xla:0
stdout:
stdout: Mixed precision type: no
stdout:
stdout:
stdout: **Test process execution**
stdout:
stdout: **Test process execution**
stdout: Distributed environment: XLA Backend: xla
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 2
stdout: Device: xla:0
stdout:
stdout: Mixed precision type: no
stdout:
stdout: Distributed environment: XLA Backend: xla
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 3
stdout: Device: xla:0
stdout:
stdout: Mixed precision type: no
stdout:
stdout: Distributed environment: XLA Backend: xla
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 1
stdout: Device: xla:0
stdout:
stdout: Mixed precision type: no
stdout:
stdout:
stdout: **Test process execution**
stdout: Distributed environment: XLA Backend: xla
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 4
stdout: Device: xla:0
stdout:
stdout: Mixed precision type: no
stdout:
stdout:
stdout: **Test process execution**
stderr: WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
stderr: WARNING:root:Defaulting to PJRT_DEVICE=CPU
stdout:
stdout: **Test process execution**
stdout:
stdout: **Test process execution**
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: XLA Backend: xla
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: xla:0
stdout:
stdout: Mixed precision type: no
stdout:
stdout:
stdout: **Test process execution**
stdout:
stdout: **Test process execution**
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
stderr: main()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
stderr: main()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
stderr: process_execution_check()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
stderr: process_execution_check()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 102, in process_execution_check
stderr: assert f.getvalue().rstrip() == ""
stderr: with open(path) as f:
stderr: AssertionError
stderr: FileNotFoundError: [Errno 2] No such file or directory: 'check_main_process_first.txt'
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
stderr: main()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
stderr: process_execution_check()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
stderr: assert f.getvalue().rstrip() == ""
stderr: AssertionError
stderr: main()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
stderr: process_execution_check()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 102, in process_execution_check
stderr: with open(path) as f:
stderr: FileNotFoundError: [Errno 2] No such file or directory: 'check_main_process_first.txt'
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
stderr: main()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
stderr: process_execution_check()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
stderr: assert f.getvalue().rstrip() == ""
stderr: AssertionError
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
stderr: main()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
stderr: process_execution_check()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
stderr: assert f.getvalue().rstrip() == ""
stderr: AssertionError
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
stderr: main()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
stderr: process_execution_check()
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
stderr: assert f.getvalue().rstrip() == ""
stderr: AssertionError
stdout:
stdout: **Test split between processes as a list**
stdout:
stdout: **Test split between processes as a dict**
stdout:
stdout: **Test split between processes as a tensor**
stdout:
stdout: **Test split between processes as a datasets.Dataset**
stdout:
stdout: **Test random number generator synchronization**
stdout: All rng are properly synched.
stdout:
stdout: **DataLoader integration test**
stdout: 0 tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
stdout: device='xla:0') <class 'accelerate.data_loader.MpDeviceLoaderWrapper'>
stdout: Non-shuffled dataloader passing.
stdout: Shuffled dataloader passing.
stdout:
stdout: **Training integration test**
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributed setup with no batch split.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributes setup with batch split.
stdout: FP16 training check.
stderr: /usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py:315: UserWarning: In XLA autocast, but the target dtype is not supported. Disabling autocast.
stderr: XLA Autocast only supports dtype of torch.bfloat16 currently.
stderr: warnings.warn(error_message)
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Keep fp32 wrapper check.
stdout: BF16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributed setup with no batch split.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributes setup with batch split.
stdout: FP16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Keep fp32 wrapper check.
stdout: BF16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout:
stdout: **Breakpoint trigger test**
stderr: [2024-07-02 12:14:36,020] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 129704 closing signal SIGTERM
stderr: [2024-07-02 12:14:36,437] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 129705) of binary: /usr/bin/python3.10
stderr: Traceback (most recent call last):
stderr: File "/usr/local/bin/accelerate-launch", line 8, in <module>
stderr: sys.exit(main())
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1088, in main
stderr: launch_command(args)
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1073, in launch_command
stderr: multi_gpu_launcher(args)
stderr: File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
stderr: distrib_run.run(args)
stderr: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
stderr: elastic_launch(
stderr: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
stderr: return launch_agent(self._config, self._entrypoint, list(args))
stderr: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
stderr: raise ChildFailedError(
stderr: torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
stderr: ============================================================
stderr: /usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py FAILED
stderr: ------------------------------------------------------------
stderr: Failures:
stderr: [1]:
stderr: time : 2024-07-02_12:14:36
stderr: host : 4f95f1ff1378
stderr: rank : 2 (local_rank: 2)
stderr: exitcode : 1 (pid: 129706)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: [2]:
stderr: time : 2024-07-02_12:14:36
stderr: host : 4f95f1ff1378
stderr: rank : 3 (local_rank: 3)
stderr: exitcode : 1 (pid: 129707)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: [3]:
stderr: time : 2024-07-02_12:14:36
stderr: host : 4f95f1ff1378
stderr: rank : 4 (local_rank: 4)
stderr: exitcode : 1 (pid: 129708)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: [4]:
stderr: time : 2024-07-02_12:14:36
stderr: host : 4f95f1ff1378
stderr: rank : 5 (local_rank: 5)
stderr: exitcode : 1 (pid: 129709)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: [5]:
stderr: time : 2024-07-02_12:14:36
stderr: host : 4f95f1ff1378
stderr: rank : 6 (local_rank: 6)
stderr: exitcode : 1 (pid: 129710)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: [6]:
stderr: time : 2024-07-02_12:14:36
stderr: host : 4f95f1ff1378
stderr: rank : 7 (local_rank: 7)
stderr: exitcode : 1 (pid: 129711)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ------------------------------------------------------------
stderr: Root Cause (first observed failure):
stderr: [0]:
stderr: time : 2024-07-02_12:14:36
stderr: host : 4f95f1ff1378
stderr: rank : 1 (local_rank: 1)
stderr: exitcode : 1 (pid: 129705)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ============================================================
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/test.py", line 53, in test_command
result = execute_subprocess_async(cmd)
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/testing.py", line 555, in execute_subprocess_async
raise RuntimeError(
RuntimeError: 'accelerate-launch /usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py' failed with returncode 1
The combined stderr from workers follows:
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:Defaulting to PJRT_DEVICE=CPU
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:Defaulting to PJRT_DEVICE=CPU
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:Defaulting to PJRT_DEVICE=CPU
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:Defaulting to PJRT_DEVICE=CPU
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:Defaulting to PJRT_DEVICE=CPU
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:Defaulting to PJRT_DEVICE=CPU
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:Defaulting to PJRT_DEVICE=CPU
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:Defaulting to PJRT_DEVICE=CPU
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
main()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
process_execution_check()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
process_execution_check()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 102, in process_execution_check
assert f.getvalue().rstrip() == ""
with open(path) as f:
AssertionError
FileNotFoundError: [Errno 2] No such file or directory: 'check_main_process_first.txt'
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
process_execution_check()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
assert f.getvalue().rstrip() == ""
AssertionError
main()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
process_execution_check()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 102, in process_execution_check
with open(path) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'check_main_process_first.txt'
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
process_execution_check()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
assert f.getvalue().rstrip() == ""
AssertionError
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
process_execution_check()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
assert f.getvalue().rstrip() == ""
AssertionError
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 804, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 746, in main
process_execution_check()
File "/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py", line 135, in process_execution_check
assert f.getvalue().rstrip() == ""
AssertionError
/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py:315: UserWarning: In XLA autocast, but the target dtype is not supported. Disabling autocast.
XLA Autocast only supports dtype of torch.bfloat16 currently.
warnings.warn(error_message)
[2024-07-02 12:14:36,020] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 129704 closing signal SIGTERM
[2024-07-02 12:14:36,437] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 129705) of binary: /usr/bin/python3.10
Traceback (most recent call last):
File "/usr/local/bin/accelerate-launch", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1088, in main
launch_command(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1073, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/usr/local/lib/python3.10/dist-packages/accelerate/test_utils/scripts/test_script.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-07-02_12:14:36
host : 4f95f1ff1378
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 129706)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-07-02_12:14:36
host : 4f95f1ff1378
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 129707)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-07-02_12:14:36
host : 4f95f1ff1378
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 129708)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2024-07-02_12:14:36
host : 4f95f1ff1378
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 129709)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2024-07-02_12:14:36
host : 4f95f1ff1378
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 129710)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
time : 2024-07-02_12:14:36
host : 4f95f1ff1378
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 129711)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-02_12:14:36
host : 4f95f1ff1378
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 129705)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
How can I resolve this issue?
I am looking for a way to run the script on GPUs.
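For reference, what I am ultimately hoping for is that a minimal launched check like the one below (check_multi_gpu.py is a hypothetical throwaway script) would report MULTI_GPU and cuda devices instead of xla:0:

# check_multi_gpu.py - hypothetical throwaway script, launched with:
#   accelerate launch check_multi_gpu.py
from accelerate import Accelerator

accelerator = Accelerator()
# On a working 8x A100 setup I would expect DistributedType.MULTI_GPU,
# num_processes == 8 and devices cuda:0 ... cuda:7 here, not xla:0.
print(accelerator.distributed_type, accelerator.num_processes,
      accelerator.process_index, accelerator.device)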