Accelerate test stuck on training

I am using two A100 GPUs to train audio models in a Jupyter environment (a Google Compute Engine instance). I recently got interested in the accelerate package and adjusted my code accordingly (it was rather straightforward), but training always gets stuck on accelerator.backward(loss).
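For context, my adapted loop follows the usual accelerate pattern. The model, data, and hyperparameters below are only placeholders (not my actual audio pipeline), but the hang happens on the same call:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# Dummy data standing in for the real audio features/labels.
dataset = TensorDataset(torch.randn(256, 64), torch.randint(0, 10, (256,)))
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# accelerate wraps the model for multi-GPU (DDP) and shards the dataloader across processes.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for epoch in range(2):
    for inputs, labels in dataloader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = nn.functional.cross_entropy(outputs, labels)
        accelerator.backward(loss)  # <- this is the call my training loop hangs on
        optimizer.step()
```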
Today I tried `accelerate test`, and it ALSO gets stuck:

Running:  accelerate-launch /opt/conda/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py
stderr: WARNING:accelerate.commands.launch:The following values were not passed to `accelerate launch` and had defaults used instead:
stderr:         `--dynamo_backend` was set to a value of `'no'`
stderr: To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout:
stdout: Mixed precision type: no
stdout:
stdout:
stdout: **Test process execution**
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout:
stdout: Mixed precision type: no
stdout:
stdout:
stdout: **Test random number generator synchronization**
stdout: All rng are properly synched.
stdout:
stdout: **DataLoader integration test**
stdout: 1 0 tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout:         18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
stdout:         36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
stdout:         54, 55, 56, 57, 58, 59, 60, 61, 62, 63], device='cuda:1')tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout:         18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
stdout:         36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
stdout:         54, 55, 56, 57, 58, 59, 60, 61, 62, 63], device='cuda:0')  <class 'accelerate.data_loader.DataLoaderShard'><class 'accelerate.data_loader.DataLoaderShard'>
stdout:
stdout: Non-shuffled dataloader passing.
stdout: Shuffled dataloader passing.
stdout: Non-shuffled central dataloader passing.
stdout: Shuffled central dataloader passing.
stdout:
stdout: **Training integration test**
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout:

My accelerate config is set up as multi-GPU (2 cards), using all cards, with NO to all other options.
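For reference, the generated config at ~/.cache/huggingface/accelerate/default_config.yaml looks roughly like this (exact field names may differ between accelerate versions):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false
```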


Have you been able to find a solution to this issue?

Hi! Can you try installing the latest accelerate and rebooting your env? LMK how it goes!