Crash happened with accelerate + deepspeed

jasonme · May 20, 2022, 2:19pm

Hi Guys,

First of all, thanks a lot for all the wonderful work you have been delivering with transformers and its various extensions.

My question is:

I was training a huge model on an A100 machine (8 GPUs, each with plenty of GPU memory), but I still ran into GPU out-of-memory (OOM) errors. So I configured accelerate with deepspeed support:

accelerate config:

1 machine
8 GPUs
with deepspeed

However, after the model was loaded into GPU memory and before training started, the program crashed with:

ImportError: /root/.cache/torch_extensions/py39_cu113/utils/utils.so: cannot open shared object file: No such file or directory

My environment is:

Python: 3.9.5
Torch: 1.11.0+cu113

Everything ran fine without deepspeed support. Please point me in a direction I can follow.

Thanks!

jindaliuzi · July 8, 2022, 7:38am

Have you found a solution? I’m facing the same issue.
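
For anyone hitting the same ImportError, a first diagnostic step might be to check what is actually present in the extension cache directory named in the traceback. The sketch below is only an illustration: `inspect_extension` is a hypothetical helper, not part of torch, accelerate, or deepspeed, and the default path is copied from the error message above. It assumes the extension lives in a subdirectory named after the module (`utils/utils.so`), as the traceback suggests, and that `TORCH_EXTENSIONS_DIR`, if set, replaces the default cache root.

```python
import os
from pathlib import Path


def inspect_extension(ext_root, name):
    """Report what exists on disk for a JIT-built extension (hypothetical helper)."""
    ext_dir = Path(ext_root) / name
    so_path = ext_dir / f"{name}.so"
    # List whatever is in the directory, e.g. leftover build files or a lock file
    files = sorted(p.name for p in ext_dir.iterdir()) if ext_dir.is_dir() else []
    return {
        "dir_exists": ext_dir.is_dir(),
        "so_exists": so_path.is_file(),
        "files": files,
    }


if __name__ == "__main__":
    # Default path is taken from the ImportError in the original post (assumption:
    # your cache root is the same unless TORCH_EXTENSIONS_DIR overrides it).
    root = os.environ.get(
        "TORCH_EXTENSIONS_DIR", "/root/.cache/torch_extensions/py39_cu113"
    )
    print(inspect_extension(root, "utils"))
```

If the directory exists but contains no `utils.so`, an interrupted or failed build of the extension is one possible explanation. That is an assumption, not something confirmed in this thread, and the build log would be the place to look next.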
