Crash when using accelerate + deepspeed

Hi guys,

First of all, thanks a lot for all the wonderful work you have been delivering with transformers and its various extensions.

My question is:

I was training a huge model on an A100 machine (8 GPUs, each with lots of GPU memory), but I still ran into GPU out-of-memory (OOM) errors. So I configured accelerate with deepspeed support:

accelerate config:

1 machine
8 GPUs
with deepspeed

However, after the model was loaded into GPU memory and before training started, the program crashed with:

ImportError: /root/.cache/torch_extensions/py39_cu113/utils/utils.so: cannot open shared object file: No such file or directory

My environment is:

Python: 3.9.5
Torch: 1.11.0+cu113
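
For anyone debugging the ImportError above, here is a minimal sketch that checks whether each extension directory in the torch extensions cache actually contains its compiled .so file. The helper name check_extension_builds is hypothetical. The default path is assumed from the error message, and the sketch assumes that TORCH_EXTENSIONS_DIR, when set, points directly at the directory holding the per-extension folders.

```python
import os
from pathlib import Path


def check_extension_builds(build_root):
    """Return {extension_name: True/False} for each subdirectory of build_root.

    True means <build_root>/<name>/<name>.so exists. This layout is an
    assumption taken from the path in the error message
    (.../py39_cu113/utils/utils.so), not from any documented API.
    """
    build_root = Path(build_root).expanduser()
    status = {}
    if not build_root.is_dir():
        # Nothing has been built here (or the path is wrong).
        return status
    for ext_dir in sorted(build_root.iterdir()):
        if ext_dir.is_dir():
            status[ext_dir.name] = (ext_dir / f"{ext_dir.name}.so").is_file()
    return status


if __name__ == "__main__":
    # Assumed default location, mirroring the path in the traceback.
    default_root = Path.home() / ".cache" / "torch_extensions" / "py39_cu113"
    root = os.environ.get("TORCH_EXTENSIONS_DIR", str(default_root))
    for name, built in check_extension_builds(root).items():
        print(f"{name}: {'built' if built else 'MISSING .so'}")
```

If a directory such as utils is reported as missing its .so, a commonly suggested remedy is to delete that directory so the extension is rebuilt on the next run. I have not verified that this fixes the crash in this particular setup.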

Everything ran fine without deepspeed support. Could you point me to what I should try next?

Thanks!

Have you found a solution? I’m facing the same issue.