Hi all,
First of all, thanks a lot for all the wonderful work you've been delivering with transformers and its various extensions.
My question is:
I was training a large model on an A100 machine (8 GPUs, each with plenty of GPU memory), but I was still running into out-of-memory (OOM) errors. So I configured accelerate with DeepSpeed support:
accelerate config:
- 1 machine
- 8 GPUs
- with DeepSpeed
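For context, the generated accelerate config file looks roughly like this (a simplified sketch, not my exact file; values such as the ZeRO stage and mixed-precision mode are placeholders):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
num_machines: 1
num_processes: 8            # one process per GPU
mixed_precision: fp16       # placeholder; may differ in my actual config
deepspeed_config:
  zero_stage: 2             # placeholder; actual stage may differ
  gradient_accumulation_steps: 1
```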
However, after the model is loaded into GPU memory but before training starts, the program crashes with:
ImportError: /root/.cache/torch_extensions/py39_cu113/utils/utils.so: cannot open shared object file: No such file or directory
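From what I understand, that utils.so is a DeepSpeed op that gets JIT-compiled into the torch extensions cache on first use. One workaround I'm considering (unverified; the directory name below just matches the one in my traceback) is clearing that cache so the op gets rebuilt on the next launch:

```shell
# py39_cu113 matches the traceback (Python 3.9 / CUDA 11.3 build tag).
# Removing the cached build forces DeepSpeed to recompile its ops next run.
rm -rf "$HOME/.cache/torch_extensions/py39_cu113"
```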
My environment is:
Python: 3.9.5
Torch: 1.11.0+cu113
Everything runs fine without DeepSpeed support. Could you point me in the right direction?
Thanks!