Issues with building extensions in DeepSpeed

I get this error even after following the instructions on the DeepSpeed installation page and in the HF Trainer docs. Can anyone suggest how to fix this?
I’ve followed @stas’s replies on GH issues, but this error keeps coming up. pybind11 is installed on my machine, so I’m not sure what this is indicating.

@prajjwal1 - something is off in your python env.

The mantra should always be - let’s see if others have asked this question already, so here try https://www.google.com/search?q=pybind11%2Fpybind11.h+no+such+file+or+directory

Please avoid posting images for tracebacks - instead copy-n-paste the output using code blocks - this is because it’s impossible to copy-n-paste from the image to do the search for you.

That said, if the issue continues after you have tried the solutions offered on the top matching pages - please post the details at Issues · microsoft/DeepSpeed · GitHub


Okay, I think I should have provided more details, since you seem to think I didn’t look elsewhere, so here they are.
The issue arises when the Trainer tries to use the cpu_adam extension. There are two ways to build it: the JIT way, which builds the extension on the fly, and prebuilding the extensions. Both fail with the same error trace related to pybind11. So I searched for cpu_adam and landed on this issue. That issue is still open, and I followed the details you provided there. Per the Trainer docs, the local rank doesn’t need to be configured, as you said. I did set PATH and LD_LIBRARY_PATH as well.
I also saw this, and tried pre-building as described there:



git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="6.1;8.6" DS_BUILD_OPS=1 python setup.py build_ext -j8 bdist_wheel

This doesn’t seem to work for the same reason. Looked into 1, 2 as well.

Thank you for sharing the details on what you have already tried - that helps a lot.

I assume prebuilding is failing with the same error about the missing pybind11/pybind11.h, correct?

So, next, let’s do a very simple thing: check that you have pybind11/pybind11.h in the python environment that you use to build deepspeed.

I see that it comes bundled with torch, so e.g. I have it under my conda env:

/mnt/nvme1/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h

So in your case there should be one under /usr/lib/python3/dist-packages/torch/include, since this is one of the -I include flags in your error log. So for some reason you probably lack:

/usr/lib/python3/dist-packages/torch/include/pybind11/pybind11.h

and if it is so, then your package is borked. I have no idea how you came about this package, it appears like some non-conda/non-pip distribution package - but is it possible that it has torch and torch-dev packages? In which case you’re missing the dev side of things, which would have all the required headers. At least that’s how all .deb packages are made.

So if you installed this package as:

apt install torch

you now need:

apt install torch-dev

At this point you probably want to tell us what OS/dist you’re on and how you installed torch.
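
To see which include directories actually contain the header, something like the following sketch may help. It assumes torch is importable in the environment you build from; find_header is a made-up helper name for illustration, while include_paths comes from torch.utils.cpp_extension and returns the directories torch hands to the compiler as -I flags.

```python
import os


def find_header(include_dirs, rel_path="pybind11/pybind11.h"):
    """Return the full paths under include_dirs where rel_path exists.

    find_header is a hypothetical helper, not part of torch or DeepSpeed.
    """
    hits = []
    for inc in include_dirs:
        candidate = os.path.join(inc, rel_path)
        if os.path.isfile(candidate):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    try:
        import torch
        from torch.utils.cpp_extension import include_paths
    except ImportError:
        print("torch is not importable in this environment")
    else:
        # Shows which torch installation this interpreter picks up.
        print("torch imported from:", torch.__file__)
        found = find_header(include_paths())
        print("pybind11.h found at:", found if found else "nowhere")
```

If the printed torch location is a system-wide directory and the header list is empty, that would match the error in your log.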

Thanks for replying. So I don’t see pybind11 in the include directory. I see ATen, c10, c10d, caffe2, fp16, TH, THC, THCUNN and torch. I do not have root rights on the server. The server runs Ubuntu 20.04. I installed torch via pip. I’m not sure why we need to use apt for this, or maybe you’re just showing it as an example, if I understand correctly.

I want to train CTRL on 24x4 GPUs, but since it doesn’t have model parallelism built in, it won’t even fit on one GPU with a batch size of 1.

Most likely you’re not using the pytorch you installed but a system-wide one. If you don’t have sudo access you wouldn’t have been able to install torch into system-wide dirs, and your error message shows that it is a system-wide torch you’re trying to use.

Before you edited this last comment - you shared how you installed it - so based on that info we can validate that the distributed package is whole:

$ wget https://download.pytorch.org/whl/cu111/torch-1.8.1%2Bcu111-cp38-cp38-linux_x86_64.whl
$ unzip torch-1.8.1+cu111-cp38-cp38-linux_x86_64.whl
$ find torch | grep pybind11.h
torch/include/pybind11/pybind11.h

The header is there.
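
The same check can be done without unpacking anything, since a wheel is a zip archive. This is only a sketch using the standard zipfile module; wheel_members_matching is a name I made up for illustration, and the wheel path is whatever file you downloaded.

```python
import sys
import zipfile


def wheel_members_matching(wheel_path, suffix="pybind11/pybind11.h"):
    """List archive members of a wheel whose names end with suffix.

    Hypothetical helper for illustration; a .whl file is a plain zip archive.
    """
    with zipfile.ZipFile(wheel_path) as wheel:
        return [name for name in wheel.namelist() if name.endswith(suffix)]


if __name__ == "__main__":
    # Usage: python check_wheel.py path/to/some.whl [more.whl ...]
    for path in sys.argv[1:]:
        print(path, "->", wheel_members_matching(path))
```

An empty list for a torch wheel would suggest that the distributed package itself lacks the header, which is not what the find output above shows.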

Whoever installed torch on that system didn’t do a full install. And you’re attempting to use the system-wide torch and not the torch you presumably installed yourself.

I recommend you create a dedicated conda env (or any other preferred python virt env of your choice), activate it, install torch and whatever else you need and everything will just work.

Of course, make sure you’re inside the virtual environment when you run your code.
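
A quick way to confirm which interpreter and which torch you are actually using is a sketch along these lines. The helper names are made up for illustration, and the environment detection is a best-effort heuristic: it compares sys.prefix with sys.base_prefix for venv/virtualenv and looks for a conda-meta directory for conda.

```python
import importlib.util
import os
import sys


def in_virtual_env():
    """Best-effort guess at whether this interpreter lives in an isolated env.

    Heuristic only: venv/virtualenv change sys.prefix, and conda
    environments contain a conda-meta directory under sys.prefix.
    """
    venv_active = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    conda_env = os.path.isdir(os.path.join(sys.prefix, "conda-meta"))
    return venv_active or conda_env


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("isolated env detected:", in_virtual_env())
    print("torch would load from:", module_origin("torch"))
```

If the torch path starts with /usr/lib/python3/dist-packages, you are still picking up the system-wide install rather than the one in your own environment.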

Alternatively contact your sysadmin and request installing the missing files. I was guessing that this was some kind of custom .deb package which only installed the essentials and left out any dev files, hence the discussion of apt.

Of course, you can also do pip install pybind11 to solve the immediate problem, but chances are that other header files required by torch will be missing as well. So it’s best to solve this problem at the root.

Thank you very much for the swift and detailed replies. I figured out the problem with the torch installation after I saw that pybind11 was not present. I will reinstall it on my end. Hopefully it will be fixed once I verify that the pybind11 headers are present.

EDIT: Installing torch 1.8 worked. I ensured pybind11 is present.
