I have fine-tuned a model using QLoRA (4-bit) and pushed it to the Hub. Then I tried to test it with the Inference API, but I get the following error:
No package metadata was found for bitsandbytes
The model:
base model is meta-llama/Llama-2-7b-chat-hf
fine-tuned on a custom dataset of 50 samples (I am just experimenting)
this is the colab notebook I used to train it. Note that, after the QLoRA training, I merged the adapter model with the base model and THEN pushed the result to the hub, so the model on the hub is a plain transformers model, specifically a transformers.models.llama.modeling_llama.LlamaForCausalLM (see the sketch after this list)
this is a colab notebook that can be used for testing. Note that the test works for the base model meta-llama/Llama-2-7b-chat-hf
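For context, the merge-and-push step looks roughly like this (a minimal sketch assuming the adapter was trained with PEFT; the adapter and merged repo names are placeholders, not my actual repos):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-chat-hf"
adapter_id = "my-user/my-qlora-adapter"        # placeholder: repo with the trained LoRA adapter
merged_id = "my-user/llama-2-7b-chat-merged"   # placeholder: target repo for the merged model

# Load the base model in half precision (not 4-bit) so the merge produces plain weights
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_id)

# Fold the LoRA weights into the base model and push the resulting LlamaForCausalLM
merged = model.merge_and_unload()
merged.push_to_hub(merged_id)

tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.push_to_hub(merged_id)
```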
My suspicion is that the Docker container behind the Inference API does not know that it needs to install bitsandbytes. Is there a way to "tell" it? Maybe a tag in the README?
First, make sure you have a requirements file that lists the needed packages.
Also, check that your torch version is compatible with bitsandbytes.
Finally, if you are working with Docker, you can either add it to the requirements file or install it directly in the Dockerfile with something like:
RUN pip install -U bitsandbytes
You can check locally by building the image with docker build and then starting a container with docker run to verify that everything installs correctly.
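As a rough sketch of the two options and the local check (the file contents and image tag here are assumptions, not what the Inference API actually uses):

```text
# requirements.txt (hypothetical contents)
torch
transformers
accelerate
bitsandbytes
```

```bash
# Build the image from your Dockerfile, then run a one-off container
# to confirm that bitsandbytes is actually importable inside it.
docker build -t my-inference-image .
docker run --rm my-inference-image python -c "import bitsandbytes; print(bitsandbytes.__version__)"
```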
I went through the exact same process (fine-tune, merge, push).
Same problem here and same error using the Inference API.
I have also tried to deploy an Inference Endpoint and I get another error: Calling `cuda()` is not supported for `4-bit` or `8-bit` quantized models.
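For what it's worth, that second error is raised when a quantized model gets moved with .cuda() or .to("cuda") after loading; a 4-bit model is normally placed on the GPU via device_map at load time instead. A minimal sketch of that loading pattern (the repo name is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "my-user/llama-2-7b-chat-merged"  # placeholder for the fine-tuned repo

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# device_map places the quantized weights on the GPU at load time;
# do NOT call model.cuda() or model.to("cuda") afterwards.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```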
Same issue, same message on the Inference API (No package metadata was found for bitsandbytes). I tried adding transformers to requirements.txt, but that did not work either.