I trained a model and am using the new feature of deploying an adapter model directory from a repo on Hugging Face, but I get the following error when generating a response from the endpoint. The same handler.py gives a response when I run it in Colab.
ERROR | Expected a cuda device, but got: cpu
Along with a warning: UserWarning: Merge lora module to 4-bit linear may get different generations due to rounding errors.
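For context, a hedged sketch of the load path where both messages tend to appear (the repo IDs are placeholders, not my actual repositories; `merge_and_unload` is PEFT's merge API, and merging a LoRA into a 4-bit base is what triggers the rounding warning):

```python
# Sketch of the adapter-loading path where the error surfaces.
# "base/repo-id" and "adapter/repo-id" below are placeholders.

def expected_device(cuda_available: bool) -> str:
    """The device the merged model should land on. Merging a LoRA into a
    4-bit bitsandbytes base requires CUDA, which is why the endpoint can
    raise 'Expected a cuda device, but got: cpu' when no GPU is visible."""
    return "cuda" if cuda_available else "cpu"


def load_merged_model(base_id: str, adapter_id: str):
    # Heavy imports kept inside the function so the helper above stays
    # importable without torch/transformers installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(
        base_id,
        load_in_4bit=True,   # bitsandbytes 4-bit quantization
        device_map="auto",   # places weights on GPU when one is visible
    )
    model = PeftModel.from_pretrained(base, adapter_id)
    # merge_and_unload() is where the 4-bit rounding UserWarning is
    # emitted, and where a device mismatch shows up if the container
    # only sees the CPU.
    return model.merge_and_unload()
```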
I have a similar issue with Inference Endpoints.
I defined a custom handler following the docs, but when I try to load the model (LLaVA) with bitsandbytes quantization, it fails because no GPU is found.
Before trying to set up an endpoint, I played around with a Space via Gradio on the same hardware specs, and everything worked fine.
To me it seems like the GPU is not visible when the endpoint is initialized, but I could be wrong, as I'm completely new to Inference Endpoints.
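One way to confirm that theory is to log what the container sees the moment the handler is constructed, and to guard the quantized load explicitly. A minimal sketch, assuming the standard `EndpointHandler` shape from the custom-handler docs (the fallback behavior is my own choice, not anything the docs prescribe):

```python
# Minimal diagnostic handler sketch: print CUDA visibility at init and
# only request 4-bit quantization when a GPU is actually present.

def choose_load_kwargs(cuda_available: bool) -> dict:
    """Pick model-loading kwargs based on GPU visibility.

    4-bit bitsandbytes quantization requires CUDA, so with no GPU we
    fall back to a plain CPU load instead of crashing inside the loader.
    """
    if cuda_available:
        return {"device_map": "auto", "load_in_4bit": True}
    return {"device_map": {"": "cpu"}}


class EndpointHandler:
    def __init__(self, path: str = ""):
        # Imported here so the helper above stays testable without torch.
        import torch
        from transformers import AutoProcessor, AutoModelForCausalLM

        # First thing: does this container actually see the GPU?
        # This line ends up in the endpoint logs.
        print(f"cuda available at init: {torch.cuda.is_available()}")

        kwargs = choose_load_kwargs(torch.cuda.is_available())
        self.processor = AutoProcessor.from_pretrained(path)
        self.model = AutoModelForCausalLM.from_pretrained(path, **kwargs)
```

If the printed line in the endpoint logs says `cuda available at init: False` on GPU hardware, that would confirm the GPU is not exposed when the handler is initialized.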