ERROR | Expected a cuda device, but got: cpu

I have a similar issue when using inference endpoints.

I defined a Custom Handler following the docs, but when I try to load the model (LLaVA) quantized with bitsandbytes, it fails because the GPU is not found.

Before trying to set up an endpoint, I experimented with a Space via Gradio using the same hardware specs, and everything worked fine.

To me it seems like the GPU is not present when the endpoint is initialized, but I could be wrong, as I'm completely new to Inference Endpoints.
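To check that theory, a small diagnostic like the following could be run at the top of the handler's `__init__` (the function name is mine; it just prints what CUDA sees at load time):

```python
# Quick diagnostic: confirm whether a GPU is visible when the handler loads.
import os
import torch


def report_cuda() -> dict:
    info = {
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
        "visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }
    # This shows up in the endpoint's container logs.
    print(info)
    return info
```

If `cuda_available` is `False` in the endpoint logs but `True` in the Space, that would confirm the GPU isn't attached (or not yet) when the handler initializes.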