Hi there! I need help with this one.
I converted a fine-tuned model to a GPTQ model, and it works well. Now I want to deploy it with an Inference Endpoint, so I created an endpoint with a custom handler. I tested the handler locally and inference is really quick, so everything seemed perfect. The problem is that the deployed endpoint performs much worse than the handler does when I run it locally.
When I test the endpoint in the Overview, it returns a 504 error. When I test it through the API, it generates the correct output, but it takes more than 1 min 30 s to do so (it took about 8 seconds when I tested the handler locally).
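To make the comparison concrete, here is a minimal sketch of how the two timings could be measured the same way. `time_call` is only an illustrative helper, and the names in the commented usage (`ENDPOINT_URL`, `HEADERS`, the payload) are placeholders, not my actual setup.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical usage (placeholder names, adjust to your own handler/endpoint):
#   import requests
#   from handler import EndpointHandler
#   handler = EndpointHandler(path=".")
#   _, local_s = time_call(handler, {"inputs": "Hello"})
#   _, remote_s = time_call(
#       requests.post, ENDPOINT_URL, headers=HEADERS, json={"inputs": "Hello"}
#   )
#   print(f"local: {local_s:.1f}s, endpoint: {remote_s:.1f}s")
```

Timing both calls with the same payload should at least show whether the extra time is spent in generation itself or somewhere around it.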
I don’t really understand what’s wrong. Until now, all of my endpoints have worked very well, but this one, with the GPTQ model, doesn’t work properly.
Do you have any idea what’s going on and how to solve it?