Endpoint issue with GPTQ

BlahBlah314 · January 23, 2024, 6:09pm

Hi there! I need help with this one.

I converted a fine-tuned model to GPTQ, and it works well. Now I want to deploy it on an Inference Endpoint, so I created an endpoint with a custom handler. The handler works well: I tested it locally and inference is really quick, so everything seemed perfect. The problem is that the endpoint performs much worse than the handler does when I test it locally.
When I test the endpoint in the Overview, it returns a 504 error. When I call it through the API, it generates the correct output, but it takes more than 1 min 30 s (versus 8 seconds when I tested the handler locally).
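
For anyone who wants to reproduce the comparison, here is a minimal sketch of how such timings can be taken. It is not necessarily how the numbers above were measured: `time_call` is a hypothetical helper, and `fake_handler` is a stand-in for the real custom handler, which would load the GPTQ model and generate text.

```python
import time
from typing import Any, Callable, Tuple


def time_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_handler(payload: dict) -> dict:
    # Hypothetical stand-in for the custom handler's __call__;
    # swap in the real handler instance to time actual inference.
    time.sleep(0.01)
    return {"generated_text": payload["inputs"].upper()}


if __name__ == "__main__":
    out, seconds = time_call(fake_handler, {"inputs": "hello"})
    print(f"local handler: {seconds:.2f}s -> {out}")
```

The same helper can wrap the HTTP request to the endpoint, so both numbers are measured the same way. The endpoint figure then also includes network and any queueing time, which the local figure does not.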

I don’t really understand what’s wrong. Until now, all of my endpoints have worked very well, but this one, with the GPTQ model, doesn’t work properly.

Do you have any idea what’s going on and how to solve it?
