Can I create an endpoint using a quantized model?

Hello everyone, I’m trying to create an endpoint using the quantized model “TheBloke/llama-2-7B-chat-GPTQ” but I’m facing errors.

So I’m wondering whether it’s possible to create an endpoint with it at all!

What errors are you facing? Are you selecting “gptq” in the advanced section?
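If the UI keeps failing, it can also help to create the endpoint programmatically so the full configuration is visible in one place. Below is a minimal sketch using `create_inference_endpoint` from a recent `huggingface_hub`; as far as I can tell, the `QUANTIZE` environment variable passed to the text-generation-inference container is what that dropdown maps to. The vendor, region, instance type/size, and image tag are placeholder values to adapt to your account:

```python
from huggingface_hub import create_inference_endpoint

# Minimal sketch: create an Inference Endpoint serving a GPTQ checkpoint
# with text-generation-inference (TGI). Vendor, region, instance type/size
# and the image tag are placeholders -- adjust them for your account.
endpoint = create_inference_endpoint(
    "llama-2-7b-chat-gptq",                      # endpoint name (any slug)
    repository="TheBloke/llama-2-7B-chat-GPTQ",  # model id from the question
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",
    instance_type="nvidia-a10g",
    custom_image={
        "health_route": "/health",
        "url": "ghcr.io/huggingface/text-generation-inference:latest",
        "env": {
            "MODEL_ID": "/repository",
            "QUANTIZE": "gptq",  # same as choosing "gptq" in the UI dropdown
        },
    },
)

endpoint.wait()      # block until the endpoint is running (raises on failure)
print(endpoint.url)  # base URL to send requests to
```

One nice side effect of this route is that the container logs and the exact env you passed are easy to compare, which makes it clearer whether the failure is in the quantization setting or elsewhere.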

Our organization is also testing out HF inference endpoints.

It’s not totally clear from the documentation how to run quantized models.
For example, if I select the model mistral-7b-instruct-v0-1 and choose GPTQ in the container configuration, will that work? I just see errors in the logs, and the endpoint fails to update.

And what if I choose a quantized model to begin with, such as TheBloke/Mistral-7B-Instruct-v0.1-GPTQ or TheBloke/Mistral-7B-OpenOrca-AWQ? What do I select in the container config for quantization then? I assume None (see the sketch at the end of this post).

In either case, I have not yet been able to get an endpoint working with a quantized model.

Put another way, it’s not clear whether “GPTQ” in the container configuration means “quantize this model” or “I chose a quantized model, so make sure to check this box.”
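For comparison, here is roughly the programmatic shape of what I’ve been trying for the pre-quantized case (a sketch via `huggingface_hub`; my unconfirmed assumption is that `QUANTIZE` names the format the weights are already in, i.e. “gptq” or “awq”, rather than being left unset; instance type/size, region, and image tag are placeholders):

```python
from huggingface_hub import create_inference_endpoint

# Sketch of the pre-quantized case. My assumption (unconfirmed) is that
# QUANTIZE tells text-generation-inference how to *load* the checkpoint,
# so it should match the format of the weights: "gptq" or "awq".
# Instance type/size, region, and image tag are placeholders.
endpoint = create_inference_endpoint(
    "mistral-7b-openorca-awq",
    repository="TheBloke/Mistral-7B-OpenOrca-AWQ",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",
    instance_type="nvidia-a10g",
    custom_image={
        "health_route": "/health",
        "url": "ghcr.io/huggingface/text-generation-inference:latest",
        "env": {
            "MODEL_ID": "/repository",
            "QUANTIZE": "awq",  # assumption: matches the checkpoint's format
        },
    },
)

endpoint.wait()
```

If that assumption is wrong and the setting really means “quantize on the fly,” then None would presumably be correct for pre-quantized repos, which is exactly the ambiguity I’d like cleared up.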

Were you able to get an answer for this? :thinking: