Can I create an endpoint using a quantized model?

Hello everyone, I’m trying to create an endpoint using the quantized model “TheBloke/llama-2-7B-chat-GPTQ” but I’m facing errors.

So I’m wondering whether it’s possible to create an endpoint with it at all!

What errors are you facing? Are you selecting “gptq” in the advanced section?
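If the UI keeps failing, it can also help to create the endpoint programmatically so the full configuration is visible in one place. Below is a minimal sketch using `create_inference_endpoint` from a recent `huggingface_hub`; as far as I can tell, the `QUANTIZE` environment variable passed to the text-generation-inference container is what that dropdown maps to. The vendor, region, instance type/size, and image tag are placeholder values to adapt to your account:

```python
from huggingface_hub import create_inference_endpoint

# Minimal sketch: create an Inference Endpoint serving a GPTQ checkpoint
# with text-generation-inference (TGI). Vendor, region, instance type/size
# and the image tag are placeholders -- adjust them for your account.
endpoint = create_inference_endpoint(
    "llama-2-7b-chat-gptq",                      # endpoint name (any slug)
    repository="TheBloke/llama-2-7B-chat-GPTQ",  # model id from the question
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",
    instance_type="nvidia-a10g",
    custom_image={
        "health_route": "/health",
        "url": "ghcr.io/huggingface/text-generation-inference:latest",
        "env": {
            "MODEL_ID": "/repository",
            "QUANTIZE": "gptq",  # same as choosing "gptq" in the UI dropdown
        },
    },
)

endpoint.wait()      # block until the endpoint is running (raises on failure)
print(endpoint.url)  # base URL to send requests to
```

One nice side effect of this route is that the container logs and the exact env you passed are easy to compare, which makes it clearer whether the failure is in the quantization setting or elsewhere.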

Our organization is also testing out HF inference endpoints.

It’s not totally clear from the documentation how to run quantized models.
For example, if I select the model mistral-7b-instruct-v0-1 and choose GPTQ in the container configuration, will that work? I just see errors in the logs, and the endpoint fails to update.

And what if I choose a quantized model to begin with, such as TheBloke/Mistral-7B-Instruct-v0.1-GPTQ or TheBloke/Mistral-7B-OpenOrca-AWQ? What do I select in the container config for quantization then? I assume None (see the sketch at the end of this post).

In either case, I have not yet been able to get an endpoint working with a quantized model.

Put another way, it’s not clear whether “GPTQ” in the container configuration means “quantize this model” or “I chose a quantized model, so make sure to check this box.”
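For comparison, here is roughly the programmatic shape of what I’ve been trying for the pre-quantized case (a sketch via `huggingface_hub`; my unconfirmed assumption is that `QUANTIZE` names the format the weights are already in, i.e. “gptq” or “awq”, rather than being left unset; instance type/size, region, and image tag are placeholders):

```python
from huggingface_hub import create_inference_endpoint

# Sketch of the pre-quantized case. My assumption (unconfirmed) is that
# QUANTIZE tells text-generation-inference how to *load* the checkpoint,
# so it should match the format of the weights: "gptq" or "awq".
# Instance type/size, region, and image tag are placeholders.
endpoint = create_inference_endpoint(
    "mistral-7b-openorca-awq",
    repository="TheBloke/Mistral-7B-OpenOrca-AWQ",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",
    instance_type="nvidia-a10g",
    custom_image={
        "health_route": "/health",
        "url": "ghcr.io/huggingface/text-generation-inference:latest",
        "env": {
            "MODEL_ID": "/repository",
            "QUANTIZE": "awq",  # assumption: matches the checkpoint's format
        },
    },
)

endpoint.wait()
```

If that assumption is wrong and the setting really means “quantize on the fly,” then None would presumably be correct for pre-quantized repos, which is exactly the ambiguity I’d like cleared up.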

Were you able to get an answer for this? :thinking: