Errors running Inference Endpoint with quantized model

I’m encountering an issue while trying to load a quantized model into an Inference Endpoint. The model in question is “TheBloke/Llama-2-70B-chat-GPTQ”, and I’m deploying it on an AWS CPU Medium instance with PyTorch.

The error log suggests that the issue may be related to an incompatible version of bitsandbytes. Here’s the log for reference:

Error log:

2023/09/11 11:06:43 ~ Detected the presence of a quantization_config attribute in the model’s configuration but you don’t have the correct bitsandbytes version to support int8 serialization. Please install the latest version of bitsandbytes with pip install --upgrade bitsandbytes.
2023/09/11 11:06:51 ~ entrypoint.sh: line 13: 7 Killed uvicorn webservice_starlette:app --host 0.0.0.0 --port 5000

Does anyone have any suggestions on how to resolve this? The goal is to deploy lower-cost, CPU-based inference endpoints in production.
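For reference, this is roughly how I understand the checkpoint would be loaded with transformers directly. It is only a sketch assuming the standard GPTQ integration, which as far as I can tell needs the optimum and auto-gptq packages and a CUDA GPU rather than a CPU instance:

```python
# Sketch: roughly how the GPTQ checkpoint would be loaded with transformers
# outside of Inference Endpoints. Assumes transformers >= 4.32 with the
# optimum and auto-gptq packages installed, plus a CUDA GPU with enough
# memory; the GPTQ kernels do not run on CPU as far as I know.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-70B-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # place the quantized weights on the available GPU(s)
)

prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```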

Can you throw your .py into a container?

I’m just getting started in this area, so I’m unsure how to retrieve the .py file, since the Inference Endpoint is generated automatically upon model import…
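Is the idea something like a custom handler.py committed to the model repository? Here is a minimal sketch of what I think that would look like, with the task and parameters as placeholders rather than a setup I have tested:

```python
# handler.py -- minimal sketch of a custom handler, placed at the root of the
# model repository so the Inference Endpoint uses it instead of the default
# toolkit code. The pipeline task and parameters are placeholders only.
from typing import Any, Dict, List

from transformers import pipeline


class EndpointHandler:
    def __init__(self, path: str = ""):
        # `path` is the local directory the endpoint downloads the repo into
        self.pipe = pipeline("text-generation", model=path)

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        prompt = data.pop("inputs", data)
        parameters = data.pop("parameters", {})
        return self.pipe(prompt, **parameters)
```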