Errors running Inference Endpoint with quantized model

I’m encountering an issue while trying to load a quantized model into an Inference Endpoint. The model in question is “TheBloke/Llama-2-70B-chat-GPTQ”, and I’m using an AWS CPU Medium setup with PyTorch.

The error log suggests that the issue may be related to an incompatible version of bitsandbytes. Here’s the log for reference:

Error log:

2023/09/11 11:06:43 ~ Detected the presence of a quantization_config attribute in the model’s configuration but you don’t have the correct bitsandbytes version to support int8 serialization. Please install the latest version of bitsandbytes with pip install --upgrade bitsandbytes.
2023/09/11 11:06:51 ~ line 13: 7 Killed uvicorn webservice_starlette:app --host --port 5000

Does anyone have any suggestions on how to resolve this? The goal is to deploy lower-cost, CPU-based inference endpoints in production.
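As a first sanity check, you can compare the installed bitsandbytes version against the minimum that int8 serialization requires. This is a stdlib-only sketch; the `0.37.0` threshold is an assumption for illustration — check the transformers release notes for the exact minimum version:

```python
# Hypothetical version check for bitsandbytes int8-serialization support.
# The 0.37.0 minimum is an assumption, not a confirmed requirement.
def parse_version(v: str) -> tuple:
    """Turn a dotted version string like '0.41.1' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def supports_int8_serialization(installed: str, minimum: str = "0.37.0") -> bool:
    """Return True if the installed version meets the assumed minimum."""
    return parse_version(installed) >= parse_version(minimum)

print(supports_int8_serialization("0.35.4"))  # an older version fails the check
print(supports_int8_serialization("0.41.1"))  # a recent version passes
```

You could run a check like this inside the endpoint environment (or a matching local virtualenv) against the output of `pip show bitsandbytes` to confirm whether the warning in the log is the actual cause.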

Can you throw your .py into a container?

I’m just getting started in this area, and I’m unsure how to retrieve the .py file, since the Inference Endpoint is generated automatically upon model import…
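For what it’s worth, Inference Endpoints let you override the auto-generated inference code by adding a `handler.py` with an `EndpointHandler` class to the model repository. Below is a minimal stub following that pattern; the model loading is deliberately left as a comment and the echo response is a placeholder, so treat it as a skeleton rather than a working deployment:

```python
# Hypothetical handler.py skeleton for a custom Inference Endpoint handler.
# The EndpointHandler class name and __call__ signature follow the pattern
# Hugging Face documents for custom handlers; model loading is stubbed out.
from typing import Any, Dict, List

class EndpointHandler:
    def __init__(self, path: str = ""):
        # In a real handler you would load the model from `path` here,
        # e.g. with transformers / auto-gptq. Note that GPTQ checkpoints
        # are generally intended for GPU inference.
        self.path = path

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        inputs = data.get("inputs", "")
        # Placeholder response; replace with actual text generation.
        return [{"generated_text": f"echo: {inputs}"}]

handler = EndpointHandler()
print(handler({"inputs": "hello"}))  # [{'generated_text': 'echo: hello'}]
```

Committing a file like this to the model repo is how you would get control over the inference code without needing to extract whatever the endpoint generates automatically.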