I’m encountering an issue while trying to load a quantized model into an Inference Endpoint. The model in question is “TheBloke/Llama-2-70B-chat-GPTQ”, and I’m deploying it on an AWS CPU Medium instance with PyTorch.
The error log suggests that the issue may be related to an incompatible version of bitsandbytes. Here’s the log for reference:
```
2023/09/11 11:06:43 ~ Detected the presence of a quantization_config attribute in the model's configuration but you don't have the correct bitsandbytes version to support int8 serialization. Please install the latest version of bitsandbytes with pip install --upgrade bitsandbytes.
2023/09/11 11:06:51 ~ entrypoint.sh: line 13:     7 Killed                  uvicorn webservice_starlette:app --host 0.0.0.0 --port 5000
```
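In case it helps anyone reproduce: here’s a minimal sketch for inspecting the repo’s quantization settings locally without downloading any weights, assuming a transformers version recent enough to expose the `quantization_config` attribute on the loaded config. I’d expect this repo to report GPTQ parameters (bits, group size) rather than bitsandbytes int8, which makes me wonder whether upgrading bitsandbytes alone would actually fix anything:

```python
from transformers import AutoConfig

# Load only the config (no weights) to see how the model is quantized.
config = AutoConfig.from_pretrained("TheBloke/Llama-2-70B-chat-GPTQ")
print(config.quantization_config)
```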
Does anyone have any suggestions on how to resolve this? The goal is to deploy lower-cost, CPU-based inference endpoints in production.
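One more data point: the `Killed` line makes me suspect the process is being OOM-killed rather than failing on the bitsandbytes version check. A rough back-of-envelope (the 4-bit width is my assumption for this repo; the branch’s config.json would confirm it):

```python
# Estimate the raw weight footprint of a 70B model quantized with GPTQ.
params = 70e9          # 70B parameters
bits_per_param = 4     # assuming 4-bit GPTQ
weight_gb = params * bits_per_param / 8 / 1e9
print(f"~{weight_gb:.0f} GB of weights")  # ~35 GB, before any runtime overhead
```

If that estimate is in the right ballpark, the weights alone would far exceed the RAM on a CPU Medium instance, which would explain the kill regardless of the bitsandbytes warning.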