I’m encountering an issue while trying to load a quantized model into an Inference Endpoint. The model in question is “TheBloke/Llama-2-70B-chat-GPTQ”, and I’m using an AWS CPU Medium setup with PyTorch.
The error log suggests that the issue may be related to an incompatible version of bitsandbytes. Here's the log for reference:
2023/09/11 11:06:43 ~ Detected the presence of a quantization_config attribute in the model's configuration but you don't have the correct bitsandbytes version to support int8 serialization. Please install the latest version of bitsandbytes with pip install --upgrade bitsandbytes.
2023/09/11 11:06:51 ~ entrypoint.sh: line 13: 7 Killed uvicorn webservice_starlette:app --host 0.0.0.0 --port 5000
Does anyone have any suggestions on how to resolve this? The goal is to deploy lower-cost, CPU-based inference endpoints in production.