Hello HF Forum! I'm trying to deploy a GPTQ-quantized `llama-2-13b-hf` model via Inference Endpoints. However, every time my custom handler tries to initialize the quantized model, the log shows the error `entrypoint.sh: line 13: 28 Killed uvicorn webservice_starlette:app --host 0.0.0.0 --port 5000`, and the whole model-initialization process then restarts from the beginning. What is causing this, and how can I solve it? Please let me know!
My model is here: Cartinoe5930/llama-2-13B-GPTQ
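My custom handler follows the usual Inference Endpoints `EndpointHandler` pattern, roughly like this (a simplified sketch, not my exact file; the generation parameters are illustrative):

```python
# Simplified sketch of an Inference Endpoints custom handler for a GPTQ model.
# Assumes the endpoint image has transformers (with GPTQ support) installed;
# max_new_tokens and dtype below are illustrative choices, not requirements.
from typing import Any, Dict

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class EndpointHandler:
    def __init__(self, path: str = ""):
        # `path` points at the model repository files on the endpoint
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.model = AutoModelForCausalLM.from_pretrained(
            path,
            device_map="auto",         # place weights on the available GPU(s)
            torch_dtype=torch.float16,
        )

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Standard request shape: {"inputs": "<prompt>"}
        inputs = self.tokenizer(data["inputs"], return_tensors="pt").to(
            self.model.device
        )
        with torch.no_grad():
            output_ids = self.model.generate(**inputs, max_new_tokens=128)
        text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return {"generated_text": text}
```

The crash happens during `__init__`, i.e. while the model weights are being loaded.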
P.S. I used 1x Tesla T4 to deploy the model, since loading a quantized model should not need much GPU RAM.