Issue deploying quantized meta-llama/Llama-3.1-8B-Instruct on AWS SageMaker

I quantized meta-llama/Llama-3.1-8B-Instruct to 4-bit with bitsandbytes and tested it using the following ECR images:

  1. huggingface-pytorch-tgi-inference:
    763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04-v1.0
  2. pytorch-inference:
    763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.4.0-gpu-py311-cu124-ubuntu22.04-sagemaker

The code works fine locally. However, when I deploy it to a SageMaker endpoint, the container crashes with the following logs (the same for both ECR images):
2024-10-10T19:22:25,964 [WARN ] W-9003-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died.
2024-10-10T19:22:25,968 [INFO ] W-9003-model_1.0-stdout MODEL_LOG - File "/opt/ml/model/code/inference.py", line 2, in
2024-10-10T19:22:25,968 [INFO ] W-9003-model_1.0-stdout MODEL_LOG - from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
2024-10-10T19:22:25,968 [INFO ] W-9003-model_1.0-stdout MODEL_LOG - ModuleNotFoundError: No module named 'transformers'

For reference, the model.tar.gz is laid out like this:

model.tar.gz
|- model artifacts
|- code/
   |- inference.py      # inference script
   |- requirements.txt  # optional; installs additional dependencies (if supported by the framework version)
And requirements.txt is as follows:

transformers>=4.45
accelerate==0.34.2
bitsandbytes==0.44.1
peft==0.13.1

This is a strange issue, as the same code works locally in a custom Docker image built on top of the ECR images mentioned above. Is there a known resolution, or am I missing something?
