I have quantized the meta-llama/Llama-3.1-8B-Instruct model using BitsAndBytesConfig. However, when I try deploying it to a SageMaker endpoint, it throws an error. I triaged this in a Docker container locally and see the same issue reproduced. Sharing my Dockerfile and the working snippets from my Jupyter notebook for reference:
Dockerfile:
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04-v1.0
COPY requirements.txt .
RUN pip install -r requirements.txt
# Set the working directory
WORKDIR /opt/ml
# Copy your script into the container
COPY inference.py .
# Install your dependencies
# RUN pip install --upgrade "transformers>=4.46.0" accelerate bitsandbytes peft
# Run the script (if you want it to execute automatically)
ENTRYPOINT ["python", "inference.py"]
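The inference.py that the Dockerfile copies in isn't reproduced in full here; the part that matters for the failure further down (line 70 in the traceback) loads the saved model and tokenizer from the mounted /opt/ml/model directory. A simplified sketch of what it does (variable names and the surrounding structure are approximate, not the exact script):

import logging

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_dir = "/opt/ml/model"  # mounted via -v in the docker run command below

# re-passing a quantization config here is what triggers the
# "already has a quantization_config attribute" warning in the logs below
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, quantization_config=bnb_config, device_map="auto")

try:
    tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)  # line 70 in the traceback
except Exception as e:
    logging.error(f"Error loading tokenizer: {e}")
    raise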
requirements.txt:
transformers==4.44.2
accelerate==0.34.2
bitsandbytes==0.44.1
peft==0.13.1
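To rule out a mismatch between what the notebook uses and what actually ends up inside the image (the base TGI image ships its own transformers/tokenizers, which the requirements.txt pins may or may not fully replace), a quick check of the versions the built image resolves, using the same 8b:v1 tag as in the run command below, is something like:

docker run --rm --entrypoint python 8b:v1 -c "import transformers, tokenizers; print(transformers.__version__, tokenizers.__version__)"

The tokenizers version is the one I am most suspicious of, since the failure below comes out of TokenizerFast.from_file.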
Error:
sh-4.2$ docker run -it --gpus all -v /home/ec2-user/SageMaker/saved_models/Llama-3.1-8B-Instruct-test1:/opt/ml/model -e HF_MODEL_ID=/opt/ml/model 8b:v1 --model-id /opt/ml/model --quantize bitsandbytes
Unused kwargs: ['device_map']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
/opt/conda/lib/python3.9/site-packages/transformers/quantizers/auto.py:174: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.
warnings.warn(warning_msg)
`low_cpu_mem_usage` was None, now set to True since model is quantized.
DEBUG:bitsandbytes.cextension:Loading bitsandbytes native library from: /opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.41it/s]
ERROR:root:Error loading tokenizer: data did not match any variant of untagged enum ModelWrapper at line 1251003 column 3
Traceback (most recent call last):
  File "/opt/ml/inference.py", line 70, in <module>
    tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)
  File "/opt/conda/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 897, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2271, in from_pretrained
    return cls._from_pretrained(
  File "/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2505, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 115, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: data did not match any variant of untagged enum ModelWrapper at line 1251003 column 3
Any suggestions or inputs on how to avoid this error?
My SageMaker endpoint deployment script is below:
import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 300

# retrieve the llm image uri (commented out; using an explicit image uri below instead)
# llm_image = get_huggingface_llm_image_uri(
#     "huggingface",
#     version="0.8.2"
# )
# this would print: llm image uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.0-tgi0.8.2-gpu-py39-cu118-ubuntu20.04

s3_model_uri = "s3uri"
# llm_image = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124"
llm_image = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04-v1.0"

# print ecr image uri
print(f"llm image uri: {llm_image}")
# Define Model and Endpoint configuration parameter
config = {
'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text
'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
'HF_MODEL_QUANTIZE': "bitsandbytes",# Comment in to quantize
}
# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
role=role,
image_uri=llm_image,
model_data=s3_model_uri,
env=config
)
endpoint_name="model-ptq-test-llama-8b-v1"
llm = llm_model.deploy(
initial_instance_count=1,
instance_type=instance_type,
# volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model
endpoint_name=endpoint_name
)
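For context on s3_model_uri: the artifacts are the save_pretrained output from the notebook below, packaged as a model.tar.gz with the files at the root of the archive (which is how the container ends up seeing them under /opt/ml/model) and uploaded to S3. Roughly, with the bucket/prefix names being placeholders:

import tarfile
from sagemaker.s3 import S3Uploader

model_dir = "/home/ec2-user/SageMaker/saved_models/Llama-3.1-8B-Instruct-test1"

# package the saved model/tokenizer files at the root of the archive
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add(model_dir, arcname=".")

# the returned URI is what s3_model_uri above points to
s3_model_uri = S3Uploader.upload("model.tar.gz", "s3://<my-bucket>/llama-3.1-8b-ptq")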
It works fine locally in a Jupyter notebook.
Code snippets:
!pip install bitsandbytes==0.44.1
!pip install accelerate==0.34.2
!pip install transformers==4.44.2
!pip install peft==0.13.1
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from transformers import LlamaForCausalLM, LlamaTokenizer
from huggingface_hub import login
token="hugging Face token"
login(token=token, add_to_git_credential=True)
# model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"
model_id = "meta-llama/Llama-3.1-8B-Instruct"
#model_id = "meta-llama/Llama-3.1-8B"
#model_id = "google-t5/t5-small"
cache_dir = "/home/ec2-user/SageMaker/huggingface_cache"
# model_id = "facebook/opt-350m"
model_id = "meta-llama/Llama-3.1-8B-Instruct"
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
device_map="auto"
)
model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
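The quantized model and the tokenizer are then saved to the local directory that the Docker container mounts as /opt/ml/model (the same path the tokenizer test below loads from). Roughly:

save_dir = "/home/ec2-user/SageMaker/saved_models/Llama-3.1-8B-Instruct-test1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_4bit.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)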
test:
tokenizer = AutoTokenizer.from_pretrained("/home/ec2-user/SageMaker/saved_models/Llama-3.1-8B-Instruct-test1")
# Prepare your input text
# input_text = "Translate the following English text to French: 'Hello, how are you?'"
input_text = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>Translate the following English text to French:
'Hello, how are you?'<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
#input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
print(input_ids)
output = model_4bit.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Output:
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
{'input_ids': tensor([[128000, 128000, 128006,    882, 128007,  28573,    279,   2768,   6498,
           1495,    311,   8753,    512,      6,   9906,     11,   1268,    527,
            499,  20837, 128009, 128006,  78191, 128007]], device='cuda:0'),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
userTranslate the following English text to French:
'Hello, how are you?'assistant
"Bonjour, comment vas-tu?"
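Once the endpoint does come up, the plan is to invoke it with the usual TGI request shape via the predictor returned by deploy() (the max_new_tokens/temperature values here are just illustrative):

prompt = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>Translate the following English text to French:
'Hello, how are you?'<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

response = llm.predict({
    "inputs": prompt,
    "parameters": {"max_new_tokens": 64, "temperature": 0.1}
})
print(response)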