Error loading tokenizer: data did not match any variant of untagged enum ModelWrapper at line 1251003 column 3

I have quantized the meta-llama/Llama-3.1-8B-Instruct model using BitsAndBytesConfig. However, when I try deploying it to a SageMaker endpoint, it throws an error. I triaged this in a Docker container locally and reproduced the same issue. Sharing my Dockerfile and the working Jupyter notebook snippets for reference:
Dockerfile:

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04-v1.0

COPY requirements.txt .

RUN pip install -r requirements.txt

# Set the working directory
WORKDIR /opt/ml

# Copy your script into the container
COPY inference.py .

# Install your dependencies
# RUN pip install --upgrade "transformers>=4.46.0" accelerate bitsandbytes peft

# Run the script (if you want it to execute automatically)
ENTRYPOINT ["python", "inference.py"]

requirements.txt :

transformers==4.44.2
accelerate==0.34.2
bitsandbytes==0.44.1
peft==0.13.1

Error:
sh-4.2$ docker run -it --gpus all -v /home/ec2-user/SageMaker/saved_models/Llama-3.1-8B-Instruct-test1:/opt/ml/model -e HF_MODEL_ID=/opt/ml/model 8b:v1 --model-id /opt/ml/model --quantize bitsandbytes
Unused kwargs: ['device_map']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
/opt/conda/lib/python3.9/site-packages/transformers/quantizers/auto.py:174: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.
warnings.warn(warning_msg)
`low_cpu_mem_usage` was None, now set to True since model is quantized.
DEBUG:bitsandbytes.cextension:Loading bitsandbytes native library from: /opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.41it/s]
ERROR:root:Error loading tokenizer: data did not match any variant of untagged enum ModelWrapper at line 1251003 column 3
Traceback (most recent call last):
  File "/opt/ml/inference.py", line 70, in <module>
    tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)
  File "/opt/conda/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 897, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2271, in from_pretrained
    return cls._from_pretrained(
  File "/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2505, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 115, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: data did not match any variant of untagged enum ModelWrapper at line 1251003 column 3

Any suggestions or inputs on how to avoid this error?

My SageMaker endpoint deployment script is below:

import json
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 300


# retrieve the llm image uri (commented out in favor of pinning the ECR image below)
# llm_image = get_huggingface_llm_image_uri(
#   "huggingface",
#   version="0.8.2"
# )
# this would print: llm image uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.0-tgi0.8.2-gpu-py39-cu118-ubuntu20.04

s3_model_uri = "s3uri"

# llm_image = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124"
llm_image = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04-v1.0"

# print ecr image uri
print(f"llm image uri: {llm_image}")

# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
  'HF_MODEL_QUANTIZE': "bitsandbytes",# Comment in to quantize
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  model_data=s3_model_uri,
  env=config
)

endpoint_name="model-ptq-test-llama-8b-v1"

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
    container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model
    endpoint_name=endpoint_name
)
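
Once the endpoint is in service, the TGI container accepts the usual inputs/parameters payload; a minimal sketch of an invocation (the prompt and generation parameters are only illustrative):

# Sketch: invoke the deployed TGI endpoint with the standard payload shape
# (prompt and max_new_tokens here are examples, not from my actual test).
payload = {
    "inputs": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>Translate the following English text to French: 'Hello, how are you?'<|eot_id|><|start_header_id|>assistant<|end_header_id|>",
    "parameters": {"max_new_tokens": 50},
}

response = llm.predict(payload)
print(response)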

It works fine locally in a Jupyter notebook. Code snippets:

!pip install bitsandbytes==0.44.1
!pip install accelerate==0.34.2
!pip install transformers==4.44.2
!pip install peft==0.13.1
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from transformers import LlamaForCausalLM, LlamaTokenizer
from huggingface_hub import login

token="hugging Face token"

login(token=token, add_to_git_credential=True)

# model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"
model_id = "meta-llama/Llama-3.1-8B-Instruct"
#model_id = "meta-llama/Llama-3.1-8B"
#model_id = "google-t5/t5-small"

cache_dir = "/home/ec2-user/SageMaker/huggingface_cache"

# model_id = "facebook/opt-350m"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# device_map is an argument of from_pretrained, not of BitsAndBytesConfig;
# passing it to the config is what triggers the "Unused kwargs" warning above
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
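
The save step I used isn't shown above; a minimal sketch of it, writing to the directory that the later tests and the docker run mount from:

# Sketch: persist the quantized model and its tokenizer so the directory can be
# mounted into the container or packaged for S3 (path taken from the tests below).
save_dir = "/home/ec2-user/SageMaker/saved_models/Llama-3.1-8B-Instruct-test1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_4bit.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)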

test:

tokenizer = AutoTokenizer.from_pretrained("/home/ec2-user/SageMaker/saved_models/Llama-3.1-8B-Instruct-test1")

# Prepare your input text
# input_text = "Translate the following English text to French: 'Hello, how are you?'"

input_text = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>Translate the following English text to French:
'Hello, how are you?'<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""


#input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
print(input_ids)

output = model_4bit.generate(**input_ids, max_new_tokens=10)

print(tokenizer.decode(output[0], skip_special_tokens=True))

Output:
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
{'input_ids': tensor([[128000, 128000, 128006, 882, 128007, 28573, 279, 2768, 6498,
1495, 311, 8753, 512, 6, 9906, 11, 1268, 527,
499, 20837, 128009, 128006, 78191, 128007]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
device='cuda:0')}
userTranslate the following English text to French:
'Hello, how are you?'assistant

"Bonjour, comment vas-tu?"

I heard that updating the library may fix the problem.
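
For context, this error usually means the tokenizer.json in the model directory was written by a newer `tokenizers` release than the one installed in the container, so the older parser cannot deserialize it. A rough check you could run inside the container (a sketch; /opt/ml/model is the mount point from the docker run above):

# Sketch: try to parse the saved tokenizer.json with the container's tokenizers
# version; an outdated version fails with the same "untagged enum ModelWrapper"
# error, which upgrading transformers/tokenizers in the image typically fixes.
import tokenizers
from tokenizers import Tokenizer

print("tokenizers version:", tokenizers.__version__)
try:
    Tokenizer.from_file("/opt/ml/model/tokenizer.json")
    print("tokenizer.json parsed OK")
except Exception as e:
    print("parse failed:", e)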

Thank you @John6666 for your response. Yes, it works fine after updating the transformers version on both of the ECR images in a local test:

  1. huggingface-pytorch-tgi-inference:
    763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04-v1.0

  2. pytorch-inference
    763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.4.0-gpu-py311-cu124-ubuntu22.04-sagemaker

The issue is, I'm unable to deploy this model on AWS SageMaker.

Errors seen from the container logs:

I do see container crash logs again for both ECR images:
2024-10-10T19:22:25,964 [WARN ] W-9003-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died.
2024-10-10T19:22:25,968 [INFO ] W-9003-model_1.0-stdout MODEL_LOG - File "/opt/ml/model/code/inference.py", line 2, in <module>
2024-10-10T19:22:25,968 [INFO ] W-9003-model_1.0-stdout MODEL_LOG - from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
2024-10-10T19:22:25,968 [INFO ] W-9003-model_1.0-stdout MODEL_LOG - ModuleNotFoundError: No module named 'transformers'

With pytorch-inference I used the following script:

Directory Structure:
model.tar.gz

|- model artifacts 
|- code/
   |- inference.py         # Your inference script
   |- requirements.txt     # Optional, used to install additional dependencies (if supported by your framework version)
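
For reference, a sketch of how this archive could be produced (the local path is assumed from earlier in this post); the contents of code/ are listed below:

# Sketch (paths assumed): package the saved model plus code/ into model.tar.gz
# with the layout shown above, then upload the archive to S3.
import tarfile

src_dir = "/home/ec2-user/SageMaker/saved_models/Llama-3.1-8B-Instruct-test1"

with tarfile.open("model.tar.gz", "w:gz") as tar:
    # arcname="." keeps the model artifacts at the archive root and code/ as a subdirectory
    tar.add(src_dir, arcname=".")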

file: requirements.txt

transformers>=4.45
accelerate==0.34.2
bitsandbytes==0.44.1
peft==0.13.1

file: inference.py

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import logging

# Enable logging
logging.basicConfig(level=logging.INFO)

# Model loading function
def model_fn(model_dir):
    # Load the model
    try:
        model_4bit = AutoModelForCausalLM.from_pretrained(model_dir)
    except Exception as e:
        logging.error(f"Error loading model: {e}")
        raise

    # Load the tokenizer
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)
    except Exception as e:
        logging.error(f"Error loading tokenizer: {e}")
        raise

    return model_4bit, tokenizer


# Prediction function
def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer

    # Logging the input
    logging.info(f"Received input: {data}")

    input_text = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>{data['inputs']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

    # Tokenize input and move it to the correct device (GPU/CPU)
    input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)

    # Generate text using the model
    with torch.no_grad():
        output = model.generate(**input_ids, max_new_tokens=50)

    # Decode the output tokens back to text
    result = tokenizer.decode(output[0], skip_special_tokens=True)

    # Logging the generated output
    logging.info(f"Generated output: {result}")

    return result
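
Since predict_fn expects a dict with an "inputs" key, an optional pair of serving handlers can make the JSON contract explicit; a sketch (not part of the script above):

# Sketch: optional handlers for the PyTorch serving stack that make the JSON
# request/response contract with predict_fn explicit.
import json

def input_fn(request_body, request_content_type):
    # Expect {"inputs": "..."} as sent by the client
    if request_content_type == "application/json":
        return json.loads(request_body)
    raise ValueError(f"Unsupported content type: {request_content_type}")

def output_fn(prediction, accept):
    return json.dumps({"generated_text": prediction})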

Deployment script for PyTorchModel:

import sagemaker
from sagemaker.pytorch import PyTorchModel

# Define the IAM role and model location in S3
model_data = "s3://compressed-model/Llama-3.1-8B-Instruct-test1-pytorch.tar.gz"
image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.4.0-gpu-py311-cu124-ubuntu22.04-sagemaker"
# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 300
endpoint_name="model-ptq-llama-8b-test1"

# Create a PyTorchModel instance
pytorch_model = PyTorchModel(
    model_data=model_data,
    role=role,
    entry_point="inference.py",
    image_uri=image_uri
)

# Deploy the model to an endpoint
predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type, # Choose an appropriate instance type
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=health_check_timeout # 10 minutes to be able to load the model
)
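
Once deployed, the endpoint can be exercised like this (a sketch; the serializer/deserializer choices assume the JSON handlers sketched earlier):

# Sketch: invoke the deployed endpoint with a JSON payload matching the
# {"inputs": ...} shape that predict_fn expects.
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

result = predictor.predict({"inputs": "Translate the following English text to French: 'Hello, how are you?'"})
print(result)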

Deployment script for HuggingFaceModel:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# Define the IAM role and model location in S3
model_data = "s3://compressed-model/Llama-3.1-8B-Instruct-test1-pytorch.tar.gz"
image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04-v1.0"
# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 300
endpoint_name="model-ptq-llama-8b-test3"


llm_model = HuggingFaceModel(
    model_data=model_data,
    role=role,
    entry_point="inference.py",
    source_dir="code",  # Ensure the path to the `code/` directory is specified
    image_uri=image_uri
)
# Deploy the model to an endpoint
predictor = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type, # Choose an appropriate instance type
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=health_check_timeout # 10 minutes to be able to load the model
)

ModuleNotFoundError: No module named 'transformers'

Definitely something wrong on the SageMaker side…