Error loading finetuned llama2 model while running inference

marmikpandya · July 31, 2023, 4:14pm

I’ve fine-tuned Llama 2 on a custom dataset following the blog post. After fine-tuning, the model deploys to a SageMaker endpoint, but when I run inference it fails with a "could not load model" error.

Error:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "Could not load model /opt/ml/model with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForCausalLM\u0027\u003e, \u003cclass \u0027transformers.models.llama.modeling_llama.LlamaForCausalLM\u0027\u003e)."
}

The following config was used for deployment:

huggingface_model = HuggingFaceModel(
    model_data=huggingface_estimator.model_data,
    role=role,
    transformers_version='4.28',
    pytorch_version='2.0',
    py_version='py310',
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.4xlarge',
)

Thanks.

philschmid · July 31, 2023, 4:45pm

You are passing a string, "huggingface_estimator.model_data".

marmikpandya · July 31, 2023, 5:29pm

Hey, thanks for the quick response. Sorry about that; I had actually tried a lot of things, including passing the S3 path explicitly. The string is there because I was masking the S3 path, which contains my account number.

guiba44 · August 2, 2023, 7:46am

Hi there, I have the same error, using the same code for deployment. I am not able to run inference on the endpoint deployed on SageMaker after fine-tuning. The model I fine-tuned is Llama-2-13b-hf.

marmikpandya · August 2, 2023, 10:03am

To anyone else facing this problem: it works totally fine on a plain old EC2 instance with TGI v1.0.0. That is probably because Text Generation Inference added support for Llama 2 in v0.9.3, while the SageMaker Python SDK only recognises versions up to 0.8.2. I used a g4dn.12xlarge instance.

Neelesh1121 · August 14, 2023, 2:40pm

@marmikpandya, could you please explain what you mean by a plain old EC2 instance with TGI v1.0.0?
I tried an ml.t3.xlarge instance for creating the SageMaker endpoint for the fine-tuned Llama model, with sagemaker version 2.177.0, but I still get exactly the same error you posted.

abeiler · August 21, 2023, 3:06pm

Hi Everyone! I’m having the same problem…
So it sounds like the SageMaker Python SDK doesn’t have the most up-to-date Text Generation Inference version that is needed for LLaMA 2. Can we get around this by deploying directly from the AWS Console, or is there any way to use the sagemaker and huggingface packages to deploy without building an EC2 instance?

I’m also following the example linked in the original question. After hitting this issue with my adaptation of it, I am currently trying to follow the example as-is.

Thanks!

abeiler · August 21, 2023, 5:07pm

Alright, I finally got it working! Another discussion about the same issue got me there (QLoRA trained LLaMA2 13B deployment error on Sagemaker using text generation inference image).

Here’s what I did:

Instead of deploying directly after tuning, I created a HuggingFace Model from the S3 archive of my tuned model.
I hardcoded the following image_uri instead of pulling it with get_huggingface_llm_image_uri(), which, at least a few weeks ago, wasn’t returning the most up-to-date version that supports LLaMA-2:

image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04-v1.0"

Used the following Configuration Parameters:

import json

config = {
  'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
  'SM_NUM_GPUS': json.dumps(1), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
  'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192), # Max total tokens across all requests in a batch
}
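
The values in this config depend on each other. A minimal sketch of a helper that checks them before deployment, assuming (this is not stated in the thread) that TGI wants MAX_INPUT_LENGTH strictly below MAX_TOTAL_TOKENS and MAX_BATCH_TOTAL_TOKENS at least as large as MAX_TOTAL_TOKENS. build_tgi_env is a hypothetical name, not part of the sagemaker SDK.

```python
import json


def build_tgi_env(model_path, num_gpus, max_input_length,
                  max_total_tokens, max_batch_total_tokens):
    """Return the container env dict with every value serialized as a string."""
    # Assumed constraint: the prompt budget must leave room for generated tokens.
    if max_input_length >= max_total_tokens:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    # Assumed constraint: a batch must be able to hold one full-length request.
    if max_batch_total_tokens < max_total_tokens:
        raise ValueError("MAX_BATCH_TOTAL_TOKENS must be >= MAX_TOTAL_TOKENS")
    return {
        "HF_MODEL_ID": model_path,
        "SM_NUM_GPUS": json.dumps(num_gpus),
        "MAX_INPUT_LENGTH": json.dumps(max_input_length),
        "MAX_TOTAL_TOKENS": json.dumps(max_total_tokens),
        "MAX_BATCH_TOTAL_TOKENS": json.dumps(max_batch_total_tokens),
    }


# Same numbers as the config above.
config = build_tgi_env("/opt/ml/model", 1, 1024, 2048, 8192)
print(config)
```

Environment variables are strings, which is why the numbers go through json.dumps.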

Create the Model

from sagemaker.huggingface import HuggingFaceModel

s3_model_uri = "s3://{your_path_here}/output/model.tar.gz"
instance_type = "ml.g5.4xlarge"

llm_model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    model_data=s3_model_uri,
    env=config
)

Deployed

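health_check_timeout = 600  # seconds; not defined in the original post, assumed from the "10 minutes" comment below

llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model
)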

Ran Inference

data = {
   "inputs": "What is the Capital of California."
}

payload = {
  "inputs":  json.dumps(data),
  "parameters": {
    "top_p": 0.6,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 512,
    "repetition_penalty": 1.03,
  }
}

# send request to endpoint
response = llm.predict(payload)

print(response[0]["generated_text"])
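
Note that json.dumps(data) in the payload above sends the whole JSON object, braces and key included, as the prompt string. A minimal sketch of the same request with the prompt passed directly, assuming the endpoint takes a string under "inputs" and returns a list of dicts with a generated_text key, as the print call above implies. build_payload and extract_text are hypothetical helper names.

```python
def build_payload(prompt, max_new_tokens=512):
    """Build a request body with the prompt as a plain string."""
    return {
        "inputs": prompt,
        "parameters": {
            "top_p": 0.6,
            "temperature": 0.9,
            "top_k": 50,
            "max_new_tokens": max_new_tokens,
            "repetition_penalty": 1.03,
        },
    }


def extract_text(response):
    """Pull the generated text out of a response shaped like [{"generated_text": ...}]."""
    return response[0]["generated_text"]


payload = build_payload("What is the capital of California?")
# response = llm.predict(payload)   # needs the deployed endpoint from above
# print(extract_text(response))
```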

Now I’ll be trying to replicate this with a model tuned on my own data!

Feel free to reach out if anyone has Qs on this.

guiba44 · August 22, 2023, 7:31am

That’s an amazingly helpful answer, thank you very much for sharing your code

guiba44 · August 24, 2023, 8:57am

Did you manage to run it with your own data? I encountered the error reported in this thread when trying to deploy mine: QLoRA trained LLaMA2 13B deployment error on Sagemaker using text generation inference image - #12 by rycfung…

[EDIT] I managed to run it on my own model: for a Llama2 13B, you need to deploy on an ml.g5.12xlarge (which is a bit weird considering you can run inference in a notebook deployed on an ml.g5.2xlarge).
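
A rough back-of-the-envelope check may explain the instance sizes. This is a sketch under several assumptions: fp16 weights at 2 bytes per parameter, 24 GB of memory per A10G GPU, one GPU on ml.g5.2xlarge and four on ml.g5.12xlarge. It ignores the KV cache and activations, and it assumes the notebook run used 4-bit quantized weights (as in QLoRA), which this thread does not confirm.

```python
A10G_MEMORY_GB = 24  # assumed per-GPU memory on g5 instances


def weight_memory_gb(params_billion, bytes_per_param):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def weights_fit(weight_gb, num_gpus):
    """True if the weights alone fit in the combined GPU memory."""
    return weight_gb < num_gpus * A10G_MEMORY_GB


fp16_gb = weight_memory_gb(13, 2)    # 13B parameters in fp16
int4_gb = weight_memory_gb(13, 0.5)  # 13B parameters in 4-bit

print("fp16 on 1 GPU: ", weights_fit(fp16_gb, 1))
print("fp16 on 4 GPUs:", weights_fit(fp16_gb, 4))
print("4-bit on 1 GPU:", weights_fit(int4_gb, 1))
```

If that reasoning holds, sharding across the four GPUs would presumably also need SM_NUM_GPUS set to 4 rather than 1.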

jeremydd · September 15, 2023, 11:27am

I have had the same issue:

I fine-tuned meta-llama/Llama-2-7b-chat-hf on SageMaker according to: Fine-tune LLaMA 2 (7-70B) on Amazon SageMaker
The model.tar.gz file is in my S3 bucket
I tried to deploy using the code below
I get the following error when trying to do inference: "Could not load model /opt/ml/model with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForCausalLM\u0027\u003e, \u003cclass \u0027transformers.models.llama.modeling_llama.LlamaForCausalLM\u0027\u003e)."

@philschmid has there been any resolution here?

import json
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.9.3"
)

# sagemaker config
instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 300

# Define Model and Endpoint configuration parameter
config = {
    'HF_MODEL_ID': '/opt/ml/model',
    'SM_NUM_GPUS': json.dumps(number_of_gpu),
    'MAX_INPUT_LENGTH': json.dumps(3072),
    'MAX_TOTAL_TOKENS': json.dumps(4096),
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),
}

# create Hugging Face Model Class
llm_model = HuggingFaceModel(
    model_data= "s3://sagemaker-eu-west-2-688604995696/content-extraction-huggingface-qlora-20-2023-09-14-14-11-27-058/output/model.tar.gz",
    role=role,
    image_uri=llm_image,  # 763104351884.dkr.ecr.eu-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04
    env=config
)

# deploy model to SageMaker Inference
llm = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)
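
One detail in the snippet above may be worth checking: the model is created as llm_model, but deploy is called on huggingface_model. If huggingface_model is still defined from an earlier notebook cell without the TGI image_uri, the endpoint would presumably run the default Hugging Face inference container instead, which might explain an error that mentions AutoModelForCausalLM. A minimal sketch of a guard, assuming the model object exposes the image_uri it was constructed with. deploy_tgi_model is a hypothetical helper, not part of the sagemaker SDK.

```python
def deploy_tgi_model(model, instance_type, health_check_timeout=300):
    """Deploy the given model object, refusing anything that is not a TGI image."""
    # Assumption: the model keeps its constructor argument as `image_uri`.
    image_uri = getattr(model, "image_uri", None) or ""
    if "tgi" not in image_uri:
        raise ValueError(f"Expected a TGI image, got image_uri={image_uri!r}")
    return model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        container_startup_health_check_timeout=health_check_timeout,
    )


# llm = deploy_tgi_model(llm_model, instance_type, health_check_timeout)
```

Passing the model in as an argument means the object that gets checked is the same one that gets deployed.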

philschmid · September 15, 2023, 12:05pm

Could you share how your model.tar.gz looks?
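
One way to list the archive contents, as a minimal sketch using only the standard library. It assumes a local copy of model.tar.gz and, as an assumption not confirmed in this thread, that the container looks for config.json at the archive root and needs merged weights rather than only LoRA adapter files (adapter_config.json). summarize_archive is a hypothetical helper.

```python
import os
import tarfile
from pathlib import PurePosixPath


def summarize_archive(path):
    """List regular files in a .tar.gz and flag two common layout problems."""
    with tarfile.open(path, "r:gz") as tar:
        names = [member.name for member in tar.getmembers() if member.isfile()]
    # Normalise "./config.json" to "config.json" before comparing.
    files = sorted(str(PurePosixPath(name)) for name in names)
    has_config = "config.json" in files
    return {
        "files": files,
        "config_at_root": has_config,
        # Adapter config present but no base model config at the root.
        "adapter_only": "adapter_config.json" in files and not has_config,
    }


if __name__ == "__main__" and os.path.exists("model.tar.gz"):
    print(summarize_archive("model.tar.gz"))
```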

jeremydd · September 15, 2023, 12:19pm

Sure, these are the files after extraction @philschmid:

jeremydd · September 15, 2023, 2:11pm

@philschmid a note here: I am able to deploy and run inference with the fine-tuned model on a g5.2xlarge EC2 instance with no problem by installing the latest transformers[torch], sentencepiece, and protobuf and running:

from transformers import AutoTokenizer
import transformers
import torch

model = "<PATH_TO_MODEL_FILES>"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    '<PROMPT>',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

It also works when loading the model into the LlamaForCausalLM class, which makes me think there might just be a version issue in the LLM DLC?

philschmid · September 18, 2023, 7:17am

Can you try the latest version of the LLM container 1.0.3?

jeremydd · September 18, 2023, 11:33am

@philschmid I’ve tried again with the new image and I’m getting exactly the same error: "Could not load model /opt/ml/model with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForCausalLM\u0027\u003e, \u003cclass \u0027transformers.models.llama.modeling_llama.LlamaForCausalLM\u0027\u003e)."

Deployment:

Inference:

jeremydd · September 18, 2023, 11:37am

@philschmid would it help if I shared my model.tar.gz file with you (fine-tuned Llama-2-7b-chat-hf)? It’s just a test model so I’m happy to if that would be helpful.

jeremydd · September 18, 2023, 2:59pm

@philschmid FYI - I got the same error trying to manually deploy on Sagemaker with transformers==4.28.1 but updating to transformers==4.33.2 solves the issue.
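
A minimal sketch for checking the installed transformers version against the two data points in this thread (4.28.1 failed, 4.33.2 worked). The exact minimum version that loads Llama 2 is not stated here, so the threshold below is simply the known-good version. parse_version and at_least_known_good are hypothetical helpers.

```python
import re
from importlib import metadata

KNOWN_GOOD = (4, 33, 2)  # version reported as working in this thread


def parse_version(text):
    """Turn '4.33.2' (or '4.40.0rc1') into a tuple of leading integers."""
    parts = []
    for part in text.split(".")[:3]:
        match = re.match(r"\d+", part)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def at_least_known_good(version_text):
    return parse_version(version_text) >= KNOWN_GOOD


try:
    installed = metadata.version("transformers")
    print(installed, at_least_known_good(installed))
except metadata.PackageNotFoundError:
    print("transformers is not installed in this environment")
```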

Mit1208 · September 18, 2023, 8:05pm

Thanks, I tried with TGI v1.0 and it worked.

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="1.0"
)

jeremydd · September 19, 2023, 11:44am

@Mit1208 Is your model fine-tuned?
