Error loading finetuned llama2 model while running inference

I’ve fine-tuned Llama 2 on a custom dataset following the blog post. After fine-tuning, the model deploys to a SageMaker endpoint, but running inference fails with a "Could not load model" error.


ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "Could not load model /opt/ml/model with any of the following classes: (\u003cclass \\u0027\u003e, \u003cclass \u0027transformers.models.llama.modeling_llama.LlamaForCausalLM\u0027\u003e)."

Following config was used for deployment

huggingface_model = HuggingFaceModel(  

predictor = huggingface_model.deploy(  



You are passing the string "huggingface_estimator.model_data" rather than the value of huggingface_estimator.model_data.
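A quick sanity check before deploying can catch this kind of mix-up. The sketch below is only an illustration: the helper name `looks_like_s3_model_uri` and the bucket path are hypothetical, and it assumes the model is passed as a compressed `model.tar.gz` archive on S3.

```python
from urllib.parse import urlparse


def looks_like_s3_model_uri(value):
    """Return True if value looks like an S3 URI pointing at a .tar.gz archive."""
    parsed = urlparse(value)
    return (
        parsed.scheme == "s3"
        and bool(parsed.netloc)  # bucket name is present
        and parsed.path.endswith(".tar.gz")
    )


# A quoted variable name is just a string, not an S3 location.
print(looks_like_s3_model_uri("huggingface_estimator.model_data"))
# Hypothetical bucket and key, for illustration only.
print(looks_like_s3_model_uri("s3://my-bucket/output/model.tar.gz"))
```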


Hey, thanks for the quick response. Sorry about that. I have actually tried a lot of things, including passing the S3 path explicitly. The string is there because I was masking the S3 path, which contains my account number.


Hi there, I have the same error, using the same code for deployment. I am not able to run inference on the endpoint deployed on SageMaker after fine-tuning. The model I fine-tuned is Llama-2-13b-hf.


To anyone else facing this problem: it works totally fine on a plain old EC2 instance with TGI v1.0.0. That is probably because Text Generation Inference added support for Llama 2 in v0.9.3, while the SageMaker Python SDK only recognises versions up to 0.8.2. I used a g4dn.12xlarge instance.
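One way to see which TGI version an endpoint will actually run is to read it from the container image tag. The sketch below assumes the tag contains a `tgi<major>.<minor>.<patch>` segment; the example URI (including the account number) is hypothetical, and the 0.9.3 minimum is simply the figure quoted in this thread, not something verified here.

```python
import re

# Minimum TGI version for Llama 2, as claimed in this thread (not verified here).
LLAMA2_MIN_TGI = (0, 9, 3)


def tgi_version_from_image_uri(image_uri):
    """Extract (major, minor, patch) from a 'tgiX.Y.Z' segment, or None if absent."""
    match = re.search(r"tgi(\d+)\.(\d+)\.(\d+)", image_uri)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def supports_llama2(version):
    """Compare a parsed version tuple against the assumed minimum."""
    return version is not None and version >= LLAMA2_MIN_TGI


# Hypothetical image URI, for illustration only.
example_uri = (
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/"
    "huggingface-pytorch-tgi-inference:2.0.1-tgi0.8.2-gpu-py39-cu118-ubuntu20.04"
)
version = tgi_version_from_image_uri(example_uri)
print(version, supports_llama2(version))
```

If the check fails for the URI your SDK returns, that would be consistent with the version gap described above.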


@marimakpandya, could you explain what you mean by a plain old EC2 instance with TGI v1.0.0?
I tried creating a SageMaker endpoint for the fine-tuned Llama model on an ml.t3.xlarge instance with sagemaker version 2.177.0, but I still get exactly the same error you posted.


Hi Everyone! I’m having the same problem…
So it sounds like the SageMaker Python SDK doesn’t have the most up-to-date Text Generation Inference version needed for LLaMA 2. Can we get around this by deploying directly from the AWS Console, or is there any way to use the sagemaker and huggingface packages to deploy without setting up an EC2 instance?

I’m also following the example linked in the original question and after having this issue with my adaptation of it, am currently trying to follow the example as-is.


Alright, I finally got it working! Another discussion about the same issue got me there (QLoRA trained LLaMA2 13B deployment error on Sagemaker using text generation inference image).

Here’s what I did:

  1. Instead of deploying directly after tuning, I created a HuggingFace Model from the S3 archive of my tuned model
  2. Hardcoded the following image_uri instead of retrieving it with get_huggingface_llm_image_uri(), which, at least a few weeks ago, did not return the most recent version with LLaMA-2 support
image_uri = ""
  3. Used the following configuration parameters:
import json

config = {
  'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
  'SM_NUM_GPUS': json.dumps(1), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
  'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),
}
  4. Created the model
s3_model_uri = "s3://{your_path_here}/output/model.tar.gz"
instance_type = "ml.g5.4xlarge"

llm_model = HuggingFaceModel(
  5. Deployed
llm = llm_model.deploy(
container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model
  6. Ran inference
data = {
   "inputs": "What is the Capital of California."
}

payload = {
  "inputs":  json.dumps(data),
  "parameters": {
    "top_p": 0.6,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 512,
    "repetition_penalty": 1.03,
  }
}
# send request to endpoint
response = llm.predict(payload)
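Since several of the snippets above are cut off, here is a minimal sketch of how the steps might fit together. It is not the exact code from this post: the function names, the `role` argument, the 600-second timeout and the default `tgi_version` are assumptions for illustration. It also assumes an SDK release recent enough to know the requested TGI version; if yours is not, pass a hardcoded `image_uri` as in step 2.

```python
import json


def build_tgi_env(num_gpus=1, max_input_length=1024,
                  max_total_tokens=2048, max_batch_total_tokens=8192):
    """Build the container environment; every value is passed as a string."""
    return {
        "HF_MODEL_ID": "/opt/ml/model",  # where SageMaker unpacks model.tar.gz
        "SM_NUM_GPUS": json.dumps(num_gpus),
        "MAX_INPUT_LENGTH": json.dumps(max_input_length),
        "MAX_TOTAL_TOKENS": json.dumps(max_total_tokens),
        "MAX_BATCH_TOTAL_TOKENS": json.dumps(max_batch_total_tokens),
    }


def deploy_finetuned_llama(s3_model_uri, role, image_uri=None,
                           tgi_version="1.0.3",
                           instance_type="ml.g5.4xlarge",
                           health_check_timeout=600):
    """Create a HuggingFaceModel from an S3 archive and deploy it (needs AWS access)."""
    # Imported lazily so build_tgi_env works without the SageMaker SDK installed.
    from sagemaker.huggingface import (
        HuggingFaceModel,
        get_huggingface_llm_image_uri,
    )

    if image_uri is None:
        # Ask the SDK for a specific TGI version instead of its default.
        image_uri = get_huggingface_llm_image_uri("huggingface", version=tgi_version)

    llm_model = HuggingFaceModel(
        role=role,                # IAM role ARN with SageMaker permissions
        image_uri=image_uri,
        model_data=s3_model_uri,  # s3://.../model.tar.gz
        env=build_tgi_env(),
    )
    return llm_model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        container_startup_health_check_timeout=health_check_timeout,
    )
```

Calling `deploy_finetuned_llama("s3://{your_path_here}/output/model.tar.gz", role)` would create a billable endpoint, so it is left out of the sketch.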


Now I’ll be trying to replicate this with a model tuned on my own data!

Feel free to reach out if anyone has Qs on this.


That’s an amazingly helpful answer, thank you very much for sharing your code :grinning:


Did you manage to run it with your own data? When I tried to deploy mine, I hit the error reported in this thread: QLoRA trained LLaMA2 13B deployment error on Sagemaker using text generation inference image (post #12 by rycfung).

[EDIT] I managed to run it on my own model: for a Llama2 13B, you need to deploy on an ml.g5.12xlarge (which is a bit weird considering you can run inference on a notebook deployed on ml.g5.2xlarge :man_shrugging:).
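A rough back-of-the-envelope estimate may explain the difference. The sketch below only counts the memory needed for the weights and ignores the KV cache and runtime overhead. It assumes roughly 24 GB per A10G GPU, one GPU on a g5.2xlarge and four on a g5.12xlarge. Whether the notebook loaded the model in 4-bit (as is common with QLoRA) is an assumption, not something stated in this thread.

```python
GIB = 1024 ** 3


def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


A10G_MEMORY_GIB = 24      # approximate memory per GPU on g5 instances
LLAMA2_13B_PARAMS = 13e9  # nominal parameter count

fp16_gib = weight_memory_gib(LLAMA2_13B_PARAMS, 2)    # 16-bit weights
int4_gib = weight_memory_gib(LLAMA2_13B_PARAMS, 0.5)  # 4-bit quantized weights

print(f"fp16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {int4_gib:.1f} GiB")
print("fits on one A10G in fp16:", fp16_gib < A10G_MEMORY_GIB)
print("fits on four A10Gs in fp16:", fp16_gib < 4 * A10G_MEMORY_GIB)
```

Under these assumptions the 16-bit weights alone exceed a single GPU, while a 4-bit copy fits comfortably, which would be consistent with the notebook working on the smaller instance.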

I have had the same issue:

  • I fine-tuned meta-llama/Llama-2-7b-chat-hf on SageMaker according to: Fine-tune LLaMA 2 (7-70B) on Amazon SageMaker
  • The model.tar.gz file is in my S3 bucket
  • I tried to deploy using the code below
  • I get the following error when trying to do inference: "Could not load model /opt/ml/model with any of the following classes: (\u003cclass \\u0027\u003e, \u003cclass \u0027transformers.models.llama.modeling_llama.LlamaForCausalLM\u0027\u003e)."

@philschmid has there been any resolution here?

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(

# sagemaker config
instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 300

# Define Model and Endpoint configuration parameter
config = {
    'HF_MODEL_ID': '/opt/ml/model',
    'SM_NUM_GPUS': json.dumps(number_of_gpu),
    'MAX_INPUT_LENGTH': json.dumps(3072),
    'MAX_TOTAL_TOKENS': json.dumps(4096),
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),
}

# create Hugging Face Model Class
llm_model = HuggingFaceModel(
    model_data= "s3://sagemaker-eu-west-2-688604995696/content-extraction-huggingface-qlora-20-2023-09-14-14-11-27-058/output/model.tar.gz",
    image_uri=llm_image,  #

# deploy model to SageMaker Inference
llm = llm_model.deploy(
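It can also be worth ruling out an inconsistent token configuration before looking elsewhere. The sketch below encodes the constraints as I understand them from the TGI launcher (input length strictly below total tokens, total tokens not above the batch total); treat those rules and the helper name `check_token_limits` as assumptions rather than documented guarantees.

```python
def check_token_limits(max_input_length, max_total_tokens, max_batch_total_tokens):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if max_input_length >= max_total_tokens:
        problems.append("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    if max_total_tokens > max_batch_total_tokens:
        problems.append("MAX_TOTAL_TOKENS must not exceed MAX_BATCH_TOTAL_TOKENS")
    return problems


# Values from the config above.
print(check_token_limits(3072, 4096, 8192))
```

The values used in this post pass the check, so the configuration itself does not look like the cause here.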

Could you share how your model.tar.gz looks?

Sure, these are the files after extraction @philschmid:
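For anyone who wants to check their own archive, the contents can be listed without extracting it. The sketch below uses only the standard library. It assumes the usual Hugging Face layout where `config.json` sits at the root of the archive, since SageMaker unpacks `model.tar.gz` into `/opt/ml/model` and `HF_MODEL_ID` points at that directory. The helper names are hypothetical.

```python
import tarfile


def list_archive_files(path):
    """Return the names of all regular files inside a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as tar:
        return [member.name for member in tar.getmembers() if member.isfile()]


def has_root_config(member_names):
    """True if config.json is at the archive root rather than in a subfolder."""
    for name in member_names:
        # tar may store root entries as 'config.json' or './config.json'
        normalized = name[2:] if name.startswith("./") else name
        if normalized == "config.json":
            return True
    return False


# Usage, assuming model.tar.gz was downloaded from S3 first:
#     names = list_archive_files("model.tar.gz")
#     print(names, has_root_config(names))
```

If `config.json` only appears under a subfolder, the container would not find it in `/opt/ml/model`, and repacking the archive from inside the model directory is one thing to try.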

@philschmid a note here: I am able to deploy and run inference with the fine-tuned model on a g5.2xlarge EC2 instance with no problem by installing the latest transformers[torch], sentencepiece, and protobuf and running:

from transformers import AutoTokenizer
import transformers
import torch


tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(

sequences = pipeline(
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

It also works when I load the model with the LlamaForCausalLM class, which makes me think there might just be a version issue in the LLM DLC.

Can you try the latest version of the LLM container, 1.0.3?

@philschmid I’ve tried again with the new image and I’m getting exactly the same error: "Could not load model /opt/ml/model with any of the following classes: (\u003cclass \\u0027\u003e, \u003cclass \u0027transformers.models.llama.modeling_llama.LlamaForCausalLM\u0027\u003e)."



@philschmid would it help if I shared my model.tar.gz file with you (fine-tuned Llama-2-7b-chat-hf)? It’s just a test model, so I’m happy to if that would be helpful.

@philschmid FYI - I got the same error trying to manually deploy on Sagemaker with transformers==4.28.1 but updating to transformers==4.33.2 solves the issue.

Thanks, I tried with TGI v1.0 and it worked.

# retrieve the llm image uri

llm_image = get_huggingface_llm_image_uri(


@Mit1208 Is your model fine-tuned?