I’ve fine-tuned Llama 2 on a custom dataset following the blog post. After fine-tuning, the model gets deployed to a SageMaker endpoint, but when I run inference it throws a “could not load model” error.
Error:
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
"code": 400,
"type": "InternalServerException",
"message": "Could not load model /opt/ml/model with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForCausalLM\u0027\u003e, \u003cclass \u0027transformers.models.llama.modeling_llama.LlamaForCausalLM\u0027\u003e)."
}
Hey, thanks for the quick response. Sorry about that; I’ve actually tried a lot of things, including explicitly passing the S3 path. Hence the placeholder string: I was trying to mask the S3 path because it contains my account number.
Hi there, I have the same error, using the same code for deployment. I am not able to run inference on the SageMaker endpoint after fine-tuning. The model I fine-tuned is Llama-2-13b-hf.
To anyone else facing this problem: it works totally fine on a plain old EC2 instance with TGI v1.0.0. That is likely because Text Generation Inference (TGI) added support for Llama 2 in v0.9.3, while the SageMaker Python SDK only recognizes image versions up to 0.8.2. I used a g4dn.12xlarge instance.
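A quick way to check which TGI container your installed SDK resolves (a minimal sketch; the region here is just an example):

from sagemaker.huggingface import get_huggingface_llm_image_uri

# With no explicit version, the SDK returns the newest TGI image it knows about.
# If this prints a 0.8.x image, the installed sagemaker package is too old for
# Llama 2 -- upgrade it, or hardcode a newer image URI when creating the model.
print(get_huggingface_llm_image_uri("huggingface", region="us-east-1"))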
@marimakpandya, could you please explain what you mean by a “plain old EC2 instance with TGI v1.0.0”?
Tried an ml.t3.xlarge instance for creating the fine-tuned Llama SageMaker endpoint, with sagemaker version 2.177.0, but I’m still getting the exact same error you posted.
Hi Everyone! I’m having the same problem…
So it sounds like the SageMaker Python SDK doesn’t resolve the up-to-date Text Generation Inference (TGI) image that Llama 2 needs. Can we get around this by deploying directly from the AWS Console, or is there a way to use the sagemaker and huggingface packages to deploy without setting up an EC2 instance?
I’m also following the example linked in the original question; after hitting this issue with my adaptation of it, I’m currently trying to follow the example as-is.
Instead of deploying directly after tuning, I created a HuggingFaceModel from the S3 archive of my tuned model.
Hardcoded the image_uri instead of pulling it with get_huggingface_llm_image_uri(), which at least a few weeks ago wasn’t returning the most up-to-date version that supports Llama 2, and used the following config (the model construction is sketched below with the URI and S3 path masked).
import json

config = {
    'HF_MODEL_ID': "/opt/ml/model",              # path where SageMaker extracts the model archive
    'SM_NUM_GPUS': json.dumps(1),                # number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(1024),        # max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),        # max length of the generation (including input text)
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),
}
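# (Sketch, not from the original post: the real image URI and S3 path are masked
# placeholders, and the instance settings below are assumptions. This is roughly
# how llm_model is created, hardcoding a TGI >= 0.9.3 image instead of calling
# get_huggingface_llm_image_uri().)
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()                # or an explicit IAM role ARN
image_uri = "<TGI_LLM_DLC_IMAGE_URI>"                # placeholder for the hardcoded DLC image
s3_model_uri = "s3://<BUCKET>/<PATH>/model.tar.gz"   # placeholder for the tuned model archive

llm_model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    model_data=s3_model_uri,
    env=config,
)

instance_type = "ml.g5.2xlarge"   # assumption; the 13B model needed ml.g5.12xlarge (see the edit below)
health_check_timeout = 600        # 10 minutes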
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # 10 minutes to be able to load the model
)
Ran Inference
# TGI expects the prompt string directly under "inputs" (no need to json.dumps a nested dict)
payload = {
    "inputs": "What is the capital of California?",
    "parameters": {
        "top_p": 0.6,
        "temperature": 0.9,
        "top_k": 50,
        "max_new_tokens": 512,
        "repetition_penalty": 1.03,
    },
}
# send request to endpoint
response = llm.predict(payload)
print(response[0]["generated_text"])
Now I’ll be trying to replicate this with a model tuned on my own data!
[EDIT] I managed to run it on my own model: for a Llama 2 13B, you need to deploy on an ml.g5.12xlarge (which is a bit odd, considering you can run inference in a notebook on an ml.g5.2xlarge).
I get the following error when trying to do inference: "Could not load model /opt/ml/model with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>)."
@philschmid a note here: I am able to deploy and run inference with the fine-tuned model on a g5.2xlarge EC2 instance with no problem by installing the latest transformers[torch], sentencepiece, and protobuf and running:
from transformers import AutoTokenizer
import transformers
import torch

model = "<PATH_TO_MODEL_FILES>"  # local directory containing the fine-tuned weights

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    '<PROMPT>',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
It also works when loading the model with the LlamaForCausalLM class directly, which makes me think there might just be a version issue in the LLM DLC?
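For reference, a minimal sketch of what I mean by loading with LlamaForCausalLM directly (the path and prompt are placeholders, same as above):

import torch
from transformers import AutoTokenizer, LlamaForCausalLM

model_path = "<PATH_TO_MODEL_FILES>"  # placeholder: local directory with the fine-tuned weights
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",  # requires accelerate
)

inputs = tokenizer("<PROMPT>", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))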
@philschmid I’ve tried again with the new image and I’m getting exactly the same error: "Could not load model /opt/ml/model with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>)."
@philschmid would it help if I shared my model.tar.gz file with you (fine-tuned Llama-2-7b-chat-hf)? It’s just a test model, so I’m happy to if that would be helpful.
@philschmid FYI - I got the same error trying to manually deploy on SageMaker with transformers==4.28.1, but updating to transformers==4.33.2 solves the issue.
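In case it helps anyone else, here is a rough sketch of what “updating” can look like for a manual (non-TGI) Hugging Face DLC deployment: shipping a code/requirements.txt inside model.tar.gz so the inference toolkit installs the newer transformers at container start. The directory name is a placeholder, and the exact layout is my assumption about how the archive is usually structured:

# Hedged sketch: add code/requirements.txt next to the model files and repack
# the archive; "my-finetuned-llama" is a placeholder for the weights directory.
import os
import tarfile

model_dir = "my-finetuned-llama"
os.makedirs(os.path.join(model_dir, "code"), exist_ok=True)
with open(os.path.join(model_dir, "code", "requirements.txt"), "w") as f:
    f.write("transformers==4.33.2\n")

# Repack with the model files at the top level of the archive.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    for name in os.listdir(model_dir):
        tar.add(os.path.join(model_dir, name), arcname=name)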