Error loading finetuned llama2 model while running inference

Alright, I finally got it working! Another discussion about the same issue got me there (QLoRA trained LLaMA2 13B deployment error on Sagemaker using text generation inference image).

Here’s what I did:

  1. Instead of deploying directly after tuning, I created a HuggingFace Model from the S3 archive of my tuned model
  2. Hardcoded the image_uri instead of pulling it with get_huggingface_llm_image_uri(), which, at least a few weeks ago, wasn’t returning the most recent version with LLaMA-2 support:
image_uri = ""
  3. Used the following configuration parameters:
import json

config = {
  'HF_MODEL_ID': "/opt/ml/model", # path where SageMaker stores the model
  'SM_NUM_GPUS': json.dumps(1), # number of GPUs used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024), # max length of the input text
  'MAX_TOTAL_TOKENS': json.dumps(2048), # max length of the generation (including input text)
  'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192), # token budget shared by all requests in a batch
}
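Two things about this config are easy to get wrong: every value has to be a string (hence the json.dumps calls), and, as far as I understand the text generation inference launcher, MAX_INPUT_LENGTH has to be smaller than MAX_TOTAL_TOKENS, which in turn must not exceed MAX_BATCH_TOTAL_TOKENS. Here is a minimal sketch that checks both before anything is deployed. The helper name build_tgi_env is hypothetical, and the defaults are just the values from my config above.

```python
import json


def build_tgi_env(max_input_length=1024, max_total_tokens=2048,
                  max_batch_total_tokens=8192, num_gpus=1,
                  model_path="/opt/ml/model"):
    """Hypothetical helper: build the container env dict after checking the token limits."""
    # Assumption: the container refuses to start when these orderings are violated.
    if not max_input_length < max_total_tokens:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    if max_total_tokens > max_batch_total_tokens:
        raise ValueError("MAX_TOTAL_TOKENS must not exceed MAX_BATCH_TOTAL_TOKENS")
    # Environment variables are strings, so numbers are serialized with json.dumps.
    return {
        "HF_MODEL_ID": model_path,
        "SM_NUM_GPUS": json.dumps(num_gpus),
        "MAX_INPUT_LENGTH": json.dumps(max_input_length),
        "MAX_TOTAL_TOKENS": json.dumps(max_total_tokens),
        "MAX_BATCH_TOTAL_TOKENS": json.dumps(max_batch_total_tokens),
    }


config = build_tgi_env()
```

Catching a bad combination locally is much cheaper than waiting for the endpoint health check to time out.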
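  4. Created the model:
from sagemaker.huggingface import HuggingFaceModel

s3_model_uri = "s3://{your_path_here}/output/model.tar.gz"
instance_type = "ml.g5.4xlarge"

llm_model = HuggingFaceModel(
  role=role, # SageMaker execution role, assumed to be defined earlier
  image_uri=image_uri, model_data=s3_model_uri, env=config, # arguments inferred from the steps above
)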
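Since HF_MODEL_ID points at /opt/ml/model, which is where SageMaker unpacks model.tar.gz, the model files (config.json, weights, tokenizer) need to sit at the root of the archive rather than inside a subfolder. If loading still fails, this is worth ruling out. A minimal sketch, assuming you have a local copy of the archive; root_files and has_config_at_root are hypothetical helper names.

```python
import tarfile


def root_files(archive_path):
    """Return the names of regular files stored at the root of a .tar.gz archive."""
    roots = set()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            # Drop empty and "." components so "./config.json" counts as a root file.
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if member.isfile() and len(parts) == 1:
                roots.add(parts[0])
    return roots


def has_config_at_root(archive_path):
    """True when config.json is stored directly at the archive root."""
    return "config.json" in root_files(archive_path)
```

For example, has_config_at_root("model.tar.gz") returning False would suggest the archive was packed with an extra top-level folder.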
  5. Deployed:
llm = llm_model.deploy(
  initial_instance_count=1, instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model
)
  6. Ran inference:
data = {
   "inputs": "What is the Capital of California."
}

payload = {
  # note: this sends the JSON-encoded dict as the prompt; use data["inputs"] to send the bare question
  "inputs": json.dumps(data),
  "parameters": {
    "top_p": 0.6,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 512,
    "repetition_penalty": 1.03,
  }
}

# send request to endpoint
response = llm.predict(payload)
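For reuse, the request can be wrapped in a couple of small helpers. This is only a sketch: build_payload and first_generated_text are hypothetical names, the prompt is passed as a plain string rather than a JSON-encoded dict, and sample_response is a made-up stand-in for what the endpoint returns (as far as I know, a list with one dict holding a generated_text key).

```python
import json


def build_payload(prompt, **parameters):
    """Hypothetical helper: assemble a request body with a plain-string prompt."""
    if not isinstance(prompt, str):
        raise TypeError("prompt must be a string")
    return {"inputs": prompt, "parameters": parameters}


def first_generated_text(response):
    """Pull the generated text out of a response shaped like [{'generated_text': ...}]."""
    return response[0]["generated_text"]


payload = build_payload(
    "What is the Capital of California.",
    top_p=0.6,
    temperature=0.9,
    top_k=50,
    max_new_tokens=512,
    repetition_penalty=1.03,
)

# The predictor serializes the dict for you; json.dumps only shows the body that goes over the wire.
body = json.dumps(payload)

# Made-up stand-in for an endpoint response, used here only to show the parsing step.
sample_response = [{"generated_text": "Sacramento"}]
print(first_generated_text(sample_response))
```

With a live endpoint, the same helper would be used as first_generated_text(llm.predict(payload)).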


Now I’ll be trying to replicate this with a model tuned on my own data!

Feel free to reach out if anyone has Qs on this.