Error loading finetuned llama2 model while running inference

Alright, I finally got it working! Another discussion about the same issue got me there (QLoRA trained LLaMA2 13B deployment error on SageMaker using text generation inference image).

Here’s what I did:

  1. Instead of deploying directly after tuning, I created a HuggingFace Model from the S3 archive of my tuned model
  2. Used the following image_uri, hardcoding the URI instead of pulling it with get_huggingface_llm_image_uri(), which (at least a few weeks ago) wasn’t returning the most up-to-date version with LLaMA-2 support:
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04-v1.0"
  3. Used the following Configuration Parameters:
config = {
  'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
  'SM_NUM_GPUS': json.dumps(1), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
  'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192), 
}
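The three token limits above have to nest: an input can’t be longer than a single sequence’s total budget, and a single sequence can’t exceed the batch budget. A minimal sketch of that sanity check (make_tgi_config is a hypothetical helper of mine, not part of the SageMaker SDK):

```python
import json

def make_tgi_config(max_input, max_total, max_batch_total, num_gpus=1):
    # Validate that the TGI token limits nest before baking them
    # into the container environment.
    assert max_input < max_total, "input must leave room for generated tokens"
    assert max_total <= max_batch_total, "one sequence can't exceed the batch budget"
    return {
        'HF_MODEL_ID': "/opt/ml/model",       # path where SageMaker extracts the model
        'SM_NUM_GPUS': json.dumps(num_gpus),  # GPUs used per replica
        'MAX_INPUT_LENGTH': json.dumps(max_input),
        'MAX_TOTAL_TOKENS': json.dumps(max_total),
        'MAX_BATCH_TOTAL_TOKENS': json.dumps(max_batch_total),
    }

config = make_tgi_config(1024, 2048, 8192)
print(config['MAX_TOTAL_TOKENS'])
```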
  4. Created the Model
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()
s3_model_uri = "s3://{your_path_here}/output/model.tar.gz"
instance_type = "ml.g5.4xlarge"

llm_model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    model_data=s3_model_uri,
    env=config
)
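One thing to watch with the archive itself: since HF_MODEL_ID points straight at /opt/ml/model (where SageMaker extracts model.tar.gz), the model files need to sit at the root of the archive, not inside a subfolder. A sketch of packaging it ("tuned-model" is a placeholder directory name, not from my actual run):

```python
import os
import tarfile

# Placeholder model directory; in practice it holds config.json, the
# tokenizer files, and the weight shards from your tuning job.
os.makedirs("tuned-model", exist_ok=True)
for name in ("config.json", "tokenizer.json"):
    open(os.path.join("tuned-model", name), "w").close()

# Add each file with arcname=name so it lands at the archive ROOT
# (a leading "tuned-model/" prefix would break loading at /opt/ml/model).
with tarfile.open("model.tar.gz", "w:gz") as tar:
    for name in os.listdir("tuned-model"):
        tar.add(os.path.join("tuned-model", name), arcname=name)

# Verify: entries should be bare filenames with no folder prefix.
with tarfile.open("model.tar.gz", "r:gz") as tar:
    print(sorted(tar.getnames()))
```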
  5. Deployed
health_check_timeout = 600  # seconds; give the container 10 minutes to load the model

llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
)
  6. Ran Inference
payload = {
  "inputs": "What is the capital of California?",  # pass the raw prompt string; don't json.dumps() it first
  "parameters": {
    "top_p": 0.6,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 512,
    "repetition_penalty": 1.03,
  }
}

# send request to endpoint
response = llm.predict(payload)

print(response[0]["generated_text"])
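For reference, what the endpoint actually receives is the JSON-serialized payload, so "inputs" has to carry the raw prompt string. A quick sketch of the request body shape TGI expects (the prompt text is just an example) and why double-encoding the prompt breaks it:

```python
import json

# The body TGI's generate endpoint expects: a raw prompt string under
# "inputs" plus a "parameters" object. If you json.dumps() the prompt
# first, the model sees a JSON blob as its input instead of the question.
payload = {
    "inputs": "What is the capital of California?",
    "parameters": {"max_new_tokens": 512, "temperature": 0.9},
}

body = json.dumps(payload)      # what goes over the wire
decoded = json.loads(body)      # what the server parses back out
print(decoded["inputs"])
```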

Now I’ll be trying to replicate this with a model tuned on my own data!

Feel free to reach out if anyone has Qs on this.
