Alright, I finally got it working! Another Discussion about the same issue got me there (QLoRA trained LLaMA 2 13B deployment error on SageMaker using the text generation inference image).
Here’s what I did:
- Instead of deploying directly after tuning, I created a HuggingFace Model from the S3 archive of my tuned model
- Used the following image_uri, hardcoding it instead of pulling it with get_huggingface_llm_image_uri(), which at least a few weeks ago wasn't returning the most up-to-date version with LLaMA-2 support:
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04-v1.0"
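For reference, the usual SDK lookup looks roughly like the snippet below; the version string here is my assumption, and pinning it explicitly is what I'd try first if the hardcoded URI ever goes stale:

from sagemaker.huggingface import get_huggingface_llm_image_uri

# Resolve the TGI image via the SDK; at the time this returned an older
# image without LLaMA-2 support, hence the hardcoded URI above.
image_uri = get_huggingface_llm_image_uri("huggingface", version="0.9.3")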
- Used the following configuration parameters:

import json

config = {
    'HF_MODEL_ID': "/opt/ml/model",              # path where SageMaker unpacks the model archive
    'SM_NUM_GPUS': json.dumps(1),                # number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(1024),        # max length of the input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),        # max length of the generation (including input text)
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),  # limit on total tokens processed in one batch
}
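Side note (my own addition, not from the original thread): HF_MODEL_ID = "/opt/ml/model" makes TGI load whatever SageMaker unpacks from model.tar.gz, so the archive needs the usual Hugging Face layout (config, tokenizer, weights) at its root. A quick local sanity check before uploading might look like this; the local filename is a placeholder:

import tarfile

# Hypothetical local copy of the tuned-model archive
with tarfile.open("model.tar.gz") as tar:
    names = tar.getnames()

# Expect the standard Hugging Face files at the archive root
print(names)
assert any(n.endswith("config.json") for n in names)
assert any(n.endswith((".safetensors", ".bin")) for n in names)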
- Created the model

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()  # or the execution role you already have in scope

s3_model_uri = "s3://{your_path_here}/output/model.tar.gz"
instance_type = "ml.g5.4xlarge"

llm_model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    model_data=s3_model_uri,
    env=config,
)
- Deployed
health_check_timeout = 600  # give the container 10 minutes to load the model

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)
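If your notebook kernel restarts after deployment, you shouldn't need to redeploy; as far as I know you can re-attach to the running endpoint like this (the endpoint name is a placeholder):

from sagemaker.huggingface import HuggingFacePredictor

# Hypothetical endpoint name; use the one returned/printed by deploy()
llm = HuggingFacePredictor(endpoint_name="your-endpoint-name")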
- Ran Inference
payload = {
    "inputs": "What is the capital of California?",
    "parameters": {
        "top_p": 0.6,
        "temperature": 0.9,
        "top_k": 50,
        "max_new_tokens": 512,
        "repetition_penalty": 1.03,
    },
}
# send request to endpoint
response = llm.predict(payload)
print(response[0]["generated_text"])
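And when you're done experimenting, cleaning up avoids idle GPU charges (my own addition, standard SageMaker SDK calls):

# Delete the model and the endpoint when finished
llm.delete_model()
llm.delete_endpoint()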
Now I’ll be trying to replicate this with a model tuned on my own data!
Feel free to reach out if anyone has Qs on this.