Alright, I finally got it working! Another Discussion about the same issue got me there (QLoRA trained LLaMA 2 13B deployment error on SageMaker using the text generation inference image).
Here’s what I did:
- Instead of deploying directly after tuning, I created a HuggingFace Model from the S3 archive of my tuned model
- Used the following image_uri, hardcoding it instead of pulling it with get_huggingface_llm_image_uri(), which at least a few weeks ago wasn't returning the most up-to-date version with LLaMA-2 support:
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04-v1.0"
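For reference, the usual SDK lookup looks roughly like the snippet below; the version string here is my assumption, and pinning it explicitly is what I'd try first if the hardcoded URI ever goes stale:

from sagemaker.huggingface import get_huggingface_llm_image_uri

# Resolve the TGI image via the SDK; at the time this returned an older
# image without LLaMA-2 support, hence the hardcoded URI above.
image_uri = get_huggingface_llm_image_uri("huggingface", version="0.9.3")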
- Used the following configuration parameters:

import json

config = {
    'HF_MODEL_ID': "/opt/ml/model",              # path where SageMaker unpacks the model archive
    'SM_NUM_GPUS': json.dumps(1),                # number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(1024),        # max length of the input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),        # max length of the generation (including input text)
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),  # limit on total tokens processed in one batch
}
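Side note (my own addition, not from the original thread): HF_MODEL_ID = "/opt/ml/model" makes TGI load whatever SageMaker unpacks from model.tar.gz, so the archive needs the usual Hugging Face layout (config, tokenizer, weights) at its root. A quick local sanity check before uploading might look like this; the local filename is a placeholder:

import tarfile

# Hypothetical local copy of the tuned-model archive
with tarfile.open("model.tar.gz") as tar:
    names = tar.getnames()

# Expect the standard Hugging Face files at the archive root
print(names)
assert any(n.endswith("config.json") for n in names)
assert any(n.endswith((".safetensors", ".bin")) for n in names)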
- Created the model

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()  # or the execution role you already have in scope

s3_model_uri = "s3://{your_path_here}/output/model.tar.gz"
instance_type = "ml.g5.4xlarge"

llm_model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    model_data=s3_model_uri,
    env=config,
)
- Deployed
health_check_timeout = 600  # give the container 10 minutes to load the model

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)
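If your notebook kernel restarts after deployment, you shouldn't need to redeploy; as far as I know you can re-attach to the running endpoint like this (the endpoint name is a placeholder):

from sagemaker.huggingface import HuggingFacePredictor

# Hypothetical endpoint name; use the one returned/printed by deploy()
llm = HuggingFacePredictor(endpoint_name="your-endpoint-name")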
- Ran Inference
payload = {
    "inputs": "What is the capital of California?",
    "parameters": {
        "top_p": 0.6,
        "temperature": 0.9,
        "top_k": 50,
        "max_new_tokens": 512,
        "repetition_penalty": 1.03,
    },
}
# send request to endpoint
response = llm.predict(payload)
print(response[0]["generated_text"])
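And when you're done experimenting, cleaning up avoids idle GPU charges (my own addition, standard SageMaker SDK calls):

# Delete the model and the endpoint when finished
llm.delete_model()
llm.delete_endpoint()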
Now I’ll be trying to replicate this with a model tuned on my own data!
Feel free to reach out if anyone has Qs on this.