I want to deploy the TheBloke/Llama-2-7b-chat-GPTQ model on SageMaker and it is giving me an error.
This is the code I’m running in a SageMaker notebook instance:
import sagemaker
import boto3

sess = sagemaker.Session()
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
import json
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# sagemaker config
instance = "ml.g4dn.xlarge"
number_of_gpus = 1
health_check_timeout = 1000

# Define Model and Endpoint Configuration
hub = {
    'HF_MODEL_ID': 'TheBloke/Llama-2-7b-Chat-GPTQ',
    'SM_NUM_GPUS': json.dumps(1),
    'MAX_TOTAL_TOKENS': json.dumps(5000),  # note: MAX_TOTAL_TOKENS (plural), not MAX_TOTAL_TOKEN
    # Pass the token as a plain string; json.dumps() would wrap it in extra quotes.
    'HUGGING_FACE_HUB_TOKEN': "<REPLACE WITH YOUR TOKEN>"  # real token redacted
}

assert hub['HUGGING_FACE_HUB_TOKEN'] != "<REPLACE WITH YOUR TOKEN>", "Please set your Hugging Face Hub token"

huggingface_model = HuggingFaceModel(
    role=role,
    image_uri=get_huggingface_llm_image_uri("huggingface", version="0.9.3"),
    env=hub,
)

llm = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=instance,
    container_startup_health_check_timeout=health_check_timeout,
)
Error:
UnexpectedStatusException: Error hosting endpoint huggingface-pytorch-tgi-inference-2023-08-24-06-51-13-816: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..
my CloudWatch logs shows me this:
RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist
2023-08-24T12:42:01.865+05:00 2023-08-24T07:42:01.699855Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
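The missing-weight error above is what TGI typically reports when it loads a GPTQ checkpoint without being told the weights are quantized: it then looks for full-precision tensors such as model.layers.0.self_attn.q_proj.weight, which a GPTQ repo does not contain. A minimal sketch of the hub config, assuming the huggingface-llm container forwards HF_MODEL_QUANTIZE to TGI’s --quantize option (token omitted here):

```python
import json

# Sketch only: HF_MODEL_QUANTIZE tells TGI the checkpoint is GPTQ-quantized,
# so it loads the packed quantized tensors instead of looking for
# full-precision weights like model.layers.0.self_attn.q_proj.weight.
hub = {
    'HF_MODEL_ID': 'TheBloke/Llama-2-7b-Chat-GPTQ',
    'SM_NUM_GPUS': json.dumps(1),
    'HF_MODEL_QUANTIZE': 'gptq',           # maps to TGI's --quantize flag
    'MAX_TOTAL_TOKENS': json.dumps(4096),  # note the plural: MAX_TOTAL_TOKENS
}
```

The rest of the deployment code stays the same; only the env dict changes.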
Did you find a solution for this? I’m having the same trouble trying to load TheBloke/Llama-2-13b-Chat-GPTQ. I’d like to use a quantized model to reduce GPU memory requirements.
I’m following the instructions Hugging Face provides under the Deploy button (upper right) → Amazon SageMaker. The code is:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'TheBloke/Llama-2-13B-GPTQ',
    'SM_NUM_GPUS': json.dumps(1)
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.0.3"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)

# send request
predictor.predict({
    "inputs": "My name is Julien and I like to",
})
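As an aside, the TGI container also accepts a "parameters" field in the request payload for controlling generation. A sketch of such a payload (the values below are only illustrative, not recommendations):

```python
# Hypothetical payload for predictor.predict(); "parameters" maps to TGI's
# generation options, and the values below are only illustrative.
payload = {
    "inputs": "My name is Julien and I like to",
    "parameters": {
        "max_new_tokens": 128,  # cap on the number of generated tokens
        "temperature": 0.7,     # sampling temperature
        "do_sample": True,      # sample instead of greedy decoding
    },
}
# predictor.predict(payload)  # requires the endpoint deployed above
```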
Hi,
While trying to deploy a GPTQ model on SageMaker, I am getting the following error:
ValueError: Unsupported huggingface-llm version: 1.0.3. You may need to upgrade your SDK version (pip install -U sagemaker) for newer huggingface-llm versions. Supported huggingface-llm version(s): 0.6.0, 0.8.2, 0.9.3, 0.6, 0.8, 0.9.
Any advice on how to solve it?
I am using sagemaker 2.183.0.
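One thing worth checking: in a notebook, pip install -U sagemaker only takes effect after the kernel is restarted, so an old version can linger in the running session even though pip reports success. A small sketch for comparing the imported version against a floor (the "required" value is a placeholder; check the sagemaker changelog for the release that actually added the 1.0.3 image):

```python
def version_tuple(v: str) -> tuple:
    """Turn a version string like '2.183.0' into (2, 183, 0) for comparison."""
    return tuple(int(part) for part in v.split("."))

installed = "2.183.0"  # e.g. sagemaker.__version__ as reported in the session
required = "2.175.0"   # placeholder floor; check the sagemaker changelog

if version_tuple(installed) < version_tuple(required):
    print("Upgrade: pip install -U sagemaker, then restart the kernel")
else:
    print("SDK looks new enough; restart the kernel if the error persists")
```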
Hi @philschmid, I have tried updating to the new sagemaker SDK, but I am still facing the error below.
UnexpectedStatusException: Error hosting endpoint Wizardcoder: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..
Below is the code I am trying to deploy:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'TheBloke/WizardCoder-Python-13B-V1.0-GPTQ',
    'SM_NUM_GPUS': json.dumps(4),
    'HF_MODEL_QUANTIZE': 'gptq',
    # 'MAX_INPUT_LENGTH': json.dumps(2048),
    # 'MAX_TOTAL_TOKENS': json.dumps(4096),
    # 'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),
    # 'HF_API_TOKEN': '<REPLACE WITH YOUR TOKEN>'  # real token redacted
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.0.3"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.12xlarge",
    container_startup_health_check_timeout=400,
    endpoint_name="Wizardcoder",
)

predictor.predict({
    "inputs": "Create snake game in python:",
})
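When the ping health check fails, the container logs are the only real diagnostic, and SageMaker writes them to a CloudWatch log group named after the endpoint. A small helper for locating that group (the actual boto3 lookup is commented out because it needs live AWS credentials):

```python
def endpoint_log_group(endpoint_name: str) -> str:
    # SageMaker's naming convention for endpoint container logs.
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"

print(endpoint_log_group("Wizardcoder"))
# -> /aws/sagemaker/Endpoints/Wizardcoder

# With credentials configured, the streams could then be listed with e.g.:
# boto3.client("logs").describe_log_streams(
#     logGroupName=endpoint_log_group("Wizardcoder"),
#     orderBy="LastEventTime", descending=True)
```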
Thanks for the guidance. I’ll increase the quota and test on the g5 instance. I previously ran the WizardCoder 15B base model successfully on a g4dn.12xlarge, but I faced issues when trying TheBloke’s WizardCoder 13B on the same instance. Could you explain why the 15B model worked on this smaller instance type, but the 13B didn’t?