Deployment issue on SageMaker

I want to deploy the TheBloke/Llama-2-7b-Chat-GPTQ model on SageMaker, and it is giving me the error shown below.
This is the code I'm running in a SageMaker notebook instance:

import sagemaker
import boto3
sess = sagemaker.Session()
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
print(f"sagemaker role arn:{role}")
print(f"sagemaker session region {sess.boto_region_name}")
import json
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# sagemaker config
instance = "ml.g4dn.xlarge"
number_of_gpus = 1
health_check_timeout = 1000
# Define Model and Endpoint Configuration
hub = {
    'HF_MODEL_ID': 'TheBloke/Llama-2-7b-Chat-GPTQ',
    'SM_NUM_GPUS': json.dumps(number_of_gpus),
    'MAX_TOTAL_TOKENS': json.dumps(5000),
    'HUGGING_FACE_HUB_TOKEN': "<REPLACE WITH YOUR TOKEN>"  # only needed for gated/private models
}
assert hub['HUGGING_FACE_HUB_TOKEN'] != "<REPLACE WITH YOUR TOKEN>", "Please set your Hugging Face Hub token"

huggingface_model = HuggingFaceModel(
    role=role,
    image_uri=get_huggingface_llm_image_uri("huggingface", version="0.9.3"),
    env=hub,
)
llm = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=instance,
    container_startup_health_check_timeout=health_check_timeout,
)

Error:

UnexpectedStatusException: Error hosting endpoint huggingface-pytorch-tgi-inference-2023-08-24-06-51-13-816: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..

My CloudWatch logs show this:

RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist

2023-08-24T12:42:01.865+05:00  2023-08-24T07:42:01.699855Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
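For anyone else hitting the same ping-health-check failure, here is a minimal sketch of pulling those endpoint logs with boto3 instead of the console. It assumes the default /aws/sagemaker/Endpoints/&lt;endpoint-name&gt; log group that SageMaker creates for hosted endpoints; the endpoint name is just the one from the error above.

import boto3

endpoint_name = "huggingface-pytorch-tgi-inference-2023-08-24-06-51-13-816"  # name from the error above
log_group = f"/aws/sagemaker/Endpoints/{endpoint_name}"

logs = boto3.client("logs")

# Most recently active log stream for this endpoint.
streams = logs.describe_log_streams(
    logGroupName=log_group,
    orderBy="LastEventTime",
    descending=True,
    limit=1,
)["logStreams"]

# Print the latest events; TGI reports shard/startup failures here.
if streams:
    events = logs.get_log_events(
        logGroupName=log_group,
        logStreamName=streams[0]["logStreamName"],
        startFromHead=False,
        limit=50,
    )["events"]
    for event in events:
        print(event["message"])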

I found a similar issue; I guess it may be caused by a network issue.

Did you find a solution for this? I’m having the same trouble trying to load TheBloke/Llama-2-13b-Chat-GPTQ. I’d like to use a quantized model to save on GPU memory requirements.
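Back-of-the-envelope, ignoring the KV cache and runtime overhead, 4-bit GPTQ cuts the weight footprint by roughly 4x versus fp16:

params = 13e9                       # Llama-2-13B parameter count
fp16_gb = params * 2 / 1e9          # 2 bytes per weight in fp16  -> ~26 GB
gptq_4bit_gb = params * 0.5 / 1e9   # ~4 bits per weight with GPTQ -> ~6.5 GB (plus small scale/zero-point overhead)
print(f"fp16: ~{fp16_gb:.0f} GB, 4-bit GPTQ: ~{gptq_4bit_gb:.1f} GB")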

We are working on releasing TGI 1.0.3; with that it should work correctly, including GPTQ weights.


Thanks, Philipp. Just to check: has a working update been released? @philschmid

Yes, it got released.

Is anyone else still having issues even with 1.0.3? I just tried to deploy the model TheBloke/Llama-2-13B-chat-GPTQ and got:

RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist

@philschmid any idea?

Thanks in advance.


@josecordero can you please share your code? I successfully deployed the non-chat version (TheBloke/Llama-2-13B-GPTQ on the Hugging Face Hub).

Thanks @philschmid,

I’m following the instructions Hugging Face provides under the Deploy button (upper right of the model page) → Amazon SageMaker. The code is:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
	role = sagemaker.get_execution_role()
except ValueError:
	iam = boto3.client('iam')
	role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
	'HF_MODEL_ID':'TheBloke/Llama-2-13B-GPTQ',
	'SM_NUM_GPUS': json.dumps(1)
}



# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
	image_uri=get_huggingface_llm_image_uri("huggingface",version="1.0.3"),
	env=hub,
	role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
	initial_instance_count=1,
	instance_type="ml.g5.2xlarge",
	container_startup_health_check_timeout=300,
)

# send request
predictor.predict({
	"inputs": "My name is Julien and I like to",
})

You need to specify the quantize parameter in the hub config when deploying a GPTQ model:

'HF_MODEL_QUANTIZE' : 'gptq' 
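So the hub configuration from the snippet above becomes, roughly:

hub = {
	'HF_MODEL_ID': 'TheBloke/Llama-2-13B-GPTQ',
	'SM_NUM_GPUS': json.dumps(1),
	'HF_MODEL_QUANTIZE': 'gptq'   # tell TGI to load the GPTQ-quantized weights
}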

That worked like a charm. Thanks @philschmid

Hi,
While trying to deploy a GPTQ model on SageMaker I am getting the following error:

ValueError: Unsupported huggingface-llm version: 1.0.3. You may need to upgrade your SDK version (pip install -U sagemaker) for newer huggingface-llm versions. Supported huggingface-llm version(s): 0.6.0, 0.8.2, 0.9.3, 0.6, 0.8, 0.9.

Any advice on how to solve it?
I am using sagemaker 2.183.0.

Can you update your sagemaker SDK? 1.0.3 is available, which supports GPTQ.
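For example (a quick sketch; after upgrading, restart the notebook kernel so the new SDK is picked up):

# In the notebook: pip install -U sagemaker   (then restart the kernel)
import sagemaker
from sagemaker.huggingface import get_huggingface_llm_image_uri

print(sagemaker.__version__)
# Should resolve without the "Unsupported huggingface-llm version" error once the SDK is recent enough:
print(get_huggingface_llm_image_uri("huggingface", version="1.0.3"))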

sagemaker 2.183.0 is the latest version of the sagemaker SDK.

Hi @philschmid, I have tried updating to the new sagemaker SDK, but I am still facing the error below.

UnexpectedStatusException: Error hosting endpoint Wizardcoder: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..

Below is the code I am running to deploy the model:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID':'TheBloke/WizardCoder-Python-13B-V1.0-GPTQ',
    'SM_NUM_GPUS': json.dumps(4),
    'HF_MODEL_QUANTIZE' : 'gptq',
    # 'MAX_INPUT_LENGTH' : json.dumps(2048),
    # 'MAX_TOTAL_TOKENS' : json.dumps(4096),
    # 'MAX_BATCH_TOTAL_TOKENS' : json.dumps(8192),
    # 'HF_API_TOKEN': '<YOUR TOKEN>',  # only needed for gated/private models
}


# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface",version="1.0.3"),
    env=hub,
    role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.12xlarge",
    container_startup_health_check_timeout=400,
    endpoint_name="Wizardcoder"
   )

predictor.predict({
    "inputs": "Create snake game in python:",
})

Please help. Thanks in advance :)

I am not sure, but I don’t think T4 (g4dn) instances work with GPTQ. Can you try g5?
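For example, adapting the snippet above for a single-GPU g5 instance might look roughly like this (a sketch only; the instance size and timeout are assumptions, not tested values, and it reuses the imports and role from your code):

hub = {
    'HF_MODEL_ID': 'TheBloke/WizardCoder-Python-13B-V1.0-GPTQ',
    'SM_NUM_GPUS': json.dumps(1),   # ml.g5.2xlarge has a single A10G GPU
    'HF_MODEL_QUANTIZE': 'gptq',
}

huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.0.3"),
    env=hub,
    role=role,
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=400,
    endpoint_name="Wizardcoder",
)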

Thanks for the guidance. I’ll increase the quota and test on the g5 instance. I previously ran the WizardCoder 15B base model successfully on a g4dn.12xlarge. However, I faced issues when trying TheBloke’s WizardCoder 13B GPTQ on the same instance. Could you explain why the 15B model worked on this smaller instance type, but the 13B didn’t?