Deployment issue on SageMaker

I want to deploy the TheBloke/Llama-2-7b-Chat-GPTQ model on SageMaker, and it is giving me the error shown below.
This is the code I'm running in a SageMaker notebook instance:

import sagemaker
import boto3
sess = sagemaker.Session()
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
print(f"sagemaker role arn:{role}")
print(f"sagemaker session region {sess.boto_region_name}")
import json
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# sagemaker config
instance = "ml.g4dn.xlarge"
number_of_gpus = 1
health_check_timeout = 1000
# Define Model and Endpoint Configuration
hub = {
    'HF_MODEL_ID': 'TheBloke/Llama-2-7b-Chat-GPTQ',
    'SM_NUM_GPUS': json.dumps(number_of_gpus),
    'MAX_TOTAL_TOKENS': json.dumps(5000),
    'HUGGING_FACE_HUB_TOKEN': "<REPLACE WITH YOUR TOKEN>"  # only needed for gated/private models
}
assert hub['HUGGING_FACE_HUB_TOKEN'] != "<REPLACE WITH YOUR TOKEN>", "Please set your Hugging Face Hub token"

huggingface_model = HuggingFaceModel(
    role=role,
    image_uri=get_huggingface_llm_image_uri("huggingface", version="0.9.3"),
    env=hub,
)
llm = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=instance,
    container_startup_health_check_timeout=health_check_timeout,
)

Error:

UnexpectedStatusException: Error hosting endpoint huggingface-pytorch-tgi-inference-2023-08-24-06-51-13-816: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..

My CloudWatch logs show this:

RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist

2023-08-24T12:42:01.865+05:00  2023-08-24T07:42:01.699855Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
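For anyone else hitting the same ping-health-check failure, here is a minimal sketch of pulling those endpoint logs with boto3 instead of the console. It assumes the default /aws/sagemaker/Endpoints/&lt;endpoint-name&gt; log group that SageMaker creates for hosted endpoints; the endpoint name is just the one from the error above.

import boto3

endpoint_name = "huggingface-pytorch-tgi-inference-2023-08-24-06-51-13-816"  # name from the error above
log_group = f"/aws/sagemaker/Endpoints/{endpoint_name}"

logs = boto3.client("logs")

# Most recently active log stream for this endpoint.
streams = logs.describe_log_streams(
    logGroupName=log_group,
    orderBy="LastEventTime",
    descending=True,
    limit=1,
)["logStreams"]

# Print the latest events; TGI reports shard/startup failures here.
if streams:
    events = logs.get_log_events(
        logGroupName=log_group,
        logStreamName=streams[0]["logStreamName"],
        startFromHead=False,
        limit=50,
    )["events"]
    for event in events:
        print(event["message"])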

I found a similar issue; I guess it may be caused by a network issue.

Did you find a solution for this? I’m having the same trouble trying to load TheBloke/Llama-2-13b-Chat-GPTQ. I’d like to use a quantized model to save on GPU memory requirements.
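Back-of-the-envelope, ignoring the KV cache and runtime overhead, 4-bit GPTQ cuts the weight footprint by roughly 4x versus fp16:

params = 13e9                       # Llama-2-13B parameter count
fp16_gb = params * 2 / 1e9          # 2 bytes per weight in fp16  -> ~26 GB
gptq_4bit_gb = params * 0.5 / 1e9   # ~4 bits per weight with GPTQ -> ~6.5 GB (plus small scale/zero-point overhead)
print(f"fp16: ~{fp16_gb:.0f} GB, 4-bit GPTQ: ~{gptq_4bit_gb:.1f} GB")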

We are working on releasing TGI 1.0.3; with that it should work correctly, including GPTQ weights.


Thanks, Philipp. Just to check: has a working update been released? @philschmid

Yes, it got released.

Is anyone else still having issues even with 1.0.3? I just tried to deploy the model TheBloke/Llama-2-13B-chat-GPTQ and got:

RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist

@philschmid any idea?

Thanks in advance.


@josecordero can you please share your code? I successfully deployed the non-chat version (TheBloke/Llama-2-13B-GPTQ on the Hugging Face Hub).

Thanks @philschmid,

I’m following the instructions Hugging Face provides under the Deploy button (upper right of the model page) → Amazon SageMaker. The code is:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
	role = sagemaker.get_execution_role()
except ValueError:
	iam = boto3.client('iam')
	role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
	'HF_MODEL_ID':'TheBloke/Llama-2-13B-GPTQ',
	'SM_NUM_GPUS': json.dumps(1)
}



# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
	image_uri=get_huggingface_llm_image_uri("huggingface",version="1.0.3"),
	env=hub,
	role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
	initial_instance_count=1,
	instance_type="ml.g5.2xlarge",
	container_startup_health_check_timeout=300,
)

# send request
predictor.predict({
	"inputs": "My name is Julien and I like to",
})

You need to specify the quantize parameter in the hub config when deploying a GPTQ model:

'HF_MODEL_QUANTIZE' : 'gptq' 
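So the hub configuration from the snippet above becomes, roughly:

hub = {
	'HF_MODEL_ID': 'TheBloke/Llama-2-13B-GPTQ',
	'SM_NUM_GPUS': json.dumps(1),
	'HF_MODEL_QUANTIZE': 'gptq'   # tell TGI to load the GPTQ-quantized weights
}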

That worked like a charm. Thanks @philschmid

Hi,
While trying to deploy a GPTQ model on SageMaker I am getting the following error:

ValueError: Unsupported huggingface-llm version: 1.0.3. You may need to upgrade your SDK version (pip install -U sagemaker) for newer huggingface-llm versions. Supported huggingface-llm version(s): 0.6.0, 0.8.2, 0.9.3, 0.6, 0.8, 0.9.

Any advice on how to solve it?
I am using sagemaker 2.183.0.

Can you update your sagemaker SDK? 1.0.3 is available, which supports GPTQ.
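For example (a quick sketch; after upgrading, restart the notebook kernel so the new SDK is picked up):

# In the notebook: pip install -U sagemaker   (then restart the kernel)
import sagemaker
from sagemaker.huggingface import get_huggingface_llm_image_uri

print(sagemaker.__version__)
# Should resolve without the "Unsupported huggingface-llm version" error once the SDK is recent enough:
print(get_huggingface_llm_image_uri("huggingface", version="1.0.3"))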

sagemaker 2.183.0 is the latest version of the sagemaker SDK.

Hi @philschmid, I have tried updating to the new sagemaker SDK, but I am still facing the error below.

UnexpectedStatusException: Error hosting endpoint Wizardcoder: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..

Below is the code I am running to deploy the model:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID':'TheBloke/WizardCoder-Python-13B-V1.0-GPTQ',
    'SM_NUM_GPUS': json.dumps(4),
    'HF_MODEL_QUANTIZE' : 'gptq',
    # 'MAX_INPUT_LENGTH' : json.dumps(2048),
    # 'MAX_TOTAL_TOKENS' : json.dumps(4096),
    # 'MAX_BATCH_TOTAL_TOKENS' : json.dumps(8192),
    # 'HF_API_TOKEN': '<YOUR TOKEN>',  # only needed for gated/private models
}


# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface",version="1.0.3"),
    env=hub,
    role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.12xlarge",
    container_startup_health_check_timeout=400,
    endpoint_name="Wizardcoder"
   )

predictor.predict({
    "inputs": "Create snake game in python:",
})

Please help. Thanks in advance :)

I am not sure, but I don’t think T4 (g4dn) instances work with GPTQ. Can you try g5?
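For example, adapting the snippet above for a single-GPU g5 instance might look roughly like this (a sketch only; the instance size and timeout are assumptions, not tested values, and it reuses the imports and role from your code):

hub = {
    'HF_MODEL_ID': 'TheBloke/WizardCoder-Python-13B-V1.0-GPTQ',
    'SM_NUM_GPUS': json.dumps(1),   # ml.g5.2xlarge has a single A10G GPU
    'HF_MODEL_QUANTIZE': 'gptq',
}

huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.0.3"),
    env=hub,
    role=role,
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=400,
    endpoint_name="Wizardcoder",
)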

Thanks for the guidance. I’ll increase the quota and test on the g5 instance. I previously ran the WizardCoder 15B base model successfully on a g4dn.12xlarge. However, I faced issues when trying TheBloke’s WizardCoder 13B GPTQ on the same instance. Could you explain why the 15B model worked on this smaller instance type, but the 13B didn’t?