I have deployed the ‘meta-llama/Meta-Llama-3-8B-Instruct’ model using HuggingFaceModel. My requirement is to generate JSON output.
The model responds with the full JSON output when I make a call using the HuggingFaceModel predictor’s predict method. Here is the code snippet:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']
config = {
    'HF_MODEL_ID': "meta-llama/Meta-Llama-3-8B-Instruct",  # model_id from hf.co/models
    'SM_NUM_GPUS': "1",  # Number of GPUs used per replica
    'MAX_INPUT_LENGTH': "4096",  # Max length of input text
    'MAX_TOTAL_TOKENS': "8192",  # Max length of the generation (including input text)
    'MAX_BATCH_TOTAL_TOKENS': "8192",  # Limits the number of tokens that can be processed in parallel during generation
    'MESSAGES_API_ENABLED': "true",  # Enable the Messages API
    'HUGGING_FACE_HUB_TOKEN': "hf_XXXX"
}
# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.4.2"),
    env=config,
    role=role,
)
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.4xlarge",
    container_startup_health_check_timeout=300,
)
standard_content = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
System message
<|eot_id|><|start_header_id|>user<|end_header_id|>
User Prompt
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Model Output
<|eot_id|><|start_header_id|>user<|end_header_id|>
User Prompt
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Model Output
<|eot_id|><|start_header_id|>user<|end_header_id|>
User Prompt
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Model Output
<|eot_id|><|start_header_id|>user<|end_header_id|>
User Prompt
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Model Output
<|eot_id|><|start_header_id|>user<|end_header_id|>
User Prompt
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Model Output
<|eot_id|><|start_header_id|>user<|end_header_id|>
User Prompt
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Model Output
<|eot_id|><|start_header_id|>user<|end_header_id|>
User Prompt
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Model Output
<|eot_id|><|start_header_id|>user<|end_header_id|>
User Prompt
<|eot_id|><|start_header_id|>assistant<|end_header_id|> <|eot_id|>
Model Output
<|eot_id|><|start_header_id|>user<|end_header_id|> <|eot_id|>
User Prompt
<|eot_id|><|start_header_id|>assistant<|end_header_id|> <|eot_id|>
Model Output
<|eot_id|><|start_header_id|>user<|end_header_id|> <|eot_id|>
User Prompt
<|eot_id|><|start_header_id|>assistant<|end_header_id|> <|eot_id|>
""";
messages = [
    {"role": "system", "content": "You read an email content extract meaningful and important information to create a meeting request."},
    {"role": "user", "content": standard_content}
]
parameters = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder, needed
    "max_tokens": 5000,
    "max_new_tokens": 5000,
    "return_full_text": True,
    "temperature": 1.0,
    "use_cache": "false",
}
chat = predictor.predict({"messages": messages, **parameters})
print(chat["choices"][0]["message"]["content"].strip())
I get the full JSON output here. Example output:
{
  "language": "English",
  "tone": "casual",
  "sentiment": "positive",
  "availability": [
    {
      "date": "None",
      "location": "charming bistro"
    }
  ]
}
However, when I use the SageMaker invoke_endpoint API to get the JSON output, the response is partial. Here is the code snippet:
import json
import boto3

client = boto3.client("sagemaker-runtime")  # runtime client used below for invoke_endpoint

# Prompt to generate
messages = [
    {"role": "system", "content": "You read an email content extract meaningful and important information to create a meeting request."},
    {"role": "user", "content": final_str},  # final_str holds the prompt text built earlier
]
# Generation arguments
parameters = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder, needed
    "max_tokens": 5000,
    "max_new_tokens": 5000,
    "return_full_text": True,
    "temperature": 1.0,
    "use_cache": "false",
    "truncation": True,
    "max_length": 5000,
    "padding": True,
    "details": True,
    "model_max_length": 5000,
}
body = {
    "messages": messages,
    "parameters": parameters,
    "model": 'meta-llama/Meta-Llama-3-8B-Instruct',
}
payload = json.dumps(body)
# Invoke the SageMaker endpoint
response = client.invoke_endpoint(
    EndpointName='huggingface-pytorch-tgi-inference-2024-07-15-14-46-11-164',
    ContentType='application/json',
    Accept='application/json',
    Body=payload
)
response_body = response['Body'].read()
result = json.loads(response_body)
print(result['choices'][0]['message']['content'])
Here is the output (partial, invalid JSON):
{
  "language": "English",
  "tone": "casual",
  "sentiment": "positive",
  "availability": [
    {
      "date": "None",
      "location":
What am I doing wrong? How can I fix it?
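For what it’s worth, one check I still want to run (assuming the Messages API response follows the OpenAI-style schema and exposes a finish_reason per choice, which I have not confirmed) is whether the generation is being stopped by a token limit rather than being lost in transport:

# Diagnostic sketch only: look at the raw body size and the finish_reason.
# Assumes choices[0]["finish_reason"] exists in the Messages API response.
response = client.invoke_endpoint(
    EndpointName='huggingface-pytorch-tgi-inference-2024-07-15-14-46-11-164',
    ContentType='application/json',
    Accept='application/json',
    Body=payload,
)
raw = response['Body'].read()
print(len(raw))  # size of the raw HTTP response body
result = json.loads(raw)
print(result['choices'][0].get('finish_reason'))  # "length" would suggest the output hit a token limit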
I have another question:
I am deploying the Llama-3-8B-Instruct model and then invoking it later. I can’t use the predictor method of HuggingFaceModel, because the predictor object returned by deploy() is only available in the deployment code.
Is there a way to use the predictor method of HuggingFaceModel at inference time, against an already-deployed endpoint?
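What I am imagining is something along these lines, re-attaching a predictor to the already-running endpoint by name. This is only a sketch based on my reading of the SageMaker SDK (I assume HuggingFacePredictor accepts an endpoint_name and defaults to JSON serialization), and I have not verified it:

import sagemaker
from sagemaker.huggingface import HuggingFacePredictor

# Sketch: attach a predictor to the existing endpoint instead of re-deploying.
predictor = HuggingFacePredictor(
    endpoint_name='huggingface-pytorch-tgi-inference-2024-07-15-14-46-11-164',
    sagemaker_session=sagemaker.Session(),
)
chat = predictor.predict({"messages": messages, **parameters})
print(chat["choices"][0]["message"]["content"].strip())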