AWS SageMaker doesn't return the full response

I have deployed the ‘meta-llama/Meta-Llama-3-8B-Instruct’ model using HuggingFaceModel. My requirement is to generate JSON output.
The model responds with the full JSON output when I make a call using the HuggingFaceModel predictor's predict method. Here is the code snippet:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
	role = sagemaker.get_execution_role()
except ValueError:
	iam = boto3.client('iam')
	role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']


config = {
  'HF_MODEL_ID': "meta-llama/Meta-Llama-3-8B-Instruct", # model_id from hf.co/models
  'SM_NUM_GPUS': "1", # Number of GPUs used per replica
  'MAX_INPUT_LENGTH': "4096",  # Max length of input text
  'MAX_TOTAL_TOKENS': "8192",  # Max length of the generation (including input text)
  'MAX_BATCH_TOTAL_TOKENS': "8192",  # Limits the number of tokens that can be processed in parallel during the generation
  'MESSAGES_API_ENABLED': "true", # Enable the messages API
  'HUGGING_FACE_HUB_TOKEN': "hf_XXXX"
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
	image_uri=get_huggingface_llm_image_uri("huggingface",version="1.4.2"),
	env=config,
	role=role, 
) 

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
	initial_instance_count=1,
	instance_type="ml.g5.4xlarge",
	container_startup_health_check_timeout=300,
  )
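
As a side note, the endpoint name that deploy() generates can be read straight from the returned predictor, which avoids hard-coding the timestamped endpoint name later when calling invoke_endpoint; a minimal sketch, assuming the standard SageMaker Python SDK predictor attributes:

# The deployed endpoint name can be read from the predictor object
# and reused later for invoke_endpoint calls.
endpoint_name = predictor.endpoint_name
print(endpoint_name)  # e.g. huggingface-pytorch-tgi-inference-2024-07-15-14-46-11-164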



# The actual prompt contains about ten of these example user/assistant exchanges;
# the structure is shown here with placeholder text.
standard_content = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

System message

<|eot_id|><|start_header_id|>user<|end_header_id|>

User Prompt

<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Model Output

<|eot_id|><|start_header_id|>user<|end_header_id|>

User Prompt

<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Model Output

<|eot_id|><|start_header_id|>user<|end_header_id|>

User Prompt

<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

messages = [
    {"role": "system", "content": "You read an email's content and extract meaningful and important information to create a meeting request."},
    {"role": "user", "content": standard_content},
]

parameters = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct", # placeholder, needed
    "max_tokens": 5000,
    "max_new_tokens": 5000,
    "return_full_text": True,
    "temperature": 1.0,
    "use_cache": "false",
}

chat = predictor.predict({"messages": messages, **parameters})
print(chat["choices"][0]["message"]["content"].strip())

I get the full JSON output here. Example output:

{
  "language": "English",
  "tone": "casual",
  "sentiment": "positive",
  "availability": [
    {
      "date": "None",
      "location": "charming bistro"
    }
  ]
}

However, when I use the SageMaker invoke_endpoint call to get the JSON output, the response is partial. Here is the code snippet:

import json
import boto3

# Runtime client for calling the deployed endpoint
client = boto3.client("sagemaker-runtime")

# Prompt to generate
messages = [
    {"role": "system", "content": "You read an email's content and extract meaningful and important information to create a meeting request."},
    {"role": "user", "content": final_str},
]

# Generation arguments
parameters = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct", # placeholder, needed
    "max_tokens": 5000,
    "max_new_tokens": 5000,
    "return_full_text": True,
    "temperature": 1.0,
    "use_cache": "false",
    "truncation": True,
    "max_length": 5000,
    "padding": True,
    "details": True,
    "model_max_length": 5000,
}

body = {
    "messages": messages,
    "parameters": parameters,
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
}

payload = json.dumps(body)

# Invoke the SageMaker endpoint
response = client.invoke_endpoint(
    EndpointName='huggingface-pytorch-tgi-inference-2024-07-15-14-46-11-164',
    ContentType='application/json',
    Accept='application/json',
    Body=payload
)


response_body = response['Body'].read()
result = json.loads(response_body)
print(result['choices'][0]['message']['content'])

Here is the output (partial, invalid JSON):

{
"language": "English",
"tone": "casual",
"sentiment": "positive",
 "availability": [
      { 
          "date": "None",
          "location":

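To see why the generation stops early, the finish reason and token usage in the response can be inspected as well; a minimal sketch, assuming the OpenAI-style response schema that the TGI messages API returns:

# "length" would indicate that the generation hit the token limit
print(result["choices"][0].get("finish_reason"))
# Token counts for the request, if the container reports them
print(result.get("usage"))
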
What am I doing wrong, and how can I fix it?

I have another question:

I am deploying the Llama-3-8B-Instruct model and invoking it. I can't use the predictor returned by HuggingFaceModel, because it is only available at model deployment time.
Is there a way to use a HuggingFaceModel predictor at inference time, against an already deployed endpoint?

As per the documentation here, link, I re-created the predictor object used for inference.

Code Snippet:

from sagemaker.huggingface.model import HuggingFacePredictor

# Re-create a predictor for an endpoint that is already deployed
predictor = HuggingFacePredictor(endpoint_name='your-huggingface-endpoint-name')
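
With the predictor re-created from the endpoint name, the same messages-API call as before works at inference time; a minimal sketch, assuming the endpoint was deployed with MESSAGES_API_ENABLED as above and reusing the messages and parameters defined earlier:

# Same call pattern as with the predictor returned by deploy()
chat = predictor.predict({"messages": messages, **parameters})
print(chat["choices"][0]["message"]["content"].strip())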

The issue is resolved now.
