Error streaming from Llama 3 70B on SageMaker

I’ve deployed Llama 3 70B on SageMaker and was able to invoke the LLM from my SageMaker notebook after deploying. The problem is that when I try to run inference and stream the response back in my Python server, I get a weird error:

ModelError: An error occurred (ModelError) when calling the InvokeEndpointWithResponseStream operation: Received client error (422) from primary with message "Failed to deserialize the JSON body into the target type: missing field model at line 1 column 180"


Here is my code to invoke with streaming:

import json

import boto3
from transformers import AutoTokenizer

smr = boto3.client("sagemaker-runtime", region_name="us-east-1")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct", token="hf-token")

messages = [
    {"role": "system", "content": "You are a friendly AI Assistant"},
    {"role": "user", "content": "hi!"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

payload = {
    "max_new_tokens": 512,
    "eos_token_id": terminators,
    "do_sample": True,
    "temperature": 0.2,
    "top_p": 0.6,
    "return_full_text": False,
}

body = {
    "inputs": prompt,
    "parameters": payload,
    "stream": True,
}

response = smr.invoke_endpoint_with_response_stream(
    EndpointName="<my_endpoint>",
    Body=json.dumps(body),
    ContentType="application/json",
)
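
For reference, I read the stream back roughly like this (just a sketch; how the bytes inside each PayloadPart should be parsed depends on the container's streaming format):

# Rough sketch of consuming the EventStream returned by
# invoke_endpoint_with_response_stream: each event carries a PayloadPart
# whose Bytes field holds one chunk of the container's streamed output.
for event in response["Body"]:
    chunk = event.get("PayloadPart", {}).get("Bytes")
    if chunk:
        print(chunk.decode("utf-8"), end="", flush=True)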


Am I missing fields in the payload?

I have this same issue.

I was able to fix the streaming issue by applying a StreamDeserializer to the predictor after the deployment:

from sagemaker.base_deserializers import StreamDeserializer
predictor.deserializer = StreamDeserializer()
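
With that set, predict() no longer tries to parse the response as JSON; it hands back the raw HTTP response stream together with its content type. Roughly like this (a sketch, assuming the predictor's default JSON serializer so the request dict gets serialized for you):

# Sketch: StreamDeserializer makes predict() return the raw response stream
# and its content type, so the caller reads the bytes itself.
stream, content_type = predictor.predict(body)  # body = the request dict shown below
for line in stream.iter_lines():
    if line:
        print(line.decode("utf-8"))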


But I also had to change the request body:

body = {
    "messages": messages,
    "parameters": payload,
    "stream": True,
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
}
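
For what it's worth, with this body the chunks appear to come back as OpenAI-style chat-completion deltas; here is a rough sketch of pulling the text out of the stream from above (the field names are an assumption on my part and may differ between container versions):

import json

# Rough sketch: each streamed line is assumed to be an SSE "data: {...}"
# chunk in the OpenAI chat-completions format; adjust to what your
# container actually emits.
for line in stream.iter_lines():
    if not line or not line.startswith(b"data:"):
        continue
    data = line[len(b"data:"):].strip()
    if data == b"[DONE]":
        break
    delta = json.loads(data)["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)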

The body has to include the 'model' and 'messages' fields (and drop 'inputs'); I am not sure why. There still seems to be an issue when I stream: the model doesn't shut up and maxes out token generation. I don't know why it does this, because before I added the stream deserializer and just invoked the model normally, the response would not go on endlessly.