I've deployed Llama 3 70B on SageMaker and was able to invoke the LLM from my SageMaker notebook after deploying. The problem is that when I try to run inference and stream the response back in my Python server, I get a weird error:
ModelError: An error occurred (ModelError) when calling the InvokeEndpointWithResponseStream operation: Received client error (422) from primary with message "Failed to deserialize the JSON body into the target type: missing field model at line 1 column 180"
Here is my code to invoke the endpoint with streaming:
import json

import boto3
from transformers import AutoTokenizer

smr = boto3.client("sagemaker-runtime", region_name="us-east-1")

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct", token="hf-token"
)

messages = [
    {"role": "system", "content": "You are a friendly AI Assistant"},
    {"role": "user", "content": "hi!"},
]

# Build the Llama 3 chat prompt from the message list
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Stop generation on either the standard EOS token or <|eot_id|>
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

payload = {
    "max_new_tokens": 512,
    "eos_token_id": terminators,
    "do_sample": True,
    "temperature": 0.2,
    "top_p": 0.6,
    "return_full_text": False,
}

body = {
    "inputs": prompt,
    "parameters": payload,
    "stream": True,
}

response = smr.invoke_endpoint_with_response_stream(
    EndpointName="<my_endpoint>",
    Body=json.dumps(body),
    ContentType="application/json",
)
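For reference, this is roughly how I consume the event stream on the server side afterwards (a simplified sketch; variable names here are just for illustration and the real code buffers and parses the chunks):

# Iterate over the boto3 EventStream returned in response["Body"];
# each event may carry a PayloadPart with raw bytes of the streamed output
event_stream = response["Body"]
for event in event_stream:
    chunk = event.get("PayloadPart", {}).get("Bytes", b"")
    if chunk:
        # Each chunk is a UTF-8 encoded piece of the model's streamed output
        print(chunk.decode("utf-8"), end="", flush=True)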
Am I missing fields in the payload?