I have deployed a fine-tuned version of Zephyr 7B that supports function calling while retaining its general text-generation capabilities. I use the following code to get a streaming response from the endpoint:
request = {
    "inputs": query,
    "parameters": {
        "do_sample": self.do_sample,
        "max_new_tokens": 4096,
        "temperature": temperature,
        "top_k": top_k,
        "top_p": top_p,
        "repetition_penalty": repetition_penalty,
        "prompt_lookup_num_tokens": prompt_lookup_num_tokens,
        "stop": ["\nUser:", "<|endoftext|>", " User:", "###"],
    },
    "stream": True,
}
# self.client is a boto3 "sagemaker-runtime" client
response = self.client.invoke_endpoint_with_response_stream(
    EndpointName=self.endpoint_name,
    Body=json.dumps(request),
    ContentType="application/json",
)
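For reference, this is roughly how I consume the stream on the client side (a minimal sketch; the helper name is mine, and the parsing assumes the TGI container's newline-delimited "data:{...}" server-sent-event format, where a PayloadPart can end mid-line and so incomplete lines must be buffered):

import json

def iter_stream_tokens(response):
    """Yield generated text pieces from the boto3 event stream."""
    buffer = b""
    for event in response["Body"]:
        # Each event wraps a raw byte chunk; a chunk may split an SSE line,
        # so keep leftover bytes in the buffer until a full line arrives.
        buffer += event.get("PayloadPart", {}).get("Bytes", b"")
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            line = line.strip()
            if line.startswith(b"data:"):
                payload = json.loads(line[len(b"data:"):])
                yield payload["token"]["text"]

# usage:
# for piece in iter_stream_tokens(response):
#     print(piece, end="", flush=True)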
But whenever I run an inference request that asks the model to generate longer texts, the streaming stops after 60 seconds and I get the following error:
(ModelStreamError) when calling the InvokeEndpointWithResponseStream operation: Your model primary did not complete sending the inference response in the allotted time.
I am using the following environment configuration when deploying the model to the endpoint.
config = {
    'HF_MODEL_ID': "/opt/ml/model",                 # path where SageMaker stores the model
    'SM_NUM_GPUS': json.dumps(1),                   # number of GPUs used per replica
    'HF_MODEL_TRUST_REMOTE_CODE': 'true',           # trust execution of remote code
    'MAX_INPUT_LENGTH': json.dumps(10999),          # maximum input length from the user; must be less than MAX_TOTAL_TOKENS
    'MAX_TOTAL_TOKENS': json.dumps(11000),          # maximum tokens altogether (input + generation); set according to GPU limits
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(12000),
    'MAX_BATCH_PREFILL_TOKENS': json.dumps(12000),  # an addition of mine based on the docs
    'HF_MODEL_QUANTIZE': 'eetq',                    # eetq, bitsandbytes, bitsandbytes-nf4, or bitsandbytes-fp4
    'MAX_BATCH_SIZE': json.dumps(1),
    'SM_SERVER_TIMEOUT': json.dumps(120),
    'SAGEMAKER_MODEL_SERVER_TIMEOUT': json.dumps(300),
    'SAGEMAKER_TS_RESPONSE_TIMEOUT': json.dumps(300),
}
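For completeness, the surrounding deploy call looks roughly like this (a sketch; the image version, instance type, role, and S3 path are placeholders for my actual values):

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# TGI container image (version is a placeholder for the one I actually use)
llm_image = get_huggingface_llm_image_uri("huggingface", version="1.3.3")

model = HuggingFaceModel(
    role=role,                      # my SageMaker execution role
    image_uri=llm_image,
    model_data=s3_model_uri,        # S3 path to my fine-tuned weights
    env=config,                     # the environment dict above
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # placeholder instance type
    container_startup_health_check_timeout=600,
    endpoint_name=endpoint_name,
)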
But I still can't figure out how to keep the response stream alive for longer than 60 seconds.
Any help would be very much appreciated.