I have deployed a fine-tuned version of Zephyr 7B that supports function calling while retaining its general text-generation capabilities. I use the following code to get a streaming response from the endpoint:
request = {
    "inputs": query,
    "parameters": {
        "do_sample": self.do_sample,
        "max_new_tokens": 4096,
        "temperature": temperature,
        "top_k": top_k,
        "top_p": top_p,
        "repetition_penalty": repetition_penalty,
        "prompt_lookup_num_tokens": prompt_lookup_num_tokens,
        "stop": ["\nUser:", "<|endoftext|>", " User:", "###"],
    },
    "stream": True,
}
# self.client is a boto3 "sagemaker-runtime" client
response = self.client.invoke_endpoint_with_response_stream(
    EndpointName=self.endpoint_name,
    Body=json.dumps(request),
    ContentType="application/json",
)
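For reference, this is roughly how I consume the stream on the client side (a minimal sketch; the helper name is mine, and the parsing assumes the TGI container's newline-delimited "data:{...}" server-sent-event format, where a PayloadPart can end mid-line and so incomplete lines must be buffered):

import json

def iter_stream_tokens(response):
    """Yield generated text pieces from the boto3 event stream."""
    buffer = b""
    for event in response["Body"]:
        # Each event wraps a raw byte chunk; a chunk may split an SSE line,
        # so keep leftover bytes in the buffer until a full line arrives.
        buffer += event.get("PayloadPart", {}).get("Bytes", b"")
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            line = line.strip()
            if line.startswith(b"data:"):
                payload = json.loads(line[len(b"data:"):])
                yield payload["token"]["text"]

# usage:
# for piece in iter_stream_tokens(response):
#     print(piece, end="", flush=True)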
But whenever I run an inference request that asks the model to generate longer texts, the streaming stops after 60 seconds and I get the following error:
(ModelStreamError) when calling the InvokeEndpointWithResponseStream operation: Your model primary did not complete sending the inference response in the allotted time.
I am using the following environment configuration when deploying the model to the endpoint.
config = {
    'HF_MODEL_ID': "/opt/ml/model",                 # path where SageMaker stores the model
    'SM_NUM_GPUS': json.dumps(1),                   # number of GPUs used per replica
    'HF_MODEL_TRUST_REMOTE_CODE': 'true',           # trust execution of remote code
    'MAX_INPUT_LENGTH': json.dumps(10999),          # maximum input length from the user; must be less than MAX_TOTAL_TOKENS
    'MAX_TOTAL_TOKENS': json.dumps(11000),          # maximum tokens altogether (input + generation); set according to GPU limits
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(12000),
    'MAX_BATCH_PREFILL_TOKENS': json.dumps(12000),  # an addition of mine based on the docs
    'HF_MODEL_QUANTIZE': 'eetq',                    # eetq, bitsandbytes, bitsandbytes-nf4, or bitsandbytes-fp4
    'MAX_BATCH_SIZE': json.dumps(1),
    'SM_SERVER_TIMEOUT': json.dumps(120),
    'SAGEMAKER_MODEL_SERVER_TIMEOUT': json.dumps(300),
    'SAGEMAKER_TS_RESPONSE_TIMEOUT': json.dumps(300),
}
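For completeness, the surrounding deploy call looks roughly like this (a sketch; the image version, instance type, role, and S3 path are placeholders for my actual values):

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# TGI container image (version is a placeholder for the one I actually use)
llm_image = get_huggingface_llm_image_uri("huggingface", version="1.3.3")

model = HuggingFaceModel(
    role=role,                      # my SageMaker execution role
    image_uri=llm_image,
    model_data=s3_model_uri,        # S3 path to my fine-tuned weights
    env=config,                     # the environment dict above
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # placeholder instance type
    container_startup_health_check_timeout=600,
    endpoint_name=endpoint_name,
)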
But I still can't figure out how to keep the response stream alive for longer than 60 seconds.
Any help would be very much appreciated.