InvokeEndpoint Error : Predict function Invocation Timeout

I am trying to use the AWS S3 option to load the Hugging Face transformer model GPT-NeoXT-Chat-Base-20B. The endpoint on SageMaker is created successfully.
predictor = huggingface_model.deploy(
ModelDataDownloadTimeoutInSeconds = 2400,
ContainerStartupHealthCheckTimeoutInSeconds = 2400,

When I call the endpoint, I get an invocation timeout. I assume the default is 1 minute. How can I increase the timeout, since a prediction might take longer than that?

Error while calling predict function:

predictor.predict({'inputs': "Can you please let us know more details about your "})
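Generation time grows with the number of new tokens, so one way to shorten a request is to cap the generation length in the payload. The following is a minimal sketch, assuming the container follows the Hugging Face inference toolkit convention of an optional `parameters` dict that is forwarded to the text-generation pipeline. The helper name `build_payload` and the value 32 are illustrative and not from the original post.

```python
import json


def build_payload(prompt, max_new_tokens=32):
    """Build a text-generation request that bounds the number of generated tokens."""
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be at least 1")
    return {
        "inputs": prompt,
        # Assumption: the serving container forwards "parameters" to the pipeline call.
        "parameters": {"max_new_tokens": max_new_tokens},
    }


payload = build_payload("Can you please let us know more details about your ")
print(json.dumps(payload, indent=2))
# With a deployed endpoint: predictor.predict(payload)
```

If a short generation still times out, the bottleneck is more likely model loading or hardware than output length.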

Error :
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/botocore/, in BaseClient._make_api_call(self, operation_name, api_params)
958 error_code = parsed_response.get("Error", {}).get("Code")
959 error_class = self.exceptions.from_code(error_code)
→ 960 raise error_class(parsed_response, operation_name)
961 else:
962 return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message “Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.”. See in account 597748488783 for more information.

Please suggest any pointers to proceed.

I'm having the same issue with all of the GPT models on AWS. BERT-based models work just fine.

Update: I was able to get this to run properly. I wonder if it has to do with the instance size assigned to the predictor. I was using the default ml.m5.xlarge at first and then tried larger machines. The instance type you see below finally allowed me to invoke the endpoint.

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,  # number of instances
    instance_type='ml.p3.2xlarge'  # ec2 instance type
)
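To sanity-check whether instance size could be the cause, a rough estimate of the memory needed for the weights alone can help. This is a back-of-the-envelope sketch, not a sizing tool: it assumes 2 bytes per parameter (fp16/bf16), ignores activations and the KV cache, and uses an illustrative 20% headroom. The function names are hypothetical.

```python
def weight_footprint_gib(num_params, bytes_per_param=2):
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


def fits(num_params, device_memory_gib, bytes_per_param=2, headroom=0.2):
    """True if the weights plus a fractional headroom fit in the given memory."""
    needed = weight_footprint_gib(num_params, bytes_per_param) * (1 + headroom)
    return needed <= device_memory_gib


for name, params in [("20B-parameter model", 20e9), ("7B-parameter model", 7e9)]:
    print(f"{name}: ~{weight_footprint_gib(params):.1f} GiB of weights in fp16")
```

Compare the result against the memory listed for the chosen instance type. If the weights do not fit, the model may load slowly, spill to CPU, or fail, any of which could surface as a timeout.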

I’m using an ml.g5.12xlarge for my LLM, which is the same size that runs other foundation LLMs of the same parameter count (i.e. the JumpStart versions), and I’m still getting the timeout error. So I don’t think it’s just the size of the instance.

(I’m using a fine-tuned falcon-7b model deployed to an ml.g5.12xlarge, which is the same size of compute I was able to fine-tune the model on.)

I’m guessing it might have to do with how inference is called. For instance, in this SageMaker Pipelines example, during “Define a Register Model Step…” we just pass an image to use for inference, set image_scope="inference", and give the Model object the model data (but no inference script itself). Later, during “Deploy latest approved model to a real-time endpoint”, we grab the Approved model_package_arn and deploy the model to an endpoint with model.deploy(initial_instance_count=x, instance_type="compute type", endpoint_name="endpoint name"), but I still never see where we tell the endpoint or the registered Model how to run inference. Is there a black-box evaluate/predict method that SageMaker defaults to which just doesn’t work yet for Hugging Face LLM-type models? Investigating…
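On the question of where inference is defined: as far as I know, the Hugging Face inference toolkit container falls back to a default handler built on a transformers pipeline when no script is supplied, and lets you override it by packaging a `code/inference.py` that defines `model_fn` and `predict_fn`. The sketch below assumes that container and a text-generation task. It is an illustration of the override hooks, not the handler the Pipelines example uses.

```python
# code/inference.py (packaged alongside the model artifacts)


def model_fn(model_dir):
    """Load the model once when the container starts."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import pipeline

    return pipeline("text-generation", model=model_dir)


def predict_fn(data, model):
    """Handle one request: 'inputs' is the prompt, 'parameters' are generation kwargs."""
    inputs = data["inputs"]
    parameters = data.get("parameters") or {}
    return model(inputs, **parameters)
```

Keeping `predict_fn` independent of how the model was loaded makes it easy to exercise locally with a stand-in callable before deploying.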