Hey everyone! I'm running into an issue while running inference on Falcon 40B Instruct through SageMaker. I'm trying to generate ~50K datapoints (from 50K different prompts), but every couple hundred requests the endpoint errors out with:
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Amazon SageMaker could not get a response from the huggingface-pytorch-tgi-inference-2023-06-13-*XYZ* endpoint. This can occur when CPU or memory utilization is high. To check your utilization, see Amazon CloudWatch. To fix this problem, use an instance type with more CPU capacity or memory.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/huggingface-pytorch-tgi-inference-2023-06-13-*XYZ* in account *NUMBER* for more information.
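For reference, my generation loop is essentially the following (a simplified sketch; the endpoint name, prompt list, and generation parameters here are placeholders rather than my exact script):

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")
endpoint_name = "huggingface-pytorch-tgi-inference-2023-06-13-XYZ"  # placeholder

results = []
for prompt in prompts:  # ~50K prompts
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 512, "do_sample": True, "temperature": 0.7},
    }
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    # TGI returns a list with one dict containing "generated_text"
    results.append(json.loads(response["Body"].read())[0]["generated_text"])
```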
I set up the model following Philipp Schmid's guide, Deploy Falcon 7B & 40B on Amazon SageMaker, using the exact instance type specified in the blog post (ml.g5.12xlarge).
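The deployment code is essentially the snippet from that guide (reproduced from memory, so the container version and timeout values may not match exactly):

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Hugging Face TGI container, as used in the blog post
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.8.2")

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env={
        "HF_MODEL_ID": "tiiuae/falcon-40b-instruct",
        "SM_NUM_GPUS": "4",            # shard across the 4 GPUs of ml.g5.12xlarge
        "MAX_INPUT_LENGTH": "1024",
        "MAX_TOTAL_TOKENS": "2048",
    },
)

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=600,
)
```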
The CloudWatch logs don’t seem to provide much insight:
Args { model_id: "tiiuae/falcon-40b-instruct", revision: None, sharded: None, num_shard: Some(4), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/tmp"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: false }
Can anyone point me toward what might be going wrong, and what instance type I might need instead?