CPU/Memory Utilization Too High When Running Inference on Falcon 40B Instruct

Hey everyone! I'm running into an issue while running inference on Falcon 40B Instruct through SageMaker. I'm trying to generate ~50K datapoints (based on 50K different prompts), but after every couple hundred requests the model errors out with:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Amazon SageMaker could not get a response from the huggingface-pytorch-tgi-inference-2023-06-13-*XYZ* endpoint. This can occur when CPU or memory utilization is high. To check your utilization, see Amazon CloudWatch. To fix this problem, use an instance type with more CPU capacity or memory.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/huggingface-pytorch-tgi-inference-2023-06-13-*XYZ* in account *NUMBER* for more information.
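For context, my generation loop is roughly the following (a simplified sketch; the endpoint name, prompts, generation parameters, and retry settings are placeholders, not my exact setup):

```python
import json
import time

import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

ENDPOINT_NAME = "<your-endpoint-name>"  # placeholder
prompts = ["..."]  # in the real run this is ~50K prompts

results = []
for prompt in prompts:
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 512, "do_sample": True},
    }
    # Retry a few times, since the endpoint intermittently returns a ModelError
    # while the backing instance is unhealthy or being replaced.
    for attempt in range(3):
        try:
            response = runtime.invoke_endpoint(
                EndpointName=ENDPOINT_NAME,
                ContentType="application/json",
                Body=json.dumps(payload),
            )
            results.append(json.loads(response["Body"].read()))
            break
        except runtime.exceptions.ModelError:
            time.sleep(30 * (attempt + 1))  # back off before retrying
```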

I have set up the model following Philipp Schmid’s guide: Deploy Falcon 7B & 40B on Amazon SageMaker (using the exact instance specified in the blog - ml.g5.12xlarge).
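The deployment itself is essentially what the blog describes; a trimmed-down sketch of what I ran (the execution role handling and the image version tag are placeholders):

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # assumes a SageMaker execution role is available

# Hugging Face TGI inference container; the version tag here is illustrative.
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.8.2")

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env={
        "HF_MODEL_ID": "tiiuae/falcon-40b-instruct",
        "SM_NUM_GPUS": "4",  # ml.g5.12xlarge has 4 GPUs, matching num_shard: Some(4) in the logs
    },
)

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=600,
)
```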

The CloudWatch logs don’t seem to provide much insight:

Args { model_id: "tiiuae/falcon-40b-instruct", revision: None, sharded: None, num_shard: Some(4), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/tmp"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: false }

Can anyone point me towards what might be happening, and what sort of instance I might have to get instead?

Hi @philschmid do you happen to have any ideas what might be going on?

I’ve noticed the instances behind the endpoint are pretty fragile. I’ve had the same model crash either from too many concurrent requests (even 2 will do it) or, after a decent number of sequential requests, the loaded model eventually OOMs. It takes about 10 minutes for SageMaker to launch a replacement instance. If you’re just sending sequential requests, I’d suggest bringing at least 2 instances online so that when one OOMs, the other can pick up requests.
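If you want to scale out an endpoint that is already running rather than redeploying, something along these lines should work (a sketch; the endpoint name is a placeholder, and "AllTraffic" assumes the default variant name created by the SageMaker Python SDK):

```python
import boto3

sm = boto3.client("sagemaker")

# Scale the production variant to 2 instances so one can keep serving
# while the other is being replaced after an OOM.
sm.update_endpoint_weights_and_capacities(
    EndpointName="<your-endpoint-name>",  # placeholder
    DesiredWeightsAndCapacities=[
        {"VariantName": "AllTraffic", "DesiredInstanceCount": 2}
    ],
)
```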

Regarding OOM, you can test for that and make sure you are setting the correct parameters. The container supports environment variables for defining the following limits:

  • MAX_CONCURRENT_REQUESTS (default 128)
  • MAX_INPUT_LENGTH (default 1000)
  • MAX_TOTAL_TOKENS (default 1512)
  • MAX_BATCH_SIZE (default none)

That way you can make sure that you are not running OOM by setting the correct boundaries.
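For example, you can pass them through the env of the HuggingFaceModel before deploying. A minimal sketch (the role, image version, and the specific limit values here are just placeholders, not recommendations):

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # an existing SageMaker execution role
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.8.2")  # illustrative tag

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env={
        "HF_MODEL_ID": "tiiuae/falcon-40b-instruct",
        "SM_NUM_GPUS": "4",
        # Tighter limits than the defaults to keep requests within memory;
        # the numbers below are examples only.
        "MAX_CONCURRENT_REQUESTS": "8",
        "MAX_INPUT_LENGTH": "1024",
        "MAX_TOTAL_TOKENS": "2048",
    },
)

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
)
```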

Hi @philschmid, how were you able to find the list of all the environment variables supported in the container?