CPU/Memory Utilization Too High When Running Inference on Falcon 40B Instruct

Hey everyone! I am running into an issue when running inference on Falcon 40B Instruct through SageMaker. I’m trying to generate ~50K datapoints (based on 50K different prompts), but after every couple hundred requests the endpoint errors out with:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Amazon SageMaker could not get a response from the huggingface-pytorch-tgi-inference-2023-06-13-*XYZ* endpoint. This can occur when CPU or memory utilization is high. To check your utilization, see Amazon CloudWatch. To fix this problem, use an instance type with more CPU capacity or memory.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/huggingface-pytorch-tgi-inference-2023-06-13-*XYZ* in account *NUMBER* for more information.

I have set up the model following Philipp Schmid’s guide: Deploy Falcon 7B & 40B on Amazon SageMaker (using the exact instance specified in the blog - ml.g5.12xlarge).

The CloudWatch logs don’t seem to provide much insight:

Args { model_id: "tiiuae/falcon-40b-instruct", revision: None, sharded: None, num_shard: Some(4), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/tmp"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: false }

Can anyone point me towards what might be happening, and what sort of instance I might have to get instead?

Hi @philschmid, do you happen to have any idea what might be going on?

I’ve noticed the instances behind the endpoint are pretty fragile. I’ve had the same model crash either from too many concurrent requests (even 2 will do it) or, after a decent number of back-to-back requests, from the loaded model eventually running out of memory (OOM). It takes about 10 minutes for SageMaker to launch a new instance. If you’re just sending requests one after another, I’d suggest bringing at least 2 instances online so that when one OOMs, the other can pick up some of the requests.
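If a second instance is not an option, a long sequential job can also be made to wait out a restart instead of dying on the first error. Here is a minimal sketch of that idea, assuming you wrap your own endpoint call in a function. The names `generate_with_retry` and `invoke` are hypothetical and not part of any SageMaker API, and the delays are only a guess sized to the roughly 10 minute relaunch mentioned above.

```python
import time
from typing import Callable


def generate_with_retry(
    invoke: Callable[[str], str],
    prompt: str,
    max_retries: int = 5,
    base_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call invoke(prompt), retrying with exponential backoff on failure.

    `invoke` is a hypothetical wrapper around whatever client call you use
    to hit the endpoint. With the defaults, the waits are 30, 60, 120, 240
    and 480 seconds (about 15 minutes in total).
    """
    for attempt in range(max_retries + 1):
        try:
            return invoke(prompt)
        except Exception:
            # Broad on purpose for the sketch; in practice you would likely
            # narrow this to the error type your client raises.
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
    # Only reached if max_retries is negative, so no attempt was made.
    raise RuntimeError("max_retries must be >= 0")
```

You would call it once per prompt in your loop, e.g. `generate_with_retry(my_invoke, prompt)`, where `my_invoke` is your own function that sends the prompt and returns the generated text.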

Regarding the OOM: you can test for that and make sure that you are setting the correct parameters. The container supports environment variables for defining the following:

  • MAX_INPUT_LENGTH (default 1000)
  • MAX_TOTAL_TOKENS (default 1512)
  • MAX_BATCH_SIZE (default none)

That way, you can avoid running out of memory by setting the correct boundaries.
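To make that concrete, here is a small sketch that builds the environment for the container from the three variables listed above. The helper `build_tgi_env` is hypothetical, and the check that the input length stays below the total token budget is an assumption about sensible values rather than something stated in this thread.

```python
from typing import Dict, Optional


def build_tgi_env(
    max_input_length: int,
    max_total_tokens: int,
    max_batch_size: Optional[int] = None,
) -> Dict[str, str]:
    """Return container environment variables as strings.

    Hypothetical helper: only the variable names come from the list above.
    """
    # Assumption: the total budget covers prompt plus generated tokens,
    # so the prompt limit has to leave room for at least one new token.
    if max_input_length >= max_total_tokens:
        raise ValueError("max_input_length must be smaller than max_total_tokens")

    env = {
        "MAX_INPUT_LENGTH": str(max_input_length),
        "MAX_TOTAL_TOKENS": str(max_total_tokens),
    }
    if max_batch_size is not None:
        env["MAX_BATCH_SIZE"] = str(max_batch_size)
    return env


# Values taken from the Args line in the CloudWatch log above.
env = build_tgi_env(max_input_length=1024, max_total_tokens=2048)
print(env)
```

The resulting dictionary would be merged into whatever `env` you already pass when creating the model in the SageMaker SDK. Lowering these values, or capping the batch size, trades throughput for memory headroom.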

Hi @philschmid, how were you able to find the list of all the environment variables supported in the container?