Inference error for FLAN-UL2 on AWS SageMaker

Following up on this thread: any ideas on this issue?

ModelError Traceback (most recent call last)
Cell In[12], line 17
7 parameters = {
8 "early_stopping": True,
9 "length_penalty": 2.0,
(…)
13 "no_repeat_ngram_size": 3,
14 }
16 # Run prediction
---> 17 predictor.predict({
18 "inputs": payload,
19 "parameters": parameters
20 })
21 # [{'generated_text': 'Peter stayed with Elizabeth at the hospital for 3 days.'}]

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/sagemaker/predictor.py:161, in Predictor.predict(self, data, initial_args, target_model, target_variant, inference_id)
131 """Return the inference from the specified endpoint.
132
133 Args:
(…)
155 as is.
156 """
158 request_args = self._create_request_args(
159 data, initial_args, target_model, target_variant, inference_id
160 )
--> 161 response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
162 return self._handle_response(response)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/botocore/client.py:530, in ClientCreator._create_api_method.<locals>._api_call(self, *args, **kwargs)
526 raise TypeError(
527 f"{py_operation_name}() only accepts keyword arguments."
528 )
529 # The "self" in this scope is referring to the BaseClient.
--> 530 return self._make_api_call(operation_name, kwargs)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/botocore/client.py:960, in BaseClient._make_api_call(self, operation_name, api_params)
958 error_code = parsed_response.get("Error", {}).get("Code")
959 error_class = self.exceptions.from_code(error_code)
--> 960 raise error_class(parsed_response, operation_name)
961 else:
962 return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llm-flan-ul2-20b-fp16-qh-2023-04-03 in account 197614225699 for more information.

Please see some of the log errors below. The client-side timeout looks like a secondary symptom: the backend took ~165 s to respond (beyond the invocation timeout) because the worker hit a CUDA out-of-memory error while loading the model:

1680533644222,"2023-04-03T14:54:02,158 [INFO ] pool-2-thread-3 ACCESS_LOG - /169.254.178.2:37372 ""GET /ping HTTP/1.1"" 200 1",AllTraffic/i-09e62a922fe9422b0
1680533644222,"2023-04-03T14:54:03,987 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Prediction error",AllTraffic/i-09e62a922fe9422b0
1680533644222,"2023-04-03T14:54:03,987 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):",AllTraffic/i-09e62a922fe9422b0
1680533644222,"2023-04-03T14:54:03,987 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py"", line 219, in handle",AllTraffic/i-09e62a922fe9422b0
1680533644222,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self.initialize(context)",AllTraffic/i-09e62a922fe9422b0
1680533644222,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py"", line 77, in initialize",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self.model = self.load(self.model_dir)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py"", line 104, in load",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - hf_pipeline = get_pipeline(task=os.environ[""HF_TASK""], model_dir=model_dir, device=self.device)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/transformers_utils.py"", line 272, in get_pipeline",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - hf_pipeline = pipeline(task=task, model=model_dir, device=device, **kwargs)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/transformers/pipelines/__init__.py"", line 903, in pipeline",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - return pipeline_class(model=model, framework=framework, task=task, **kwargs)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/transformers/pipelines/text2text_generation.py"", line 65, in __init__",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - super().__init__(*args, **kwargs)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/transformers/pipelines/base.py"", line 780, in __init__",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self.model = self.model.to(self.device)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py"", line 1749, in to",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - return super().to(*args, **kwargs)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py"", line 989, in to",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - return self._apply(convert)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py"", line 641, in _apply",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - module._apply(fn)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py"", line 641, in _apply",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - module._apply(fn)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py"", line 641, in _apply",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - module._apply(fn)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [Previous line repeated 4 more times]",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py"", line 664, in _apply",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - param_applied = fn(param)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py"", line 987, in convert",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 1; 22.20 GiB total capacity; 21.24 GiB already allocated; 222.12 MiB free; 21.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - During handling of the above exception, another exception occurred:",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/mms/service.py"", line 108, in predict",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ret = self._entry_point(input_batch, self.context)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py"", line 243, in handle",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,991 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - raise PredictionException(str(e), 400)",AllTraffic/i-09e62a922fe9422b0
1680533645477,"2023-04-03T14:54:03,991 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - mms.service.PredictionException: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 1; 22.20 GiB total capacity; 21.24 GiB already allocated; 222.12 MiB free; 21.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF : 400",AllTraffic/i-09e62a922fe9422b0
1680533645477,"2023-04-03T14:54:05,401 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 165228",AllTraffic/i-09e62a922fe9422b0
1680533647231,"2023-04-03T14:54:05,402 [INFO ] W-9000-model ACCESS_LOG - /169.254.178.2:53412 ""POST /invocations HTTP/1.1"" 400 165231",AllTraffic/i-09e62a922fe9422b0
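
From these logs, the underlying problem is not the timeout itself: the toolkit's default loading path calls self.model.to(self.device), which tries to move the full set of fp16 weights (roughly 40 GB for a 20B-parameter model) onto a single ~22 GiB GPU. For reference, here is a minimal sketch of a custom code/inference.py that would instead shard the checkpoint across all GPUs. This is only a sketch under assumptions, not a verified fix: it assumes accelerate is installed in the container so device_map="auto" works, and model_fn/predict_fn are the SageMaker Hugging Face inference toolkit's standard override hooks; the generate handling is illustrative.

```python
# code/inference.py: a sketch, not a verified fix.
# Assumes `accelerate` is installed so device_map="auto" can shard the
# fp16 weights across all GPUs instead of placing them on one 22 GiB card.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def model_fn(model_dir):
    # Load tokenizer and model from the unpacked SageMaker model artifact.
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(
        model_dir,
        torch_dtype=torch.float16,  # fp16, matching the deployed endpoint
        device_map="auto",          # let accelerate spread layers over GPUs
    )
    return model, tokenizer


def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    # Place inputs on the model's first device; accelerate moves activations
    # between shards during the forward pass.
    inputs = tokenizer(data["inputs"], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, **data.get("parameters", {}))
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return [{"generated_text": text}]
```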

see Inference failed for FLAN-UL2(20B) on SageMaker - #4 by philschmid