Inference failed for FLAN-UL2(20B) on SageMaker

I am using this blog to deploy FLAN-UL2 on SageMaker (Deploy FLAN-UL2 20B on Amazon SageMaker). Everything works fine except the last two steps, which run inference. During inference (predictor.predict) I get the error below:

ValueError: ("You need to define one of the following ['audio-classification', 'automatic-speech-recognition', 'feature-extraction', 'text-classification', 'token-classification', 'question-answering', 'table-question-answering', 'visual-question-answering', 'document-question-answering', 'fill-mask', 'summarization', 'translation', 'text2text-generation', 'text-generation', 'zero-shot-classification', 'zero-shot-image-classification', 'conversational', 'image-classification', 'image-segmentation', 'image-to-text', 'object-detection', 'zero-shot-object-detection', 'depth-estimation', 'video-classification'] as env 'HF_TASK'.", 403)

Did anyone manage to make it work on a SageMaker endpoint? Could you please recommend a way to fix the error…
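For context, this error means the inference toolkit could not determine which pipeline to build for the model, which typically happens when the code/ directory (with inference.py) is missing from the archive. One workaround, sketched below under the assumption that no custom inference script is used, is to pass the task explicitly through the env argument of HuggingFaceModel:

```python
from sagemaker.huggingface.model import HuggingFaceModel

# Sketch: tell the toolkit explicitly which pipeline to build.
# s3_location and role are the variables from the blog notebook.
huggingface_model = HuggingFaceModel(
    model_data=s3_location,
    role=role,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    env={"HF_TASK": "text2text-generation"},  # assumed task for FLAN-UL2
)
```

Note that for a 20B model this alone may still hit the memory issue discussed further down the thread.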


Have you made any modifications? Can you please share the whole logs? It seems that your model.tar.gz didn't get built correctly. Where are you running the code?
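One way to sanity-check the archive before uploading is to list its contents; a small sketch (the expected layout here is inferred from the blog, so treat it as an assumption):

```python
import tarfile

# List the archive contents: for this setup you would expect the model
# weights and config at the top level, plus code/inference.py and
# code/requirements.txt.
with tarfile.open("model.tar.gz", "r:gz") as tar:
    for name in tar.getnames():
        print(name)
```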

Following up on this thread: any idea on this issue?

ModelError                                Traceback (most recent call last)
Cell In[12], line 17
      7 parameters = {
      8     "early_stopping": True,
      9     "length_penalty": 2.0,
   (...)
     13     "no_repeat_ngram_size": 3,
     14 }
     16 # Run prediction
---> 17 predictor.predict({
     18     "inputs": payload,
     19     "parameters": parameters
     20 })
     21 # [{'generated_text': 'Peter stayed with Elizabeth at the hospital for 3 days.'}]

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/sagemaker/predictor.py:161, in Predictor.predict(self, data, initial_args, target_model, target_variant, inference_id)
    131 """Return the inference from the specified endpoint.
    132
    133 Args:
   (...)
    155     as is.
    156 """
    158 request_args = self._create_request_args(
    159     data, initial_args, target_model, target_variant, inference_id
    160 )
--> 161 response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
    162 return self._handle_response(response)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/botocore/client.py:530, in ClientCreator._create_api_method.<locals>._api_call(self, *args, **kwargs)
    526     raise TypeError(
    527         f"{py_operation_name}() only accepts keyword arguments."
    528     )
    529 # The "self" in this scope is referring to the BaseClient.
--> 530 return self._make_api_call(operation_name, kwargs)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/botocore/client.py:960, in BaseClient._make_api_call(self, operation_name, api_params)
    958     error_code = parsed_response.get("Error", {}).get("Code")
    959     error_class = self.exceptions.from_code(error_code)
--> 960     raise error_class(parsed_response, operation_name)
    961 else:
    962     return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llm-flan-ul2-20b-fp16-qh-2023-04-03 in account 197614225699 for more information.

Please see some of the log errors:

1680533644222,"2023-04-03T14:54:02,158 [INFO ] pool-2-thread-3 ACCESS_LOG - /169.254.178.2:37372 ""GET /ping HTTP/1.1"" 200 1",AllTraffic/i-09e62a922fe9422b0
1680533644222,"2023-04-03T14:54:03,987 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Prediction error",AllTraffic/i-09e62a922fe9422b0
1680533644222,"2023-04-03T14:54:03,987 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):",AllTraffic/i-09e62a922fe9422b0
1680533644222,"2023-04-03T14:54:03,987 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py"", line 219, in handle",AllTraffic/i-09e62a922fe9422b0
1680533644222,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self.initialize(context)",AllTraffic/i-09e62a922fe9422b0
1680533644222,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py"", line 77, in initialize",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self.model = self.load(self.model_dir)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py"", line 104, in load",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - hf_pipeline = get_pipeline(task=os.environ[""HF_TASK""], model_dir=model_dir, device=self.device)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/transformers_utils.py"", line 272, in get_pipeline",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - hf_pipeline = pipeline(task=task, model=model_dir, device=device, **kwargs)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/transformers/pipelines/__init__.py"", line 903, in pipeline",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - return pipeline_class(model=model, framework=framework, task=task, **kwargs)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,988 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/transformers/pipelines/text2text_generation.py"", line 65, in __init__",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - super().__init__(*args, **kwargs)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/transformers/pipelines/base.py"", line 780, in __init__",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self.model = self.model.to(self.device)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py"", line 1749, in to",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - return super().to(*args, **kwargs)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py"", line 989, in to",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - return self._apply(convert)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py"", line 641, in _apply",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - module._apply(fn)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py"", line 641, in _apply",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - module._apply(fn)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py"", line 641, in _apply",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - module._apply(fn)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,989 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [Previous line repeated 4 more times]",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py"", line 664, in _apply",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - param_applied = fn(param)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py"", line 987, in convert",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 1; 22.20 GiB total capacity; 21.24 GiB already allocated; 222.12 MiB free; 21.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - During handling of the above exception, another exception occurred:",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/mms/service.py"", line 108, in predict",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ret = self._entry_point(input_batch, self.context)",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,990 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File ""/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py"", line 243, in handle",AllTraffic/i-09e62a922fe9422b0
1680533644223,"2023-04-03T14:54:03,991 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - raise PredictionException(str(e), 400)",AllTraffic/i-09e62a922fe9422b0
1680533645477,"2023-04-03T14:54:03,991 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - mms.service.PredictionException: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 1; 22.20 GiB total capacity; 21.24 GiB already allocated; 222.12 MiB free; 21.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF : 400",AllTraffic/i-09e62a922fe9422b0
1680533645477,"2023-04-03T14:54:05,401 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 165228",AllTraffic/i-09e62a922fe9422b0
1680533647231,"2023-04-03T14:54:05,402 [INFO ] W-9000-model ACCESS_LOG - /169.254.178.2:53412 ""POST /invocations HTTP/1.1"" 400 165231",AllTraffic/i-09e62a922fe9422b0

@Jorgeutd could you share how you deployed the model, and with which versions?

@philschmid, I took another look at the model tar file, and there was an issue with how I was using the pigz library. After fixing it, the inference in the notebook ran correctly. I was also able to use boto3 to invoke the endpoint. Superb!
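For reference, invoking the deployed endpoint with boto3 looks roughly like this; a minimal sketch where the endpoint name (taken from the logs above) and the payload values are placeholders:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Send the same JSON contract the SageMaker predictor uses.
response = runtime.invoke_endpoint(
    EndpointName="llm-flan-ul2-20b-fp16-qh-2023-04-03",  # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps({
        "inputs": "Summarize: Peter and Elizabeth took a taxi ...",
        "parameters": {"max_new_tokens": 50},
    }),
)
print(json.loads(response["Body"].read()))
```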

Hey @philschmid, I followed step by step your notebook; here is the deployment code:

from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=s3_location,        # path to your model and script
    role=role,                     # IAM role with permissions to create an endpoint
    transformers_version="4.26",   # transformers version used
    pytorch_version="1.13",        # pytorch version used
    py_version="py39",             # python version used
    model_server_workers=1,
    name=model_name
)

# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=600,  # increase timeout for large models
    model_data_download_timeout=600,             # increase timeout for large models
)

Here is the requirements file:

%%writefile code/requirements.txt
accelerate==0.16.0
transformers==4.26.0
bitsandbytes==0.37.0

@philschmid I redeployed the model with the new changes in the notebook and now it is working fine.

%%writefile code/requirements.txt
accelerate==0.15.0
transformers==4.27.2

I also changed this part of the code:

from pathlib import Path
from distutils.dir_util import copy_tree  # needed for the copy_tree call below
import os

# set HF_HUB_ENABLE_HF_TRANSFER env var to enable hf-transfer for faster downloads
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

HF_MODEL_ID = "google/flan-ul2"

# create a directory for the model in the specified directory
base_dir = Path("/home/ec2-user/SageMaker/")
model_tar_dir = base_dir.joinpath("model_directory")
model_tar_dir.mkdir(exist_ok=True)

# set cache directory to be in the same filesystem as model_tar_dir
cache_dir = base_dir.joinpath(".cache")
cache_dir.mkdir(exist_ok=True)

# download model from Hugging Face into model_tar_dir
snapshot_download(HF_MODEL_ID, local_dir=str(model_tar_dir), local_dir_use_symlinks=False, cache_dir=str(cache_dir))

# copy code/ to model dir
copy_tree("code/", str(model_tar_dir.joinpath("code")))
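Regarding the pigz issue mentioned above: the final step packs this directory into model.tar.gz. A minimal sketch of that step, assuming pigz is installed; the exact command in the blog may differ, but the archive must have the model files at its top level, not nested inside a parent folder:

```python
import subprocess

# Pack the model directory into model.tar.gz, using pigz for parallel gzip.
# Running tar from inside model_tar_dir keeps the files at the archive root,
# which is the layout SageMaker expects.
subprocess.run(
    "tar --use-compress-program=pigz -cf /home/ec2-user/SageMaker/model.tar.gz .",
    shell=True,
    cwd=str(model_tar_dir),
    check=True,
)
```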

Is there an end-to-end example of how to fine-tune a model like this?

Thanks,

Jorge