ModelError when I run predict after deploying WizardCoder for text-generation

Here’s my code:

import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # IAM role with permissions to create an endpoint
env = {'HF_TASK': 'text-generation'}

# create Hugging Face Model class
huggingface_model = HuggingFaceModel(
   model_data="s3://generative-ai/WizardCoder/model.tar.gz",  # path to your trained SageMaker model
   role=role,
   transformers_version="4.26",                               # Transformers version used
   env=env,
   pytorch_version="1.13",                                    # PyTorch version used
   py_version="py39",                                         # Python version used
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.g5.2xlarge"
)

# send request
predictor.predict({
    "inputs": "Write a python program to reverse a string",
})

I am getting the error below:


ModelError Traceback (most recent call last)
Cell In[24], line 2
1 # send request
----> 2 predictor.predict({
3 "inputs": "Write a python program to reverse a string",
4 })

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/sagemaker/base_predictor.py:185, in Predictor.predict(self, data, initial_args, target_model, target_variant, inference_id, custom_attributes)
138 """Return the inference from the specified endpoint.
139
140 Args:
(…)
174 as is.
175 """
177 request_args = self._create_request_args(
178 data,
179 initial_args,
(…)
183 custom_attributes,
184 )
--> 185 response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
186 return self._handle_response(response)

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/botocore/client.py:535, in ClientCreator._create_api_method.<locals>._api_call(self, *args, **kwargs)
531 raise TypeError(
532 f"{py_operation_name}() only accepts keyword arguments."
533 )
534 # The "self" in this scope is referring to the BaseClient.
--> 535 return self._make_api_call(operation_name, kwargs)

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/botocore/client.py:980, in BaseClient._make_api_call(self, operation_name, api_params)
978 error_code = parsed_response.get("Error", {}).get("Code")
979 error_class = self.exceptions.from_code(error_code)
--> 980 raise error_class(parsed_response, operation_name)
981 else:
982 return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
"code": 400,
"type": "InternalServerException",
"message": "\u0027llama\u0027"
}
". See https://ap-south-1.console.aws.amazon.com/cloudwatch/home?region=ap-south-1#logEventViewer:group=/aws/sagemaker/Endpoints/huggingface-pytorch-inference-2023-09-22-11-46-53-545 in account 346264802683 for more information.

Below are the CloudWatch logs:

Warning: MMS is using non-default JVM parameters: -XX:-UseContainerSupport
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-09-22T16:51:20,034 [INFO ] main com.amazonaws.ml.mms.ModelServer -
2023-09-22T16:51:20,038 [INFO ] main com.amazonaws.ml.mms.ModelServer - Loading initial models: /opt/ml/model preload_model: false
2023-09-22T16:51:20,067 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-9000-model
2023-09-22T16:51:20,124 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model_service_worker started with args: --sock-type unix --sock-name /home/model-server/tmp/.mms.sock.9000 --handler sagemaker_huggingface_inference_toolkit.handler_service --model-path /opt/ml/model --model-name model --preload-model false --tmp-dir /home/model-server/tmp
2023-09-22T16:51:20,126 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Listening on port: /home/model-server/tmp/.mms.sock.9000
2023-09-22T16:51:20,126 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [PID] 54
2023-09-22T16:51:20,127 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.9.13
2023-09-22T16:51:20,128 [INFO ] main com.amazonaws.ml.mms.wlm.ModelManager - Model model loaded.
2023-09-22T16:51:20,132 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2023-09-22T16:51:20,139 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000


2023-09-22T16:51:20,186 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
Model server started.
2023-09-22T16:51:20,272 [WARN ] pool-3-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
2023-09-22T16:51:22,461 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 'llama'
2023-09-22T16:51:22,464 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self.model = self.load(self.model_dir)
2023-09-22T16:51:22,465 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - hf_pipeline = get_pipeline(task=os.environ["HF_TASK"], model_dir=model_dir, device=self.device)
2023-09-22T16:51:22,467 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - hf_pipeline = pipeline(task=task, model=model_dir, device=device, **kwargs)


2023-09-22T16:51:22,470 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - KeyError: 'llama'
2023-09-22T16:51:22,473 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-model-1
2023-09-22T16:52:29,088 [INFO ] pool-2-thread-3 ACCESS_LOG - /169.254.178.2:42492 "GET /ping HTTP/1.1" 200 0
2023-09-22T16:52:32,706 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 2


2023-09-22T16:52:32,706 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Prediction error

2023-09-22T16:52:32,711 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 579, in __getitem__

2023-09-22T16:52:32,711 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - KeyError: 'llama'

2023-09-22T16:52:32,711 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - During handling of the above exception, another exception occurred:

2023-09-22T16:52:32,712 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.9/site-packages/mms/service.py", line 108, in predict

2023-09-22T16:52:32,713 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - raise PredictionException(str(e), 400)

2023-09-22T16:52:32,713 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - mms.service.PredictionException: 'llama' : 400

@philschmid @marshmellow77 Please help. Thanks!!

The root cause is that the inference container for transformers_version="4.26" does not know the llama model type (Llama support only landed in Transformers 4.28), so AutoConfig raises KeyError: 'llama' when the toolkit tries to build the text-generation pipeline. You should use the Hugging Face LLM (TGI) container to deploy your model instead, see: Securely deploy LLMs inside VPCs with Hugging Face and Amazon SageMaker
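A minimal sketch of that approach, reusing your existing model.tar.gz on S3 and the ml.g5.2xlarge instance; the image version and the TGI limits below are assumptions you should adjust for WizardCoder and the current container releases:

import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# retrieve the Hugging Face LLM (TGI) container image
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.9.3")  # version is an assumption, use a current one

# TGI configuration; /opt/ml/model is where SageMaker extracts model_data inside the container
config = {
    "HF_MODEL_ID": "/opt/ml/model",
    "SM_NUM_GPUS": json.dumps(1),          # number of GPUs on ml.g5.2xlarge
    "MAX_INPUT_LENGTH": json.dumps(1024),  # assumed limits, tune for your prompts
    "MAX_TOTAL_TOKENS": json.dumps(2048),
}

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    model_data="s3://generative-ai/WizardCoder/model.tar.gz",
    env=config,
)

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,  # give TGI time to load the weights
)

llm.predict({"inputs": "Write a python program to reverse a string"})

The TGI container ships its own Llama support, so it does not depend on the Transformers version baked into the older inference image.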