Hello, thanks for reading. I am having a lot of trouble deploying LLMs on SageMaker. I have been able to deploy the canned AWS foundation models, but whenever I try to use a model from the Hugging Face Hub I run into a similar error. Here is what I am running to deploy anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g:
from sagemaker.huggingface import HuggingFaceModel
import sagemaker
role = sagemaker.get_execution_role()
# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g',
    'HF_TASK': 'text-generation'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,          # number of instances
    instance_type='ml.g4dn.12xlarge'   # ec2 instance type
)
The endpoint deploys successfully, but when I query it I get the following error:
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
"code": 400,
"type": "InternalServerException",
"message": "\u0027llama\u0027"
}
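For reference, the call that triggers this is just a standard predict on the returned predictor (the prompt here is a placeholder, not my actual payload):

predictor.predict({"inputs": "Tell me a joke."})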
The CloudWatch logs don't provide any useful information that I can see. Any thoughts?
Not a fix, but an explanation: the transformers version shipped in the current SageMaker Hugging Face containers predates LLaMA support, so the container can't load LLaMA-based models at all. The \u0027llama\u0027 in the message is the model_type key it fails to recognize. I don't think there is a fix, at least not one I have been able to find.
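You can sanity-check this locally (a quick sketch; the auto-config registry only gained a "llama" entry in later transformers releases, which is why the lookup fails on the 4.17.0 container):

import transformers
from transformers import CONFIG_MAPPING

print(transformers.__version__)
# False on 4.17.0 (the version pinned in the deploy snippet above),
# True on releases that added LLaMA support
print("llama" in CONFIG_MAPPING)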
This will not work for gpt4-x-alpaca-13b-native-4bit-128g since it requires the GPTQ package. Therefore you need to create a custom inference.py script and add the latest transformers version + GPTQ via a requirements.txt, roughly as sketched below.
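Something along these lines (a sketch only: the auto-gptq loading call, the use_safetensors flag, and the assumption that the quantized checkpoint sits directly in model_dir are things you would have to adapt to the actual files in that repo):

code/requirements.txt:
transformers>=4.28.0
auto-gptq

code/inference.py:
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

def model_fn(model_dir):
    # model_dir is where SageMaker unpacks your model.tar.gz on the endpoint
    tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)
    model = AutoGPTQForCausalLM.from_quantized(
        model_dir,
        device="cuda:0",
        use_safetensors=True,  # assumption: the 4-bit weights are stored as .safetensors
    )
    return model, tokenizer

def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    inputs = tokenizer(data["inputs"], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # mimic the pipeline-style output the toolkit normally returns
    return [{"generated_text": text}]

model_fn and predict_fn are the standard override hooks the SageMaker Hugging Face inference toolkit looks for in code/inference.py, and requirements.txt is installed when the container starts, which is how you get a newer transformers than the 4.17.0 baked into the image.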
This didn't work for me; I got an error in the CloudWatch logs like "huggingface_hub.utils._errors.LocalEntryNotFoundError: File aining_args.safetensors of model ehartford/Wizard-Vicuna-13B-Uncensored not found in /tmp. Please run text-generation-server download-weights ehartford/Wizard-Vicuna-13B-Uncensored first."
I did get the model deployed in a hacky way, using a plain EC2 instance, FastAPI, llama.cpp, and nginx.
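In case it helps anyone, the core of it was just a small FastAPI app wrapping llama-cpp-python, run with uvicorn and proxied by nginx (a minimal sketch; the model path, route name, and parameters are placeholders, and the file is assumed to be a ggml/gguf conversion of the weights):

from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()
# placeholder path to the converted checkpoint on the instance
llm = Llama(model_path="/opt/models/model.bin", n_ctx=2048)

class GenerateRequest(BaseModel):
    inputs: str
    max_new_tokens: int = 256

@app.post("/generate")
def generate(req: GenerateRequest):
    out = llm(req.inputs, max_tokens=req.max_new_tokens)
    return {"generated_text": out["choices"][0]["text"]}

Run it with something like uvicorn server:app --host 127.0.0.1 --port 8000 and put nginx in front as a reverse proxy.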