I am trying to deploy the full-precision Mixtral 8x7B on AWS SageMaker using the sagemaker-huggingface-inference-toolkit. I downloaded the model from the Hub and compressed the model files into a model.tar.gz archive, with all files residing directly at the root of the archive. I want to use the default inference code, i.e. no custom inference.py.
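For context, I built the archive roughly like this (a sketch; local_dir is a placeholder for wherever I downloaded the model snapshot):

import os
import tarfile

# local_dir is a placeholder for the directory holding the downloaded
# snapshot (config.json, tokenizer files, *.safetensors shards, ...)
local_dir = "Mixtral-8x7B-Instruct-v0.1"

with tarfile.open("model.tar.gz", "w:gz") as tar:
    for name in os.listdir(local_dir):
        # arcname=name places every file at the root of the archive,
        # not under a Mixtral-8x7B-Instruct-v0.1/ prefix
        tar.add(os.path.join(local_dir, name), arcname=name)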
I use the following code:
import json
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.Session(default_bucket="some-bucket")

instance_type = "ml.g5.48xlarge"
number_of_gpu = 8
health_check_timeout = 300

s3_model_uri = "s3://some_prefix_for_model/mistralai/Mixtral-8x7B-Instruct-v0.1/model.tar.gz"

llm_image = "763104351884.dkr.ecr.eu-central-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.1-tgi2.0.1-gpu-py310-cu121-ubuntu22.04"

config = {
    "HF_MODEL_ID": "/opt/ml/model",  # load from the dir where SageMaker extracts model.tar.gz
    "SM_NUM_GPUS": json.dumps(number_of_gpu),  # shard the model across all 8 GPUs
    "MAX_INPUT_LENGTH": json.dumps(24000),  # max tokens in the input prompt
    "MAX_BATCH_PREFILL_TOKENS": json.dumps(32000),  # token budget per prefill batch
    "MAX_TOTAL_TOKENS": json.dumps(32000),  # max input + generated tokens per request
    "MAX_BATCH_TOTAL_TOKENS": json.dumps(512000),  # token budget across the whole batch
}

model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config,
    model_data=s3_model_uri,
)

llm = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)
The error I am getting is:
Error hosting endpoint huggingface-pytorch-tgi-inference-2024-05-24-15-14-56-313: Failed. Reason: Request to service failed. If failure persists after retry, contact customer support.. Try changing the instance type or reference the troubleshooting page https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference-troubleshooting.html
Unfortunately, no CloudWatch logs are written, so I don't know exactly what is going wrong. When I remove model_data and load the model directly from the Hub, it works. This is why I assume that downloading and decompressing the large archive might take too long for the inference container to start. However, I do not know whether HuggingFaceModel supports already decompressed model files, a question that was already asked here but not answered.
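What I would try next is pointing model_data at an uncompressed S3 prefix instead of a tarball. From what I can tell, newer versions of the sagemaker SDK accept a dict for model_data for this purpose, something like the sketch below, but I have not been able to confirm that this works with the HuggingFace TGI container:

model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config,
    # Reference uncompressed model artifacts: an S3 prefix containing the raw
    # model files rather than a model.tar.gz (note the trailing slash)
    model_data={
        "S3DataSource": {
            "S3Uri": "s3://some_prefix_for_model/mistralai/Mixtral-8x7B-Instruct-v0.1/",
            "S3DataType": "S3Prefix",
            "CompressionType": "None",
        }
    },
)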
Thanks for any advice on how to move on here!