Deploying Mixtral 8x7B on AWS SageMaker from S3

I am trying to deploy the full-precision Mixtral 8x7B on AWS SageMaker using the sagemaker-huggingface-inference-toolkit. I downloaded the model from the Hub and compressed the model files into a model.tar.gz archive, with all files residing directly in the root of the archive. I want to use the default inference code, without a custom inference.py.
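For reference, a minimal sketch of how such an archive can be built (the local directory is a placeholder, and fetching the weights via snapshot_download is just one option):

import tarfile
from pathlib import Path
from huggingface_hub import snapshot_download

# Download the weights to a local directory (placeholder path).
local_dir = snapshot_download(
  repo_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
  local_dir="./Mixtral-8x7B-Instruct-v0.1",
)

# Pack every file directly into the root of model.tar.gz (no parent folder),
# then upload the archive to the S3 prefix used below.
with tarfile.open("model.tar.gz", "w:gz") as tar:
  for item in Path(local_dir).iterdir():
    if item.name.startswith("."):
      continue  # skip hidden cache/metadata entries
    tar.add(str(item), arcname=item.name)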

I use the following code:

import json
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.Session(default_bucket="some-bucket")

instance_type = "ml.g5.48xlarge"
number_of_gpu = 8
health_check_timeout = 300
s3_model_uri = f"s3://some_prefix_for_model/mistralai/Mixtral-8x7B-Instruct-v0.1/model.tar.gz"
llm_image = '763104351884.dkr.ecr.eu-central-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.1-tgi2.0.1-gpu-py310-cu121-ubuntu22.04'

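# HF_MODEL_ID="/opt/ml/model" points the TGI container at the weights SageMaker
# extracts from model.tar.gz instead of pulling them from the Hub; SM_NUM_GPUS
# shards the model across all 8 GPUs; the MAX_* values are TGI token limits.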
config = {
  'HF_MODEL_ID': "/opt/ml/model",
  'SM_NUM_GPUS': json.dumps(number_of_gpu),
  'MAX_INPUT_LENGTH': json.dumps(24000),
  'MAX_BATCH_PREFILL_TOKENS': json.dumps(32000),
  'MAX_TOTAL_TOKENS': json.dumps(32000),
  'MAX_BATCH_TOTAL_TOKENS': json.dumps(512000),
}

model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config,
  model_data=s3_model_uri,
)

llm = model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
)

The error I am getting is:

Error hosting endpoint huggingface-pytorch-tgi-inference-2024-05-24-15-14-56-313: Failed. Reason: Request to service failed. If failure persists after retry, contact customer support.. Try changing the instance type or reference the troubleshooting page https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference-troubleshooting.html

Unfortunately, no CloudWatch logs are written, so I don't know exactly what is going wrong. When I remove the model_data argument and load the model directly from the Hub, it works. This is why I assume that decompressing the archive might take too long for the inference container to start. However, I do not know whether HuggingFaceModel supports already decompressed model files, a question that was already asked here but never answered.
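(For what it is worth, newer versions of the SageMaker Python SDK appear to accept uncompressed artifacts by passing model_data as an S3DataSource dict instead of a .tar.gz URI. I have not verified this with the TGI container, so the following is only a sketch, reusing my placeholder prefix:)

model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config,
  model_data={
    "S3DataSource": {
      "S3Uri": "s3://some_prefix_for_model/mistralai/Mixtral-8x7B-Instruct-v0.1/",  # prefix holding the extracted files, must end with "/"
      "S3DataType": "S3Prefix",
      "CompressionType": "None",
    }
  },
)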

Thanks for any advice on how to move on here! 🙂

Update: Using local model files works fine with the smaller, recently released Mistral 7B v0.3 on ml.g5.{12x,24x,48x}large machines. This strengthens my belief that it has something to do with the size of the model. I would be grateful for any advice!

I was able to fix it by passing

model_data_download_timeout=3600

as an argument to model.deploy(…)
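For completeness, the full deploy call then looks like this:

llm = model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
  model_data_download_timeout=3600,  # allow up to an hour for SageMaker to download and extract the large model.tar.gz
)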