Deploying Mixtral 8x7B on AWS SageMaker from S3

I am trying to deploy the full-precision Mixtral 8x7B on AWS SageMaker using the sagemaker-huggingface-inference-toolkit. I downloaded the model from the Hub and compressed the model files into a model.tar.gz, with all files residing directly in the root of the archive. I want to use the default code for inference and no custom inference script.
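For reference, the archive layout described above (all files at the root, no wrapping directory) can be produced like this; `src_dir` is a hypothetical local snapshot path, adjust it to wherever the model was downloaded:

```python
import os
import tarfile

def make_model_archive(src_dir: str, out_path: str = "model.tar.gz") -> list[str]:
    """Pack every file in src_dir at the root of a tar.gz (no wrapping directory),
    which is the layout the SageMaker Hugging Face containers expect."""
    with tarfile.open(out_path, "w:gz") as tar:
        for name in sorted(os.listdir(src_dir)):
            tar.add(os.path.join(src_dir, name), arcname=name)
    # Sanity check: entries should have no directory prefix.
    with tarfile.open(out_path, "r:gz") as tar:
        return tar.getnames()
```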

I use the following code:

import json
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.Session(default_bucket="some-bucket")

instance_type = "ml.g5.48xlarge"
number_of_gpu = 8
health_check_timeout = 300
s3_model_uri = f"s3://some_prefix_for_model/mistralai/Mixtral-8x7B-Instruct-v0.1/model.tar.gz"
llm_image = ''

config = {
  'HF_MODEL_ID': "/opt/ml/model",
  'SM_NUM_GPUS': json.dumps(number_of_gpu),
  'MAX_INPUT_LENGTH': json.dumps(24000),
  'MAX_BATCH_PREFILL_TOKENS': json.dumps(32000),
  'MAX_TOTAL_TOKENS': json.dumps(32000),
  'MAX_BATCH_TOTAL_TOKENS': json.dumps(512000),
}

model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  model_data=s3_model_uri,
  env=config,
)

llm = model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
)

The error I am getting is:

Error hosting endpoint huggingface-pytorch-tgi-inference-2024-05-24-15-14-56-313: Failed. Reason: Request to service failed. If failure persists after retry, contact customer support.. Try changing the instance type or reference the troubleshooting page

Unfortunately, no CloudWatch logs are written, so I don't know exactly what's going wrong. When I remove the model_data argument and load the model directly from the Hub, it works. This is why I am assuming that the decompression might be taking too long for the inference container to start. However, I do not know whether HuggingFaceModel supports already-decompressed model files, a question that was already asked here but not answered.
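I have not verified this with Mixtral, but recent versions of the sagemaker SDK reportedly accept a dict for model_data pointing at an uncompressed S3 prefix, which would skip the tar.gz extraction entirely. The S3Uri below is a placeholder based on the path in the question:

```python
# Sketch (untested here): point model_data at an uncompressed S3 prefix instead
# of a tar.gz. Requires a recent sagemaker SDK; the trailing slash on S3Uri matters.
model_data = {
    "S3DataSource": {
        "S3Uri": "s3://some_prefix_for_model/mistralai/Mixtral-8x7B-Instruct-v0.1/",
        "S3DataType": "S3Prefix",
        "CompressionType": "None",
    }
}
```

This dict would then be passed as the model_data argument to HuggingFaceModel instead of the tar.gz URI.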

Thanks for any advice on how to move on here! :slight_smile:

Update: Using local versions works fine with the smaller, recently released Mistral 7B v3 on ml.g5.{12x,24x,48x}large machines. This strengthens my belief that it has something to do with the size of the model. I would be grateful for any advice!

I was able to fix it by passing


as an argument to model.deploy(…)