Failed. Reason: Please make sure all images included in the model for the production variant AllTraffic exist, and that the execution role used to create the model has permissions to access them

Hey!

I've been experiencing this error and have tried diagnosing it, but I'm not sure where the problem might be.

UnexpectedStatusException: Error hosting endpoint summarization-endpoint: Failed. Reason:  Please make sure all images included in the model for the production variant AllTraffic exist, and that the execution role used to create the model has permissions to access them..

I initially thought it had to do with IAM permissions, but I am no longer sure that is the case, since I added all permissions that might be relevant and I don't think it's an issue of the resource not being assigned to the right policy. The model is being created, but the endpoint is not being processed correctly. I also considered whether my model.tar.gz was corrupted, but even when I tried deploying a model directly from the Hugging Face Hub I was met with this error message. Oddly, no CloudWatch logs are being saved either, despite the CloudWatch log group being created for /aws/sagemaker/Endpoints/summarization-endpoint and having all relevant permissions.
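As one sanity check (an illustrative sketch I ran, using the model_data URI from my script), the S3 URI can be split into bucket and key to confirm it actually falls under the S3 resources my policy grants (arn:aws:s3:::qfn-transcription/*):

```python
from urllib.parse import urlparse

# Split the model artifact URI into bucket and key so it can be compared
# against the S3 resource ARNs granted in the IAM policy.
model_data = "s3://qfn-transcription/ujjawal_files/model.tar.gz"
parsed = urlparse(model_data)
bucket = parsed.netloc
key = parsed.path.lstrip("/")

print(bucket)  # qfn-transcription
print(key)     # ujjawal_files/model.tar.gz
```

The bucket matches the policy's resource, so the artifact itself should be reachable by the role.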

The script is below:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import BytesDeserializer

model_name = 'summarization-model'
endpoint_name = 'summarization-endpoint'

role = sagemaker.get_execution_role()


# Hub Model configuration. https://huggingface.co/models
# hub = {
# 	'HF_MODEL_ID':'google/pegasus-large',
# 	'HF_TASK':'summarization'
# }

# # create Hugging Face Model Class
# huggingface_model = HuggingFaceModel(
# 	transformers_version='4.6.1',
# 	pytorch_version='1.7.1',
# 	py_version='py36',
# 	env=hub,
# 	role=role, 
# )

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data="s3://qfn-transcription/ujjawal_files/model.tar.gz",  # path to your trained sagemaker model
    role=role,  # iam role with permissions to create an Endpoint
    transformers_version="4.6.1",  # transformers version used
    pytorch_version="1.7.1",  # pytorch version used
    py_version='py36',
    name=model_name
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,  # number of instances
    instance_type='ml.g4dn.xlarge',  # alternatives: 'ml.m5.xlarge', 'ml.inf1.xlarge'
    endpoint_name=endpoint_name,
)

predictor.predict({
	'inputs': "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."
})

Thanks!

IAM Permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "iam:GetRole",
                "iam:PassRole",
                "sagemaker:GetRecord"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:216283767174:feature-group/*",
                "arn:aws:s3:::qfn-transcription/*",
                "arn:aws:iam::216283767174:role/callTranscriptionsRole"
            ]
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateModel",
                "logs:GetLogRecord",
                "logs:DescribeSubscriptionFilters",
                "logs:StartQuery",
                "logs:DescribeMetricFilters",
                "ecr:BatchDeleteImage",
                "logs:ListLogDeliveries",
                "ecr:DeleteRepository",
                "logs:CreateLogStream",
                "logs:TagLogGroup",
                "logs:CancelExportTask",
                "logs:GetLogEvents",
                "logs:FilterLogEvents",
                "logs:DescribeDestinations",
                "sagemaker:CreateEndpoint",
                "logs:StopQuery",
                "cloudwatch:GetMetricStatistics",
                "logs:CreateLogGroup",
                "ecr:PutImage",
                "logs:PutMetricFilter",
                "logs:CreateLogDelivery",
                "servicecatalog:ListAcceptedPortfolioShares",
                "sagemaker:CreateEndpointConfig",
                "logs:PutResourcePolicy",
                "logs:DescribeExportTasks",
                "sagemaker:ListActions",
                "logs:GetQueryResults",
                "sagemaker:DescribeEndpointConfig",
                "logs:UpdateLogDelivery",
                "ecr:BatchGetImage",
                "logs:PutSubscriptionFilter",
                "ecr:InitiateLayerUpload",
                "logs:ListTagsLogGroup",
                "sagemaker:EnableSagemakerServicecatalogPortfolio",
                "logs:DescribeLogStreams",
                "ecr:UploadLayerPart",
                "logs:GetLogDelivery",
                "cloudwatch:ListMetrics",
                "servicecatalog:AcceptPortfolioShare",
                "logs:CreateExportTask",
                "ecr:CompleteLayerUpload",
                "logs:AssociateKmsKey",
                "sagemaker:DescribeEndpoint",
                "logs:DescribeQueryDefinitions",
                "logs:PutDestination",
                "logs:DescribeResourcePolicies",
                "ecr:DeleteRepositoryPolicy",
                "logs:DescribeQueries",
                "logs:DisassociateKmsKey",
                "sagemaker:DeleteApp",
                "logs:UntagLogGroup",
                "logs:DescribeLogGroups",
                "logs:PutDestinationPolicy",
                "logs:TestMetricFilter",
                "logs:PutQueryDefinition",
                "logs:DeleteDestination",
                "logs:PutLogEvents",
                "s3:ListAllMyBuckets",
                "ecr:SetRepositoryPolicy",
                "logs:PutRetentionPolicy",
                "logs:GetLogGroupFields"
            ],
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "sagemaker:CreateApp"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:216283767174:app/*/*/*/*",
                "arn:aws:s3:::qfn-transcription/*"
            ]
        },
        {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": "sagemaker:DescribeApp",
            "Resource": "arn:aws:sagemaker:*:216283767174:app/*/*/*/*"
        },
        {
            "Sid": "VisualEditor4",
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeTrainingJob",
                "sagemaker:CreateMonitoringSchedule",
                "sagemaker:PutRecord",
                "sagemaker:CreateTrainingJob",
                "sagemaker:CreateProcessingJob"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:216283767174:feature-group/*",
                "arn:aws:sagemaker:*:216283767174:monitoring-schedule/*",
                "arn:aws:sagemaker:*:216283767174:processing-job/*",
                "arn:aws:sagemaker:*:216283767174:training-job/*"
            ]
        },
        {
            "Sid": "VisualEditor5",
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeNotebookInstanceLifecycleConfig",
                "sagemaker:StopNotebookInstance",
                "sagemaker:DescribeNotebookInstance"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:216283767174:feature-group/*",
                "arn:aws:sagemaker:*:216283767174:notebook-instance-lifecycle-config/*",
                "arn:aws:sagemaker:*:216283767174:notebook-instance/*"
            ]
        },
        {
            "Sid": "VisualEditor6",
            "Effect": "Allow",
            "Action": [
                "ecr:SetRepositoryPolicy",
                "ecr:CompleteLayerUpload",
                "ecr:BatchGetImage",
                "ecr:BatchDeleteImage",
                "ecr:UploadLayerPart",
                "ecr:DeleteRepositoryPolicy",
                "ecr:InitiateLayerUpload",
                "ecr:DeleteRepository",
                "ecr:PutImage"
            ],
            "Resource": "arn:aws:ecr:*:*:repository/*"
        }
    ]
}

Hey @ujjirox,

The error is 100% related to your IAM permissions.

I saw that you edited your IAM permissions manually (the VisualEditorX statements). For example, ecr:GetDownloadUrlForLayer is missing, which is needed to pull the container image from ECR correctly.

Can you test the deployment with the AmazonSageMakerFullAccess managed policy? See here: SageMaker Roles - Amazon SageMaker

The IAM managed policy, AmazonSageMakerFullAccess, used in the following procedure only grants the execution role permission to perform certain Amazon S3 actions on buckets or objects with SageMaker, Sagemaker, sagemaker, or aws-glue in the name. To learn how to add an additional policy to an execution role to grant it access to other Amazon S3 buckets and objects, see Add Additional Amazon S3 Permissions to a SageMaker Execution Role.

If this works, you can take a look at more detailed permissions here: SageMaker Roles - Amazon SageMaker, and then create a new, clean role.

Thanks for your reply. You were indeed correct; it is in fact an IAM-policy-related issue. Even the AmazonSageMakerFullAccess policy didn't do the job, actually. I had to create a separate, overly permissive policy, but I was ultimately able to deploy it.

Thanks!

Hey @philschmid. I get a similar error but the error statement is a bit different:

UnexpectedStatusException: Error hosting endpoint huggingface-pytorch-inference-2021-11-16-15-09-40-461: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..

Do you think this is also an IAM Permissions issue or should I open up a new topic for this?

Many thanks,

Karim Foda

Feel free to open a new Thread for this!

The error means that SageMaker couldn't successfully create your endpoint. The reasons for it could be different. Let's discuss and solve them in a different thread!
When opening it, could you check the CloudWatch logs you get for this endpoint? Maybe they already tell us what the issue is.
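As a sketch (assuming the default SageMaker hosting log naming scheme, with an illustrative endpoint name), the endpoint's container logs live under a predictable log group, which you can then query:

```python
# SageMaker hosting writes container logs to a log group named after the
# endpoint (default naming scheme; the endpoint name here is illustrative).
endpoint_name = "summarization-endpoint"
log_group = f"/aws/sagemaker/Endpoints/{endpoint_name}"

print(log_group)  # /aws/sagemaker/Endpoints/summarization-endpoint

# With boto3 you could then list the log streams, e.g.:
# import boto3
# boto3.client("logs").describe_log_streams(logGroupName=log_group)
```

If that log group exists but contains no streams, the container likely crashed before it could emit anything, which is itself a useful clue.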

Thanks @philschmid. This turned out to be due to a typo in my HF token, so a permissions issue on my HF account side. It's all resolved now. Sharing here for anyone who runs into a similar issue.

@kmfoda I'm also facing the same issue, could you please help me out with this?

Hey @shreethamarai. My issue was mostly due to the fact that I had incorrectly typed the name of the HF token parameter in my request to AWS. Correcting that typo resolved the issue for me.

I am getting this error. For me it says that it can't recognise the word “falcon”. I couldn't find a solution. Can anyone please guide me?

@gill13 could you please share the code you used?

import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

sagemaker_session = sagemaker.Session()
role = ""

# image uri
llm_image = get_huggingface_llm_image_uri("huggingface")
print(f"image uri: {llm_image}")

# Falcon 7b
hub = {'HF_MODEL_ID': 'vilsonrodrigues/falcon-7b-instruct-sharded'}
model_file = "s3://sagemaker-ap-southeast-2-229322283192/my-model/falconmodel.tar.gz"

# Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,  # iam role from AWS
    image_uri=llm_image,
    sagemaker_session=sagemaker_session
)

# deploy model to SageMaker
predictor = huggingface_model.deploy(
    initial_instance_count=1,  # number of instances
    instance_type='ml.g5.2xlarge',  # alternative: 'ml.g5.4xlarge'
    container_startup_health_check_timeout=300
)

@philschmid can you please guide me? I am stuck.

I am getting this error, can anyone please help resolve it? Thanks

@gill13 can you try the container version 1.0.3?

I'm having the same issue, @philschmid


Given below is the code I'm using. I have fine-tuned the Llama-2-13b model and saved the model file to S3:

import json
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 300

# TGI config
config = {
  'HF_MODEL_ID': "meta-llama/Llama-2-13b-hf", # model_id from hf.co/models
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
  # 'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize
}

# retrieve the TGI container image (must be defined before creating the model)
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="1.0.3"
)

# create HuggingFaceModel
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  model_data=s3_model_uri,
  env=config
)
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
  container_startup_health_check_timeout=health_check_timeout, # give the container time to load the model
)

It's not failing at the beginning, only after some time.

Llama models are gated, so you need to provide an HF token. See Deploy Llama 2 7B/13B/70B on Amazon SageMaker
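As a minimal sketch of what that looks like (the token value is a placeholder; HUGGING_FACE_HUB_TOKEN is the environment variable the TGI container reads at startup):

```python
import json

# TGI config with the Hugging Face token added so gated repos
# (e.g. meta-llama/Llama-2-13b-hf) can be downloaded at container startup.
# The token value is a placeholder you must replace with your own.
config = {
    'HF_MODEL_ID': "meta-llama/Llama-2-13b-hf",
    'SM_NUM_GPUS': json.dumps(4),
    'HUGGING_FACE_HUB_TOKEN': "<REPLACE WITH YOUR TOKEN>",
}

# Note: logging in with huggingface-cli on your own machine is not enough;
# the token must be passed into the container through this env variable.
```

The config dict is then passed as `env=config` when creating the HuggingFaceModel, exactly like the other HF_* settings.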

When I ran
huggingface-cli login --token <REPLACE WITH YOUR TOKEN>

Token is valid (permission: write).
Your token has

but when I tried
assert config['HUGGING_FACE_HUB_TOKEN'] != "<REPLACE WITH YOUR TOKEN>", "Please set your Hugging Face Hub token"

I get
AssertionError: Please set your Hugging Face Hub token

The token has write permission, so it should ideally be working, right?

Any help with the above question is appreciated.