I am running a Hugging Face training job (which works locally) in a container running in EC2. When I call fit on the estimator I get the error:
An error occurred (ValidationException) when calling the CreateTrainingJob operation: TrainingImageConfig with TrainingRepositoryAccessMode set to VPC must be provided when using a training image from a private Docker registry. Please provide TrainingImageConfig and TrainingRepositoryAccessMode set to VPC when using a training image from a private Docker registry.
The parameters mentioned in the error, TrainingImageConfig and TrainingRepositoryAccessMode, do not appear in the documentation, and I can't find them in the source code either.
There are some mentions of vpc_config and vpcConfig, but there doesn't seem to be a way to pass these through to SageMaker from the Hugging Face estimator.
My code is basically this:
hyperparameters = {
    'epochs': 1,
    'per_device_train_batch_size': 32,
    'model_name_or_path': model_name,
}

huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type=instance_type,
    instance_count=1,
    role=role,
    image_uri='docker.artifactory.xxxx.com/yyyy/mlaeep/mlaeep.0.0.1-dev',
    transformers_version='4.4',
    # pytorch_version='1.6',
    py_version='py36',
    hyperparameters=hyperparameters,
)

huggingface_estimator.fit(
    {
        'train': training_input_path,
        'test': test_input_path,
    },
    job_name='MlAeepTrainer',
)
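For reference, the fields in the error message belong to the low-level CreateTrainingJob API, not to anything the HuggingFace estimator currently exposes. A minimal sketch of the AlgorithmSpecification the service is asking for — the image URI, job name, and role ARN below are placeholders, not values from this thread:

def build_algorithm_spec(image_uri):
    """Build the AlgorithmSpecification block the error message asks for."""
    return {
        "TrainingImage": image_uri,
        "TrainingInputMode": "File",
        "TrainingImageConfig": {
            # Tells SageMaker to pull the image over the VPC instead of ECR
            "TrainingRepositoryAccessMode": "Vpc",
        },
    }

spec = build_algorithm_spec("docker.example.com/team/image:0.0.1")

# A direct boto3 call would then look roughly like:
# import boto3
# sm = boto3.client("sagemaker")
# sm.create_training_job(
#     TrainingJobName="my-job",
#     AlgorithmSpecification=spec,
#     RoleArn="arn:aws:iam::111122223333:role/MY_ROLE",
#     # ...plus input/output data config, resources, VpcConfig, stopping condition
# )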
@philschmid or @OlivierCR may be able to help with this.
I am getting essentially the same problem when I try to deploy my model.
My code is:
# from transformers import AutoModelForTokenClassification
from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFaceModel

model_path = 's3://<bucket>/<key>/models/checkpoint-11000-Sample.gz'
image_uri = 'docker.artifactory.xxxx.com/aaaa/mlaeep/mlaeep.0.0.1-dev'
IAM_ROLE = 'arn:aws:iam::5555555555555:role/MY_ROLE'
instance_type = 'ml.p3.2xlarge'
vpc_config = {
    "Subnets": ["subnet-000dd0000dddd"],
    "SecurityGroupIds": ["sg-0f0a00a000000", "sg-11889033ae0d",
                         "sg-d0c2aaaaaaaaaa"],
}

model = HuggingFaceModel(
    model_data=model_path,
    role=get_execution_role(),
    image_uri=image_uri,
    transformers_version='4.6',  # was 4.4
    pytorch_version='1.7',
    name='Aegiseep',
    vpc_config=vpc_config,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
)

input_string = "140006 820 76 DE Dreher , Denyse M File: 571587 Dept: 100 Rate: 1216 99 35 00 1 216 99 1 216 99 64 72 FIT 38 45 MA 02 661 18 X CHECK 60 85 A 401K Voucher#"
payload = {"inputs": input_string}
predictor.predict(payload)
predictor.delete_endpoint()
The error stack is:
Traceback (most recent call last):
File "/Users/caseygre/Documents/PycharmProjects/venv/lib/python3.8/site-packages/sagemaker/session.py", line 2604, in create_model
self.sagemaker_client.create_model(**create_model_request)
File "/Users/caseygre/Documents/PycharmProjects/venv/lib/python3.8/site-packages/botocore/client.py", line 386, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/Users/caseygre/Documents/PycharmProjects/venv/lib/python3.8/site-packages/botocore/client.py", line 705, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Using non-ECR image "docker.artifactory.us.caas.oneadp.com/aegis/mlaeep/mlaeep.0.0.1-dev" without Vpc repository access mode is not supported.
python-BaseException
Hello @g3casey,
You can find the official documentation from AWS here: Use a Private Docker Registry for Real-Time Inference Containers - Amazon SageMaker
It looks like the Python SDK currently does not support images from an external private registry.
Out of curiosity, why are you planning to use a custom image rather than the DLCs we create together with AWS?
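For context, the documentation linked above corresponds to an ImageConfig block on the low-level CreateModel request, which the error message's "Vpc repository access mode" refers to. A hedged sketch of what that request body would contain (image URI, S3 path, and role ARN are placeholders):

def build_primary_container(image_uri, model_data_url):
    """Build the PrimaryContainer block for CreateModel with VPC registry access."""
    return {
        "Image": image_uri,
        "ModelDataUrl": model_data_url,
        "ImageConfig": {
            # Pull the image from a private registry reachable over the VPC
            "RepositoryAccessMode": "Vpc",
        },
    }

container = build_primary_container(
    "docker.example.com/team/image:0.0.1",
    "s3://my-bucket/models/model.tar.gz",
)

# A direct boto3 call would then look roughly like:
# import boto3
# sm = boto3.client("sagemaker")
# sm.create_model(
#     ModelName="my-model",
#     PrimaryContainer=container,
#     ExecutionRoleArn="arn:aws:iam::111122223333:role/MY_ROLE",
#     VpcConfig={"Subnets": [...], "SecurityGroupIds": [...]},
# )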
Thanks for the quick reply, Philipp.
For security reasons we are not allowed to access outside repositories directly; our security policy blocks direct access to your registry. Also, our standard Unix version is not the same.
Is this support on the roadmap?
@philschmid, I am wondering if the only task required to support this feature would be adding the parameter to the HF layers of code that call AWS. If so, maybe I could fork HF and do that. You might have a better idea of the level of effort (small, medium, large) that task would take.
What exactly do you mean by "is this support on the roadmap"? What does "this" refer to for you?
Roadmap: you said "It looks like the Python SDK currently does not support images from an external private registry." When will the SDK support private registries?
Also:
If I forked the HF repository, I would think it would just be a matter of adding the VPC parameters to the HF code as a pass-through to AWS. Does this sound right to you?
Roadmap: you said "It looks like the Python SDK currently does not support images from an external private registry." When will the SDK support private registries?
I am not sure the SDK will ever support this. You would have to open an issue in the SageMaker Python SDK repository, since this is not related to the HF extension.
If I forked the HF repository, I would think it would just be a matter of adding the VPC parameters to the HF code as a pass-through to AWS. Does this sound right to you?
You also need to create the Lambda function as shown in the documentation.
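For anyone landing here later: the documentation referenced above describes an optional Lambda function that SageMaker invokes to authenticate against the private registry when it requires login. A rough sketch of the response shape that doc describes, assuming its contract; the credential values here are placeholders, and in practice they would come from something like Secrets Manager:

def lambda_handler(event, context):
    """Return registry credentials to SageMaker in the documented shape."""
    # event carries details about the registry SageMaker is trying to pull from
    return {
        "Credentials": {
            "Username": "placeholder-user",      # assumption: fetch from a secret
            "Password": "placeholder-password",  # assumption: fetch from a secret
        }
    }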