CreateTrainingJob ValidationException

I am running a Hugging Face training job (which works locally) in a container on EC2. When I call `fit` on the estimator I get the error:

An error occurred (ValidationException) when calling the CreateTrainingJob operation: TrainingImageConfig with TrainingRepositoryAccessMode set to VPC must be provided when using a training image from a private Docker registry. Please provideTrainingImageConfig and TrainingRepositoryAccessMode set to VPC when using a training image from a private Docker registry.

The fields mentioned in the error, TrainingImageConfig and TrainingRepositoryAccessMode, do not appear in the documentation and I can't find them in the source code.
There are some mentions of vpc_config and VpcConfig, but there doesn't seem to be a way to pass these through to SageMaker from the Hugging Face estimator.

My code is basically this:

from sagemaker.huggingface import HuggingFace

hyperparameters = {'epochs': 1,
                   'per_device_train_batch_size': 32,
                   'model_name_or_path': model_name
                   }
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type=instance_type,
    instance_count=1,
    role=role,
    image_uri='docker.artifactory.xxxx.com/yyyy/mlaeep/mlaeep.0.0.1-dev',
    transformers_version='4.4',
    # pytorch_version='1.6',
    py_version='py36',
    hyperparameters=hyperparameters
)
huggingface_estimator.fit(
    {'train': training_input_path,
     'test': test_input_path
    },
    job_name='MlAeepTrainer'
)

@philschmid or @OlivierCR may be able to help with this.

I am getting essentially the same problem when I try to deploy my model.
My code is:

# from transformers import AutoModelForTokenClassification
from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFaceModel

modelPath = 's3://<bucket>/<key>/models/checkpoint-11000-Sample.gz'
image_uri = 'docker.artifactory.xxxx.com/aaaa/mlaeep/mlaeep.0.0.1-dev'
IAM_ROLE = 'arn:aws:iam::5555555555555:role/MY_ROLE'
instance_type = 'ml.p3.2xlarge'
vpc_config = {"Subnets": ["subnet-000dd0000dddd"],
              "SecurityGroupIds": ["sg-0f0a00a000000", "sg-11889033ae0d",
                                   "sg-d0c2aaaaaaaaaa"]
              }

model = HuggingFaceModel(
    model_data=modelPath,
    role=get_execution_role(),
    image_uri=image_uri,
    transformers_version='4.6',  # was 4.4
    pytorch_version='1.7',
    name='Aegiseep',
    vpc_config=vpc_config
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
)

inputString = "140006  820  76  DE  Dreher  ,  Denyse  M  File:  571587  Dept:  100  Rate:  1216  99  35  00  1  216  99  1  216  99  64  72  FIT  38  45  MA  02  661  18  X  CHECK  60  85  A  401K  Voucher#"
input = {"inputs": inputString}
predictor.predict(input)
predictor.delete_endpoint()

The error stack is:

Traceback (most recent call last):
  File "/Users/caseygre/Documents/PycharmProjects/venv/lib/python3.8/site-packages/sagemaker/session.py", line 2604, in create_model
    self.sagemaker_client.create_model(**create_model_request)
  File "/Users/caseygre/Documents/PycharmProjects/venv/lib/python3.8/site-packages/botocore/client.py", line 386, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/caseygre/Documents/PycharmProjects/venv/lib/python3.8/site-packages/botocore/client.py", line 705, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Using non-ECR image "docker.artifactory.us.caas.oneadp.com/aegis/mlaeep/mlaeep.0.0.1-dev" without Vpc repository access mode is not supported.
python-BaseException
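For reference, the field this CreateModel error is asking for lives on the low-level API as well: PrimaryContainer.ImageConfig with RepositoryAccessMode set to "Vpc". A hedged sketch of what the raw request would carry (all identifiers below are placeholders):

```python
# Sketch of the PrimaryContainer a raw create_model call would need for an
# image pulled from a private registry. Everything here is a placeholder,
# not a working configuration.
primary_container = {
    "Image": "docker.artifactory.example.com/team/mlaeep/mlaeep.0.0.1-dev",
    "ModelDataUrl": "s3://my-bucket/models/model.tar.gz",
    "ImageConfig": {
        "RepositoryAccessMode": "Vpc",
        "RepositoryAuthConfig": {
            # Lambda that returns the registry username/password
            "RepositoryCredentialsProviderArn":
                "arn:aws:lambda:us-east-1:111122223333:function:registry-creds",
        },
    },
}

# create_model also accepts the same VpcConfig shape that HuggingFaceModel takes:
# sagemaker_client.create_model(
#     ModelName="Aegiseep",
#     PrimaryContainer=primary_container,
#     ExecutionRoleArn=IAM_ROLE,
#     VpcConfig=vpc_config,
# )
```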

Hello @g3casey,

You can find the official documentation from AWS here: Use a Private Docker Registry for Real-Time Inference Containers - Amazon SageMaker.
It looks like the Python SDK currently does not support images from an external private registry.

I would love to know why you are planning to use a custom image rather than the DLCs we create together with AWS?

Thanks for the quick reply, Philipp.
For security reasons we are not allowed to go to outside repositories directly; our security policy blocks direct access to your registry. Also, our standard Unix version is not the same.
Is this support on the roadmap?

@philschmid, I am wondering if the only task required to support this feature would be to add the parameter to the HF layer of code that calls AWS. If so, maybe I could fork HF and do that. You might have a better idea of the level of effort (small, medium, large) that task would take.

What exactly do you mean by “is this support on the roadmap”? What does “this” refer to for you?

Roadmap: you said “It looks like the python sdk currently does not support images from an external private registry.” When will the SDK support private registries?

Also:
If I forked the HF repo, I would think it would just be a matter of adding the VPC parameters to the HF code as a pass-through to AWS. Does this sound right to you?

Roadmap: you said “It looks like the python sdk currently does not support images from an external private registry.” When will the SDK support private registries?

I am not sure if the SDK will ever support this. You would have to open an issue in the Python SageMaker SDK repository, since this is not related to the HF extension.

If I forked the HF repo, I would think it would just be a matter of adding the VPC parameters to the HF code as a pass-through to AWS. Does this sound right to you?

You would also need to create the Lambda function, as shown in the documentation.
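That Lambda is just a small credentials provider: SageMaker invokes it when pulling the image and expects the registry username and password back in a fixed JSON shape. A minimal sketch, with hard-coded placeholder values where a real function would read from something like AWS Secrets Manager:

```python
def lambda_handler(event, context):
    """Credentials provider for a private Docker registry.

    SageMaker invokes this function and expects the registry username and
    password back in this shape. The values below are placeholders; in
    practice you would fetch them from a secrets store rather than
    hard-coding them.
    """
    return {
        "Credentials": {
            "Username": "registry-user",      # placeholder
            "Password": "registry-password",  # placeholder
        }
    }
```

You then point TrainingRepositoryCredentialsProviderArn (or RepositoryCredentialsProviderArn for inference) at this function's ARN.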