Thanks for the clarification. Under “Training DLC Overview”, it looks to me like the versions should be
import os
from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point = 'train.py', # fine-tuning script used in the training job
    source_dir = 'embed_source', # directory where the fine-tuning script is stored
    instance_type = instance_type, # instance type used for the training job
    instance_count = 1, # the number of instances used for training
    role = get_execution_role(), # IAM role used by the training job to access AWS resources
    transformers_version = '4.17.0', # the transformers version used in the training job
    max_run = 36000, # maximum runtime of the training job, in seconds
    pytorch_version = '1.10.2', # the pytorch version used in the training job
    py_version = 'py38', # the python version used in the training job
    hyperparameters = hyperparameters, # the hyperparameters used for running the training job
    metric_definitions = metric_definitions, # the metric regex definitions used to extract values from the logs
    output_path = os.path.join(dataconnector.version_s3_prefix, "models"),
    code_location = os.path.join(dataconnector.version_s3_prefix, "models"),
    volume_size = 200, # size of the attached storage volume, in GB
    checkpoint_s3_uri = 's3://kj-temp/checkpoints'
)
However, this fails with
ClientError: TrainingHostAgent Initialization failed:API error (404): manifest for 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu110-ubuntu18.04 not found: manifest unknown: Requested image not found
Based on this post I tried passing
image_uri = '763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-cpu-py38-ubuntu20.04-v1.0',
But now the training job fails with
FileNotFoundError: [Errno 2] No such file or directory: 'train'
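To compare the two image URIs above, here is a minimal sketch that splits each one into its parts. It assumes the usual ECR layout `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`; the helper `parse_ecr_image_uri` and the `EcrImage` type are hypothetical names of my own, not part of the SageMaker SDK.

```python
from typing import NamedTuple


class EcrImage(NamedTuple):
    account: str
    region: str
    repository: str
    tag: str


def parse_ecr_image_uri(uri: str) -> EcrImage:
    """Split '<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>' (assumed layout)."""
    registry, _, remainder = uri.partition("/")
    repository, _, tag = remainder.rpartition(":")
    host_parts = registry.split(".")
    # Expected host parts: [account, 'dkr', 'ecr', region, 'amazonaws', 'com'].
    # URIs without an explicit ':<tag>' are rejected as well.
    if len(host_parts) < 6 or host_parts[1:3] != ["dkr", "ecr"] or not repository:
        raise ValueError(f"not a tagged ECR image URI: {uri!r}")
    return EcrImage(host_parts[0], host_parts[3], repository, tag)


# URI from the first error message (resolved by the estimator)
resolved = parse_ecr_image_uri(
    "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
    "huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu110-ubuntu18.04"
)
# URI I passed explicitly via image_uri
override = parse_ecr_image_uri(
    "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
    "huggingface-pytorch-inference:1.10.2-transformers4.17.0-cpu-py38-ubuntu20.04-v1.0"
)

print(resolved.repository)  # huggingface-pytorch-training
print(override.repository)  # huggingface-pytorch-inference
```

Looking at the output, the URI I passed points at the inference repository (and a CPU tag) rather than the training one, which may be related to the second failure, though I have not confirmed that.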
What should I be passing to use the latest DLC in training?