"No space left on device" when using HuggingFace + SageMaker

Thanks for the clarification. Going by the “Training DLC Overview”, it looks like the versions should be:

import os
from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point          = 'train.py',            # fine-tuning script used in the training job
    source_dir           = 'embed_source',        # directory where the fine-tuning script is stored
    instance_type        = instance_type,         # instance type used for the training job
    instance_count       = 1,                     # number of instances used for training
    role                 = get_execution_role(),  # IAM role used by the training job to access AWS resources
    transformers_version = '4.17.0',              # transformers version used in the training job
    max_run              = 36000,                 # maximum run time in seconds
    pytorch_version      = '1.10.2',              # PyTorch version used in the training job
    py_version           = 'py38',                # Python version used in the training job
    hyperparameters      = hyperparameters,       # hyperparameters passed to the training script
    metric_definitions   = metric_definitions,    # regex definitions used to extract metrics from the logs
    output_path          = os.path.join(dataconnector.version_s3_prefix, "models"),
    code_location        = os.path.join(dataconnector.version_s3_prefix, "models"),
    volume_size          = 200,                   # EBS volume size in GB
    checkpoint_s3_uri    = 's3://kj-temp/checkpoints',
)

However, this fails with

ClientError: TrainingHostAgent Initialization failed:API error (404): manifest for 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu110-ubuntu18.04 not found: manifest unknown: Requested image not found

Based on this post, I tried passing the image explicitly:

image_uri = '763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-cpu-py38-ubuntu20.04-v1.0',

But now the training job fails with

FileNotFoundError: [Errno 2] No such file or directory: 'train'
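
To compare the two image URIs side by side, here is a minimal sketch that splits each one into registry, repository, and tag. It is plain string parsing and does not query ECR, so it says nothing about whether either image actually exists; the helper name split_image_uri is my own and not part of the SageMaker SDK.

```python
# Split an ECR image URI into its parts so two URIs can be compared.
# Pure string parsing: this does not check whether the image exists in ECR.

def split_image_uri(uri):
    """Return (registry, repository, tag) for a URI shaped like registry/repo:tag."""
    registry, _, rest = uri.partition("/")
    repository, _, tag = rest.partition(":")
    return registry, repository, tag

# URI from the 404 error (built by the SDK from the estimator arguments)
failed_uri = (
    "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
    "huggingface-pytorch-training:"
    "1.10.2-transformers4.17.0-gpu-py38-cu110-ubuntu18.04"
)

# URI that I passed manually through image_uri
manual_uri = (
    "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
    "huggingface-pytorch-inference:"
    "1.10.2-transformers4.17.0-cpu-py38-ubuntu20.04-v1.0"
)

for label, uri in [("failed", failed_uri), ("manual", manual_uri)]:
    registry, repository, tag = split_image_uri(uri)
    print(f"{label}: repository={repository} tag={tag}")
```

Laid out this way, the URI I passed manually comes from the huggingface-pytorch-inference repository with a cpu tag, while the one the SDK built targets huggingface-pytorch-training with a gpu tag. I have not confirmed that this mismatch is what causes the FileNotFoundError, but it seems worth ruling out.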

What should I be passing to use the latest DLC in training?