"No space left on device" when using HuggingFace + SageMaker

MaximusDecimusMeridi · April 21, 2022, 2:21am

Thanks for the clarification. Going by “Training DLC Overview”, it looks to me like the versions should be:

import os

from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFace

# instance_type, hyperparameters, metric_definitions and dataconnector are defined elsewhere (not shown)
estimator = HuggingFace(
    entry_point          = 'train.py',             # fine-tuning script used in the training job
    source_dir           = 'embed_source',         # directory where the fine-tuning script is stored
    instance_type        = instance_type,          # instance type used for the training job
    instance_count       = 1,                      # number of instances used for training
    role                 = get_execution_role(),   # IAM role used in the training job to access AWS resources
    transformers_version = '4.17.0',               # transformers version used in the training job
    max_run              = 36000,                  # maximum runtime in seconds (10 hours)
    pytorch_version      = '1.10.2',               # PyTorch version used in the training job
    py_version           = 'py38',                 # Python version used in the training job
    hyperparameters      = hyperparameters,        # hyperparameters passed to the training script
    metric_definitions   = metric_definitions,     # regex definitions used to extract metrics from the logs
    output_path          = os.path.join(dataconnector.version_s3_prefix, "models"),  # S3 location for model artifacts
    code_location        = os.path.join(dataconnector.version_s3_prefix, "models"),  # S3 location for the uploaded source code
    volume_size          = 200,                    # size in GB of the storage volume attached to the instance
    checkpoint_s3_uri    = 's3://kj-temp/checkpoints',  # S3 location where checkpoints are synced
)

However, this fails with:

ClientError: TrainingHostAgent Initialization failed:API error (404): manifest for 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu110-ubuntu18.04 not found: manifest unknown: Requested image not found

Based on this post I tried passing

image_uri = '763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-cpu-py38-ubuntu20.04-v1.0',

But now the training job fails with:

FileNotFoundError: [Errno 2] No such file or directory: 'train'

What should I be passing to use the latest DLC in training?
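
To compare the two image URIs above side by side, they can be split into registry, repository, and tag. This is only a minimal sketch using the standard library; the helper name `split_image_uri` is hypothetical and it does not query ECR, so it says nothing about which tags actually exist.

```python
def split_image_uri(uri):
    """Split an ECR image URI of the form registry/repository:tag into its parts."""
    registry, _, rest = uri.partition("/")
    repository, _, tag = rest.partition(":")
    return {"registry": registry, "repository": repository, "tag": tag}


# URI from the "manifest unknown" error (assembled by the SDK from the estimator arguments)
requested = (
    "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
    "huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu110-ubuntu18.04"
)

# URI passed manually as image_uri
passed = (
    "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
    "huggingface-pytorch-inference:1.10.2-transformers4.17.0-cpu-py38-ubuntu20.04-v1.0"
)

for label, uri in (("requested by SDK", requested), ("passed as image_uri", passed)):
    parts = split_image_uri(uri)
    print(f"{label}: repository={parts['repository']} tag={parts['tag']}")
```

Split this way, the first URI points at the `huggingface-pytorch-training` repository with a `gpu` tag, while the one I passed points at `huggingface-pytorch-inference` with a `cpu` tag. My assumption, not confirmed in this thread, is that an inference image has no `train` entry point, which would fit the `FileNotFoundError` above.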
