Thanks, I updated the volume size and added checkpointing. It seems the job still fails before completing the first epoch, though. My training data consists of 1.7M short text descriptions (~100 MB) across 23 classes. Would a distributed approach help here? Like in this post
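To clarify what I mean by checkpointing: something roughly like this simplified, framework-free sketch (placeholder names, JSON state instead of real model weights), saving every N steps instead of per epoch so a failure before epoch 1 completes doesn't lose all progress:

```python
import json
import os

def save_checkpoint(state, path):
    """Atomically write training state so a crash mid-save can't corrupt it."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on POSIX

def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)

def train(num_steps, ckpt_path, save_every=100):
    # Resume from the last saved step if a checkpoint exists
    state = load_checkpoint(ckpt_path) or {"step": 0}
    for step in range(state["step"], num_steps):
        # ... one optimizer step on a minibatch would go here ...
        state["step"] = step + 1
        if state["step"] % save_every == 0:
            save_checkpoint(state, ckpt_path)
    save_checkpoint(state, ckpt_path)  # final save
    return state["step"]
```

On restart, `train` picks up from the last saved step rather than from step 0, so repeated failures mid-epoch still make forward progress.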