SageMaker OS Error No Space Left On Device while trying to train Falcon40B

Before I start, I should say that I have already seen “No space left on device” when using HuggingFace + SageMaker - Amazon SageMaker - Hugging Face Forums, but sadly there was no workaround or fix there that worked for us. Thanks in advance; I hope my explanation is sufficient.

huggingface_estimator = HuggingFace(
        entry_point='',  # train script
        source_dir='other_scripts',  # directory which includes all the files needed for training
        instance_type='ml.g5.12xlarge',  # instances type used for the training job
        instance_count=1,  # the number of instances used for training
        base_job_name=job_name,  # the name of the training job
        role=sage_maker.role,  # IAM role used in the training job to access AWS resources, e.g. S3
        volume_size=300,  # the size of the EBS volume in GB
        # transformers_version = '4.26',            # the transformers version used in the training job
        # pytorch_version      = '1.13',            # the pytorch_version version used in the training job
        py_version='py310',  # the python version used in the training job
    ), wait=True)

Hyperparameters for training:

hyperparameters = {
        'model_id': "tiiuae/falcon-40b",
        'epochs': 30,
        'lr': 2e-5,
        'bf16': True,
        'lora_r': 4,
        'lora_alpha': 8,
        'lora_dropout': 0.05,
        'max_seq_length': 4096,
        'lr_scheduler_type': "constant",
        'train_dataset_path': training_data_path,
        'test_dataset_path': test_data_path,
        'hf_hub_access_token': os.getenv("HF_HUB_ACCESS_TOKEN"),
    }

LLM image:

llm_image = \
        "" \


trl == 0.4.7
transformers == 4.31.0
peft == 0.4.0
bitsandbytes == 0.40.2
accelerate == 0.21.0

And finally we are using very similar version of (

Here are some metrics about the instance;

Received Error

Downloading (…)l-00007-of-00009.bin:  61%|██████    | 5.82G/9.51G [01:52<48:59, 1.26MB/s]
Downloading shards:  67%|██████▋   | 6/9 [08:47<04:23, 87.91s/it]
Traceback (most recent call last):
File "/opt/ml/code/", line 222, in <module>
    model, peft_config, tokenizer = create_and_prepare_model(script_args)
File "/opt/ml/code/", line 171, in create_and_prepare_model
    model = AutoModelForCausalLM.from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/", line 488, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/", line 2610, in from_pretrained
    resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
File "/opt/conda/lib/python3.10/site-packages/transformers/utils/", line 958, in get_checkpoint_shard_files
    cached_filename = cached_file(
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/", line 417, in cached_file
    resolved_file = hf_hub_download(
File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/", line 120, in _inner_fn
    return fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/", line 1364, in hf_hub_download
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/", line 544, in http_get
  File "/opt/conda/lib/python3.10/", line 483, in func_wrapper
    return func(*args, **kwargs)
OSError: [Errno 28] No space left on device

I hope everything is enough for reproducing the issue.
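
For a rough idea of how much disk the download needs, the progress lines in the log above give a few numbers: 9 shards in total, 6 finished, and shard 7 failing at 5.82G of 9.51G. Assuming every shard is about the same size as shard 7 (an assumption, the log does not show the other sizes), a back-of-the-envelope estimate can be sketched as follows. The helper name `estimate_download_gb` is made up for this example.

```python
# Numbers read from the progress lines in the log.
SHARD_COUNT = 9          # "...-of-00009.bin"
SHARD_SIZE_GB = 9.51     # size of shard 7; assumed equal for all shards
COMPLETED_SHARDS = 6     # "Downloading shards: 6/9"
PARTIAL_GB = 5.82        # progress of shard 7 when the error was raised


def estimate_download_gb(shard_count, shard_size_gb, completed, partial_gb):
    """Return (total size, size written at failure) in GB, rounded to 2 decimals."""
    total = shard_count * shard_size_gb
    written = completed * shard_size_gb + partial_gb
    return round(total, 2), round(written, 2)


total_gb, written_gb = estimate_download_gb(
    SHARD_COUNT, SHARD_SIZE_GB, COMPLETED_SHARDS, PARTIAL_GB
)
print(f"estimated checkpoint size: {total_gb} GB")
print(f"estimated data written before the error: {written_gb} GB")
```

Under that assumption the checkpoint is roughly 86 GB and about 63 GB had been written when the error was raised, both well below the 300 GB requested with volume_size. Other files on the same disk are not counted here, so this is only an order-of-magnitude check.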


SageMaker script requirements:

sagemaker 2.177.1
boto3 1.26.161
botocore 1.29.161

Can you try setting the CACHE_DIR to /tmp? I think that’s where SageMaker attaches the volume.
See: Installation
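
To see which location actually has free space before choosing a cache directory, a small check like the one below could be run at the top of the training script. This is only a sketch: the list of paths is an assumption (common locations in a training container, not something confirmed in this thread), and `report_free_space` is a hypothetical helper name.

```python
import os
import shutil

# Candidate locations to inspect. These are assumptions; adjust for your container.
CANDIDATE_PATHS = ["/", "/tmp", "/opt/ml", os.path.expanduser("~/.cache")]


def report_free_space(paths=CANDIDATE_PATHS):
    """Return {path: free space in GiB} for every path that exists."""
    report = {}
    for path in paths:
        if not os.path.exists(path):
            continue  # skip locations that are not present on this machine
        usage = shutil.disk_usage(path)
        report[path] = usage.free / 1024**3
    return report


if __name__ == "__main__":
    for path, free_gib in report_free_space().items():
        print(f"{path}: {free_gib:.1f} GiB free")
```

If /tmp reports far more free space than the home directory, that would support pointing the cache there.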

Hello @philschmid,

Firstly thank you for your quick response!

As you suggested, I tried setting the cache environment variables at the top of the training script, as below:

os.environ["TRANSFORMERS_CACHE"] = "/tmp/.cache"
os.environ["HF_DATASETS_CACHE"] = "/tmp/.cache"

That didn’t work. I then thought your Llama 2 script might have the answer, and thanks to your work I managed to resolve the issue by adding
HUGGINGFACE_HUB_CACHE="/tmp/.cache" to the estimator's environment, as below:

    huggingface_estimator = HuggingFace(
        entry_point='',  # train script
        source_dir='other_scripts',  # directory which includes all the files needed for training
        instance_type='ml.g5.12xlarge',  # instances type used for the training job
        instance_count=1,  # the number of instances used for training
        base_job_name=job_name,  # the name of the training job
        role=sage_maker.role,  # IAM role used in the training job to access AWS resources, e.g. S3
        volume_size=300,  # the size of the EBS volume in GB
        # transformers_version = '4.26',            # the transformers version used in the training job
        # pytorch_version      = '1.13',            # the pytorch_version version used in the training job
        py_version='py310',  # the python version used in the training job
        environment={"HUGGINGFACE_HUB_CACHE": "/tmp/.cache"},  # store Hub downloads under /tmp
    )

Right now everything works perfectly!
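
A possible explanation for why setting the variables inside the script had no effect (this is an assumption, nothing in this thread confirms it): the Hugging Face libraries resolve their cache paths when they are first imported, so variables assigned after `import transformers` may arrive too late. If the in-script route is preferred over the estimator's `environment` argument, a minimal sketch would be to set the variables before any Hugging Face import. The helper name `configure_hf_cache` is hypothetical.

```python
import os

CACHE_DIR = "/tmp/.cache"  # same path as used in the estimator's environment


def configure_hf_cache(cache_dir=CACHE_DIR):
    """Create cache_dir and point the Hugging Face cache variables at it."""
    os.makedirs(cache_dir, exist_ok=True)
    for name in ("HUGGINGFACE_HUB_CACHE", "TRANSFORMERS_CACHE", "HF_DATASETS_CACHE"):
        os.environ[name] = cache_dir
    return cache_dir


# Call this first, at the very top of the training script.
configure_hf_cache()

# Import the Hugging Face libraries only after the call above, for example:
# from transformers import AutoModelForCausalLM
```

Setting the variable through the estimator, as above, avoids the ordering question entirely because the value is already in the environment when the script starts.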

One more question: is there any documentation about this issue on the Hub, and if so, where can I find it? Without your script I would probably have spent hours trying to resolve it. Thanks in advance!

Great that you solved it. The documentation mentions HUGGINGFACE_HUB_CACHE; whether there is anything specific to SageMaker, I don’t know.