SageMaker OS Error No Space Left On Device while trying to train Falcon40B

Hello everyone,

Before I start, I must say that I have already seen “No space left on device” when using HuggingFace + SageMaker - Amazon SageMaker - Hugging Face Forums, but sadly there was no workaround or fix there that worked for us. Thanks in advance, and I hope my explanation will be enough.

If I am missing something from the other topic, please let me know!

Here is some info:

Estimator:

huggingface_estimator = HuggingFace(
    entry_point='falcon_peft.py',      # training script
    source_dir='other_scripts',        # directory which includes all the files needed for training
    instance_type='ml.g5.12xlarge',    # instance type used for the training job
    instance_count=1,                  # the number of instances used for training
    base_job_name=job_name,            # the name of the training job
    role=sage_maker.role,              # IAM role used in the training job to access AWS resources, e.g. S3
    volume_size=300,                   # the size of the EBS volume in GB
    # transformers_version = '4.26',   # the transformers version used in the training job
    # pytorch_version      = '1.13',   # the pytorch version used in the training job
    py_version='py310',                # the python version used in the training job
    hyperparameters=hyperparameters,
    image_uri=llm_image,
)

huggingface_estimator.fit(data, wait=True)
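
For context, data is the usual channel-to-S3 mapping passed to fit(); the exact keys below are placeholders, since I haven't copied the real dict here. SageMaker downloads each channel to /opt/ml/input/data/<channel_name> inside the container, which is where the training script can read it from.

# Assumed shape of the data argument (channel names are placeholders);
# each channel ends up under /opt/ml/input/data/<channel_name> in the container.
data = {
    "training": training_data_path,  # s3:// URI of the training set
    "test": test_data_path,          # s3:// URI of the test set
}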

Hyperparameters for training:

hyperparameters = {
    'model_id': "tiiuae/falcon-40b",
    'epochs': 30,
    'lr': 2e-5,
    'bf16': True,
    'lora_r': 4,
    'lora_alpha': 8,
    'lora_dropout': 0.05,
    'max_seq_length': 4096,
    'lr_scheduler_type': "constant",
    'train_dataset_path': training_data_path,
    'test_dataset_path': test_data_path,
    'hf_hub_access_token': os.getenv("HF_HUB_ACCESS_TOKEN"),
}
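
For reference, SageMaker hands these hyperparameters to the entry point as command-line arguments (e.g. --model_id tiiuae/falcon-40b --epochs 30 ...), so falcon_peft.py reads them through its argument parser. A rough sketch of that parsing (the real script defines its own parser, so the names and defaults here are just mirrored from the dict above):

# Minimal sketch: SageMaker invokes the entry point roughly as
#   python falcon_peft.py --model_id tiiuae/falcon-40b --epochs 30 --lr 2e-05 ...
# so the script recovers the values with an argument parser.
import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, default="tiiuae/falcon-40b")
    parser.add_argument("--epochs", type=int, default=30)
    parser.add_argument("--lr", type=float, default=2e-5)
    parser.add_argument("--lora_r", type=int, default=4)
    parser.add_argument("--lora_alpha", type=int, default=8)
    parser.add_argument("--lora_dropout", type=float, default=0.05)
    parser.add_argument("--max_seq_length", type=int, default=4096)
    parser.add_argument("--lr_scheduler_type", type=str, default="constant")
    parser.add_argument("--train_dataset_path", type=str)
    parser.add_argument("--test_dataset_path", type=str)
    parser.add_argument("--hf_hub_access_token", type=str, default=None)
    # ... remaining flags (bf16, etc.) follow the same pattern
    return parser.parse_args()

script_args = parse_args()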

LLM image:

llm_image = \
        "763104351884.dkr.ecr.eu-west-1.amazonaws.com/" \
        "huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04"

Requirements

trl==0.4.7
transformers==4.31.0
peft==0.4.0
bitsandbytes==0.40.2
accelerate==0.21.0
torch==2.0.1
wandb==0.15.5
numpy==1.25.1

And finally, we are using a very similar version of falcon_peft.py (github.com).

Here are some metrics about the instance:

[instance metrics screenshot]

Received error:

Downloading (…)l-00007-of-00009.bin:  61%|██████    | 5.82G/9.51G [01:52<48:59, 1.26MB/s]
Downloading shards:  67%|██████▋   | 6/9 [08:47<04:23, 87.91s/it]
Traceback (most recent call last):
  File "/opt/ml/code/falcon_peft.py", line 222, in <module>
    model, peft_config, tokenizer = create_and_prepare_model(script_args)
  File "/opt/ml/code/falcon_peft.py", line 171, in create_and_prepare_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 488, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2610, in from_pretrained
    resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 958, in get_checkpoint_shard_files
    cached_filename = cached_file(
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 417, in cached_file
    resolved_file = hf_hub_download(
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1364, in hf_hub_download
    http_get(
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 544, in http_get
    temp_file.write(chunk)
  File "/opt/conda/lib/python3.10/tempfile.py", line 483, in func_wrapper
    return func(*args, **kwargs)
OSError: [Errno 28] No space left on device
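
In case it helps with diagnosing, a quick way to see which filesystem is actually filling up inside the container is to log free space for the relevant paths before the model weights are downloaded. This is a generic diagnostic sketch, not part of falcon_peft.py:

# Diagnostic sketch: print free space for the root filesystem, the SageMaker
# directories, and /tmp inside the training container before downloading weights.
import shutil

for path in ("/", "/opt/ml", "/tmp"):
    total, used, free = shutil.disk_usage(path)
    print(f"{path}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")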

I hope this is enough to reproduce the issue.

Edit:

SageMaker script requirements:

sagemaker 2.177.1
boto3 1.26.161
botocore 1.29.161

Can you try setting the CACHE_DIR to /tmp? I think that’s where SageMaker attaches the volume.
See: Installation

Hello @philschmid,

Firstly, thank you for your quick response!

As you suggested, I tried setting the environment variables below at the top of the falcon_peft.py training script for caching:

os.environ["TRANSFORMERS_CACHE"] = "/tmp/.cache"
os.environ["HF_DATASETS_CACHE"] = "/tmp/.cache"

Yet it didn’t work. Then I thought maybe your Llama 2 script had the answer, and thanks to your work I managed to resolve the issue by adding
HUGGINGFACE_HUB_CACHE="/tmp/.cache" to the estimator's environment as below:

huggingface_estimator = HuggingFace(
    entry_point='falcon_peft.py',      # training script
    source_dir='other_scripts',        # directory which includes all the files needed for training
    instance_type='ml.g5.12xlarge',    # instance type used for the training job
    instance_count=1,                  # the number of instances used for training
    base_job_name=job_name,            # the name of the training job
    role=sage_maker.role,              # IAM role used in the training job to access AWS resources, e.g. S3
    volume_size=300,                   # the size of the EBS volume in GB
    # transformers_version = '4.26',   # the transformers version used in the training job
    # pytorch_version      = '1.13',   # the pytorch version used in the training job
    py_version='py310',                # the python version used in the training job
    hyperparameters=hyperparameters,
    image_uri=llm_image,
    environment={"HUGGINGFACE_HUB_CACHE": "/tmp/.cache"},
)

Right now everything works perfectly!
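
My guess at why this works (an assumption on my side, not something confirmed here) is that variables passed through environment are exported before the training process starts, so they are already set when huggingface_hub resolves its cache directory at import time. Setting them inside the script should behave the same as long as it happens before any transformers/huggingface_hub import, roughly like this:

# Must run before transformers / huggingface_hub are imported, otherwise the
# default cache location may already have been resolved.
import os
os.environ["HUGGINGFACE_HUB_CACHE"] = "/tmp/.cache"

from transformers import AutoModelForCausalLM  # imported only after the env var is set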

Just a kind question: is there any documentation about this issue on the Hub, and if so, where can I find it? If it weren’t for your script, I would probably have spent hours trying to resolve it. Thanks in advance!

Great that you solved it. The documentation mentions HUGGINGFACE_HUB_CACHE; whether there is anything SageMaker-specific, I don’t know.