Hello everyone,
Before I start, I should say that I have already read “No space left on device” when using HuggingFace + SageMaker - Amazon SageMaker - Hugging Face Forums, but sadly it had no workaround or fix that worked for us. Thanks in advance; I hope my explanation is detailed enough.
If I am missing something from the other topic, please let me know!
Here is some information.
Estimator:
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point='falcon_peft.py',      # training script
    source_dir='other_scripts',        # directory that contains all files needed for training
    instance_type='ml.g5.12xlarge',    # instance type used for the training job
    instance_count=1,                  # number of instances used for training
    base_job_name=job_name,            # name of the training job
    role=sage_maker.role,              # IAM role used by the training job to access AWS resources, e.g. S3
    volume_size=300,                   # size of the EBS volume in GB
    # transformers_version = '4.26',   # transformers version used in the training job
    # pytorch_version = '1.13',        # PyTorch version used in the training job
    py_version='py310',                # Python version used in the training job
    hyperparameters=hyperparameters,
    image_uri=llm_image,
)
huggingface_estimator.fit(data, wait=True)
Hyperparameters for training:
hyperparameters = {
    'model_id': "tiiuae/falcon-40b",
    'epochs': 30,
    'lr': 2e-5,
    'bf16': True,
    'lora_r': 4,
    'lora_alpha': 8,
    'lora_dropout': 0.05,
    'max_seq_length': 4096,
    'lr_scheduler_type': "constant",
    'train_dataset_path': training_data_path,
    'test_dataset_path': test_data_path,
    'hf_hub_access_token': os.getenv("HF_HUB_ACCESS_TOKEN"),
}
LLM image:
llm_image = \
    "763104351884.dkr.ecr.eu-west-1.amazonaws.com/" \
    "huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04"
Requirements:
trl==0.4.7
transformers==4.31.0
peft==0.4.0
bitsandbytes==0.40.2
accelerate==0.21.0
torch==2.0.1
wandb==0.15.5
numpy==1.25.1
Finally, our training script is very similar to
falcon_peft.py (github.com)
Here are some metrics about the instance:
Received error:
Downloading (…)l-00007-of-00009.bin: 61%|██████ | 5.82G/9.51G [01:52<48:59, 1.26MB/s]
Downloading shards: 67%|██████▋ | 6/9 [08:47<04:23, 87.91s/it]
Traceback (most recent call last):
  File "/opt/ml/code/falcon_peft.py", line 222, in <module>
    model, peft_config, tokenizer = create_and_prepare_model(script_args)
  File "/opt/ml/code/falcon_peft.py", line 171, in create_and_prepare_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 488, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2610, in from_pretrained
    resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 958, in get_checkpoint_shard_files
    cached_filename = cached_file(
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 417, in cached_file
    resolved_file = hf_hub_download(
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1364, in hf_hub_download
    http_get(
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 544, in http_get
    temp_file.write(chunk)
  File "/opt/conda/lib/python3.10/tempfile.py", line 483, in func_wrapper
    return func(*args, **kwargs)
OSError: [Errno 28] No space left on device
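To help narrow this down, here is a minimal sketch of a check that could be run at the top of falcon_peft.py, before the from_pretrained call. The log shows 9 shards, with shard 7 at 9.51G, so if all shards are about that size the full checkpoint is roughly 86 GB, well below the 300 GB volume_size. That suggests the download may be landing on a different, smaller filesystem. The helper names and the candidate paths below are assumptions made for illustration, not something confirmed from the job or the container image.

```python
import os
import shutil

GB = 10 ** 9  # decimal gigabytes, to match the units in the download progress bar


def free_space_gb(paths):
    """Return {path: free space in GB} for every path that exists."""
    report = {}
    for path in paths:
        if os.path.exists(path):
            report[path] = shutil.disk_usage(path).free / GB
    return report


def estimated_checkpoint_gb(num_shards, shard_gb):
    """Rough estimate: assume every shard is as large as the one seen in the log."""
    return num_shards * shard_gb


if __name__ == "__main__":
    # Assumed candidate locations; adjust them to the real container layout.
    candidates = [
        "/",
        "/tmp",
        "/opt/ml",
        os.path.expanduser("~/.cache/huggingface"),
    ]
    # 9 shards and 9.51G per shard are taken from the log above.
    needed = estimated_checkpoint_gb(num_shards=9, shard_gb=9.51)
    print(f"Estimated checkpoint size: ~{needed:.1f} GB")
    for path, free in free_space_gb(candidates).items():
        status = "enough" if free > needed else "too small"
        print(f"{path}: {free:.1f} GB free ({status})")
```

If the directory holding the Hugging Face cache is reported as too small while another mount has plenty of room, that would at least confirm where the space is running out.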
I hope this is enough to reproduce the issue.
Edit:
SageMaker script requirements:
sagemaker 2.177.1
boto3 1.26.161
botocore 1.29.161