I’m following the Fine-Tuning Llama 2 on SageMaker walkthrough.
It was going pretty smoothly until it was time to fit the estimator, where I ran into two exceptions.
The first was a CreateBucket permission-denied error. I added an output_path
parameter to the Hugging Face estimator pointing to my S3 bucket/prefix, and that error no longer appears.
However, I’m not sure it’s actually “fixed”, since I now get a ValueError about a missing “scripts” directory.
From the tutorial, this is the original estimator:
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # training script
    source_dir           = 'scripts',         # directory containing all files needed for training
    instance_type        = 'ml.g5.4xlarge',   # instance type used for the training job
    instance_count       = 1,                 # number of instances used for training
    base_job_name        = job_name,          # name of the training job
    role                 = role,              # IAM role used by the training job to access AWS resources, e.g. S3
    volume_size          = 300,               # size of the EBS volume in GB
    transformers_version = '4.28',            # Transformers version used in the training job
    pytorch_version      = '2.0',             # PyTorch version used in the training job
    py_version           = 'py310',           # Python version used in the training job
    hyperparameters      = hyperparameters,   # hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # env variable to cache models in /tmp
)
Updated with the output_path:
huggingface_estimator = HuggingFace(
    output_path          = 's3://mybucket/jkyle/sagemaker/output',
    entry_point          = 'run_clm.py',      # training script
    source_dir           = 'scripts',         # directory containing all files needed for training
    instance_type        = 'ml.g5.4xlarge',   # instance type used for the training job
    instance_count       = 1,                 # number of instances used for training
    base_job_name        = job_name,          # name of the training job
    role                 = role,              # IAM role used by the training job to access AWS resources, e.g. S3
    volume_size          = 300,               # size of the EBS volume in GB
    transformers_version = '4.28',            # Transformers version used in the training job
    pytorch_version      = '2.0',             # PyTorch version used in the training job
    py_version           = 'py310',           # Python version used in the training job
    hyperparameters      = hyperparameters,   # hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # env variable to cache models in /tmp
)
And the exception:
ValueError: No file named "run_clm.py" was found in directory "scripts".
I’m not clear on what root path it looks for the scripts directory under. Do I need to upload or create it somewhere?
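If it helps, here’s a minimal check I’ve been using to see where the lookup happens (this assumes source_dir is resolved relative to the current working directory of the notebook/process creating the estimator; the helper name check_source_dir is just mine, not part of the SDK):

```python
from pathlib import Path

def check_source_dir(source_dir: str, entry_point: str) -> bool:
    """Return True if entry_point exists inside source_dir.

    Mirrors the kind of existence check that trips the
    'No file named ... was found in directory ...' ValueError.
    """
    script = Path(source_dir) / entry_point
    # Show the absolute path being inspected, resolved against cwd
    print(f"Looking for: {script.resolve()}")
    return script.is_file()

# This should print True before calling huggingface_estimator.fit(...)
print(check_source_dir("scripts", "run_clm.py"))
```

So if that prints False, the scripts/ directory (with run_clm.py inside it) needs to exist next to wherever the notebook is running, not in S3.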
Cheers & thanks for any tips!