Fine-tuning Llama 2 walkthrough: missing "scripts" directory error

I’m following the Fine-Tuning Llama 2 on SageMaker walkthrough.

It was going pretty smoothly until it was time to fit the estimator, where I ran into two exceptions.

The first is a CreateBucket permission-denied error. I added an output_path parameter to the HuggingFace estimator pointing to my S3 bucket/prefix, and it no longer throws that error.

However, I’m not sure it’s actually “fixed”, since I now get a ValueError for a missing “scripts” directory.

From the tutorial, this is the original estimator:

huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # train script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.4xlarge',   # instance type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in the training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the PyTorch version used in the training job
    py_version           = 'py310',           # the Python version used in the training job
    hyperparameters      =  hyperparameters,  # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
)

Updated with the output_path:

huggingface_estimator = HuggingFace(
    output_path          = 's3://mybucket/jkyle/sagemaker/output',
    entry_point          = 'run_clm.py',      # train script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.4xlarge',   # instance type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in the training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the PyTorch version used in the training job
    py_version           = 'py310',           # the Python version used in the training job
    hyperparameters      =  hyperparameters,  # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
)

And here’s the exception:

ValueError: No file named "run_clm.py" was found in directory "scripts".

I’m not clear on what root path it’s looking for the scripts directory under. Do I need to upload or create this somewhere?

Cheers & thanks for any tips!

@jkyle James, just checking whether you were able to get past this?
I’m facing the same issue as well, following the same reference.
@philschmid

Not yet.

But I haven’t yet tried (and will next):

  • Creating the scripts directory in the same directory as the local notebook I’m running (see the quick check below).
  • Seeing if I can copy a scripts directory over to the session. This is a little opaque to me.
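For the first point, my understanding is that source_dir='scripts' is resolved relative to the notebook’s current working directory, so a quick sanity check like this (just a sketch of what I plan to run) should show whether the directory and training script exist locally before calling .fit():

import os

# source_dir='scripts' should be resolved relative to the notebook's
# working directory, so the train script must exist locally before .fit().
print(os.getcwd())                                             # where the notebook is running
print(os.path.isdir('scripts'))                                # does the directory exist here?
print(os.path.isfile(os.path.join('scripts', 'run_clm.py')))   # is the train script inside it?

For the second point, the SDK docs apparently also allow source_dir to be an S3 URI pointing to a tar.gz of the code, which might be what “copying it over to the session” amounts to, but I haven’t tried that yet.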

Did you figure this out yet? I’m facing the same issue too. @philschmid Can you please help with this? Thanks in advance!

OK, I think I missed it. In the article it says the file can be found here: https://github.com/philschmid/sagemaker-huggingface-llama-2-samples/blob/master/training/scripts/run_clm.py
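In case it helps, I think something like this should pull it down into a local scripts/ directory next to the notebook so that source_dir='scripts' resolves (the URL below is just my guess at the raw-content version of the link above, and there may also be a requirements.txt in that folder of the repo that needs to sit alongside it):

import os
import urllib.request

# Create a local scripts/ directory next to the notebook and download
# run_clm.py from the sample repo into it, so source_dir='scripts' resolves.
os.makedirs('scripts', exist_ok=True)
url = ('https://raw.githubusercontent.com/philschmid/'
       'sagemaker-huggingface-llama-2-samples/master/training/scripts/run_clm.py')
urllib.request.urlretrieve(url, os.path.join('scripts', 'run_clm.py'))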

Any solution for this? I’m facing the same error.