ValueError: Source directory does not exist in the repo. Training causal lm in sagemaker

Jorgeutd · July 21, 2021, 8:11pm

ValueError: Source directory does not exist in the repo. (Looks like the link is broken).

I am getting the error above when trying to train a text generation model in SageMaker.

Please see the script and configuration that I am using:

import sagemaker
from sagemaker.huggingface import HuggingFace

# v4.6.1 refers to the transformers_version you use in the estimator
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.6.1'}

# gets role for executing training job
role = sagemaker.get_execution_role()
hyperparameters = {
    'model_name_or_path': 'ktangri/gpt-neo-demo',
    'output_dir': '/opt/ml/model',
    'fp16': True,
    'train_file': '/opt/ml/input/data/train/train.csv',
    'validation_file': '/opt/ml/input/data/validation/validation.csv'
    # add your remaining hyperparameters
    # more info here https://github.com/huggingface/transformers/tree/v4.6.1/examples/language-modeling
}

# configuration for running training on smdistributed Data Parallel
distribution = {'smdistributed': {'dataparallel': {'enabled': True}}}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.6.1'}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='/examples/language-modeling',
    instance_type='ml.p3.16xlarge',
    instance_count=2,
    role=role,
    git_config=git_config,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    hyperparameters=hyperparameters
)

huggingface_estimator.fit(
    {'train': 's3://ch-questions-dataset-east1/train/train.csv',
     'validation': 's3://ch-questions-dataset-east1/validation/validation.csv'}
)

Thank you.

philschmid · July 22, 2021, 9:37am

Thanks for pointing this out. With transformers 4.6.1 the examples/ structure changed.
The source_dir is now:

source_dir='examples/pytorch/language-modeling',

We’ll fix the code snippet on the hub.
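To see why the old value fails, it can help to check a candidate source_dir against a local checkout before launching a job. The sketch below is only an illustration: `source_dir_exists` is a hypothetical helper, and it assumes the SDK tests for the directory relative to the root of the cloned repo. It builds a throwaway folder that mimics the v4.6.1 layout instead of cloning anything.

```python
import os
import tempfile


def source_dir_exists(repo_root, source_dir):
    """Return True if source_dir is a directory inside repo_root.

    Hypothetical helper, not part of the SageMaker SDK.
    """
    # A leading "/" would make os.path.join discard repo_root entirely,
    # so treat source_dir as relative to the repo root.
    relative = source_dir.lstrip("/")
    return os.path.isdir(os.path.join(repo_root, relative))


# Throwaway directory standing in for a checkout of the v4.6.1 tag.
with tempfile.TemporaryDirectory() as repo_root:
    os.makedirs(os.path.join(repo_root, "examples", "pytorch", "language-modeling"))

    old_layout_ok = source_dir_exists(repo_root, "/examples/language-modeling")
    new_layout_ok = source_dir_exists(repo_root, "examples/pytorch/language-modeling")

print("old path found:", old_layout_ok)
print("new path found:", new_layout_ok)
```

Running the same check against a real clone of the branch named in git_config should show which of the two paths exists.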

Jorgeutd · July 22, 2021, 10:06am

Thank you Philipp. I will make these changes. It would be great if you could create a notebook example or post for this type of task.

In terms of data setup, I reviewed the training script and, correct me if I am wrong, I did not see any requirement to add start-of-text or end-of-text prefixes to the data. Is a single column named text containing the input text all that is needed?

Thank you again. Great job as always

philschmid · July 23, 2021, 8:29am

Hey @Jorgeutd,

By adding a prefix to the data, do you mean something like this?

>>> def add_prefix(example):
...     example['sentence1'] = 'My sentence: ' + example['sentence1']
...     return example
...
>>> updated_dataset = small_dataset.map(add_prefix)
>>> updated_dataset['sentence1'][:5]
['My sentence: Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 "My sentence: Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
 'My sentence: They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .',
 'My sentence: Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .',
]

If you would like to do something like that, you could either copy and modify the script, or create an additional pre-processing script that saves your data as CSV with the prefix, and then load that file using run_clm.py.
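A minimal sketch of the second option, using only the standard library. The helper name `add_prefix_to_csv`, the file names, and the prefix string are illustrative assumptions; the column name `text` follows the question above.

```python
import csv
import os
import tempfile


def add_prefix_to_csv(src_path, dst_path, column="text", prefix="My sentence: "):
    """Copy a CSV file, prepending `prefix` to every value in `column`."""
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row[column] = prefix + row[column]
            writer.writerow(row)


# Small demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "raw.csv")
    out_path = os.path.join(tmp, "train.csv")

    with open(raw_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([["text"], ["first example"], ["second, with a comma"]])

    add_prefix_to_csv(raw_path, out_path)

    with open(out_path, newline="", encoding="utf-8") as f:
        prefixed_rows = list(csv.DictReader(f))

for row in prefixed_rows:
    print(row["text"])
```

The resulting file can then be uploaded to S3 and passed as the train or validation channel, as in the estimator call at the top of the thread.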

Jorgeutd · July 23, 2021, 3:13pm

Thank you Philipp. That is what I ended up doing. I am having a lot of issues when trying to deploy an endpoint with custom inference code, using:
model_data='s3://sagemaker-us-east-1-197614225699/huggingface-pytorch-training-2021-07-22-18-47-50-728/output/model.tar.gz', # local path where *.tar.gz is saved

I keep getting multiple OSError: [Errno 28] No space left on device errors, which does not align with the space available on the instance. I think this is more on the Amazon side.

Jorgeutd · July 23, 2021, 3:15pm

This is the error:

OSError Traceback (most recent call last)
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/utils.py in repack_model(inference_script, source_directory, dependencies, model_uri, repacked_model_uri, sagemaker_session, kms_key)
419 with tarfile.open(tmp_model_path, mode="w:gz") as t:
→ 420 t.add(model_dir, arcname=os.path.sep)
421

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/tarfile.py in add(self, name, arcname, recursive, exclude, filter)
1960 self.add(os.path.join(name, f), os.path.join(arcname, f),
→ 1961 recursive, exclude, filter=filter)
1962

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/tarfile.py in add(self, name, arcname, recursive, exclude, filter)
1953 with bltn_open(name, "rb") as f:
→ 1954 self.addfile(tarinfo, f)
1955

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/tarfile.py in addfile(self, tarinfo, fileobj)
1981 if fileobj is not None:
→ 1982 copyfileobj(fileobj, self.fileobj, tarinfo.size, bufsize=bufsize)
1983 blocks, remainder = divmod(tarinfo.size, BLOCKSIZE)

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/tarfile.py in copyfileobj(src, dst, length, exception, bufsize)
251 raise exception("unexpected end of data")
→ 252 dst.write(buf)
253

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/gzip.py in write(self, data)
263 if length > 0:
→ 264 self.fileobj.write(self.compress.compress(data))
265 self.size += length

OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

OSError Traceback (most recent call last)
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/gzip.py in close(self)
308 if self.mode == WRITE:
→ 309 fileobj.write(self.compress.flush())
310 write32u(fileobj, self.crc)

OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

OSError Traceback (most recent call last)
in
5 predictor = model_for_deployment.deploy(initial_instance_count=1,
6 instance_type="ml.g4dn.2xlarge",
----> 7 endpoint_name=endpoint_name
8 )

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, **kwargs)
709 self._base_name = "-".join((self._base_name, compiled_model_suffix))
710
→ 711 self._create_sagemaker_model(instance_type, accelerator_type, tags)
712 production_variant = sagemaker.production_variant(
713 self.name, instance_type, initial_instance_count, accelerator_type=accelerator_type

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/model.py in _create_sagemaker_model(self, instance_type, accelerator_type, tags)
263 /api/latest/reference/services/sagemaker.html#SageMaker.Client.add_tags
264 """
→ 265 container_def = self.prepare_container_def(instance_type, accelerator_type=accelerator_type)
266
267 self._ensure_base_name_if_needed(container_def["Image"])

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/huggingface/model.py in prepare_container_def(self, instance_type, accelerator_type)
269
270 deploy_key_prefix = model_code_key_prefix(self.key_prefix, self.name, deploy_image)
→ 271 self._upload_code(deploy_key_prefix, repack=True)
272 deploy_env = dict(self.env)
273 deploy_env.update(self._framework_env_vars())

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/model.py in _upload_code(self, key_prefix, repack)
1088 repacked_model_uri=repacked_model_data,
1089 sagemaker_session=self.sagemaker_session,
→ 1090 kms_key=self.model_kms_key,
1091 )
1092

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/utils.py in repack_model(inference_script, source_directory, dependencies, model_uri, repacked_model_uri, sagemaker_session, kms_key)
418 tmp_model_path = os.path.join(tmp, "temp-model.tar.gz")
419 with tarfile.open(tmp_model_path, mode="w:gz") as t:
→ 420 t.add(model_dir, arcname=os.path.sep)
421
422 _save_model(repacked_model_uri, tmp_model_path, sagemaker_session, kms_key=kms_key)

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/tarfile.py in __exit__(self, type, value, traceback)
2439 # it would try to write end-of-archive blocks and padding.
2440 if not self._extfileobj:
→ 2441 self.fileobj.close()
2442 self.closed = True
2443

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/gzip.py in close(self)
317 if myfileobj:
318 self.myfileobj = None
→ 319 myfileobj.close()
320
321 def flush(self,zlib_mode=zlib.Z_SYNC_FLUSH):

OSError: [Errno 28] No space left on device
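Note that the traceback fails inside repack_model while writing temp-model.tar.gz. That step runs in the Python environment calling deploy(), not on the endpoint instance, so the disk that fills up is the local one. A small sketch for checking free space there follows. It assumes the SDK writes to Python's default temporary directory, which the thread does not confirm, and `free_gib` is a hypothetical helper.

```python
import shutil
import tempfile


def free_gib(path):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024 ** 3


# Assumption: the repacking step uses the default temp location.
tmp_dir = tempfile.gettempdir()
free_space = free_gib(tmp_dir)
print(f"{tmp_dir}: {free_space:.1f} GiB free")
```

Repacking needs room for both the extracted model and the new archive, so an 8.1 GB model.tar.gz can need considerably more than 8.1 GB of free space.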

philschmid · July 26, 2021, 6:59am

Hey @Jorgeutd,
could you provide some information about your model.tar.gz?

What is the size of the archive?
What is the size of the archive unzipped?
What does the file structure look like?

cc @dan21c

Jorgeutd · July 26, 2021, 2:28pm

What is the size of the archive?
Size: 8.1 GB

What is the size of the archive unzipped?
Not sure about this. I think the checkpoints contain the saved models and the optimizer, but I can delete the checkpoints.

What does the file structure look like?

At this point, I have tried different instance configurations to deploy with custom inference code, and nothing has worked.

philschmid · July 26, 2021, 3:38pm

Could you try to create the model.tar.gz by hand, including only the inference-relevant files and leaving out the checkpoints? Then test it again.
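One way to build such an archive is sketched below with the standard tarfile module. It assumes the checkpoint folders follow the Trainer's usual checkpoint-<step> naming; the helper names and the demo files are hypothetical, so adjust the filter to your actual output directory.

```python
import os
import tarfile
import tempfile


def _skip_checkpoints(tarinfo):
    """Drop any entry located in (or named as) a checkpoint-* directory."""
    if any(part.startswith("checkpoint-") for part in tarinfo.name.split("/")):
        return None  # returning None excludes the entry from the archive
    return tarinfo


def pack_model(model_dir, archive_path):
    """Write model_dir's contents to archive_path, without checkpoints."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in sorted(os.listdir(model_dir)):
            # arcname=name keeps files at the archive root
            tar.add(os.path.join(model_dir, name), arcname=name,
                    filter=_skip_checkpoints)


# Demonstration with placeholder files in throwaway directories.
with tempfile.TemporaryDirectory() as model_dir, \
        tempfile.TemporaryDirectory() as out_dir:
    os.makedirs(os.path.join(model_dir, "checkpoint-500"))
    for rel in ("config.json", "pytorch_model.bin", "checkpoint-500/optimizer.pt"):
        with open(os.path.join(model_dir, rel), "w") as f:
            f.write("placeholder")

    archive_path = os.path.join(out_dir, "model.tar.gz")
    pack_model(model_dir, archive_path)

    with tarfile.open(archive_path) as tar:
        packed_names = tar.getnames()

print(packed_names)
```

Writing the archive outside model_dir avoids packing the archive into itself. The smaller file can then be uploaded to S3 and used as model_data.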

We will look into it.
