ValueError: Source directory does not exist in the repo. Training causal lm in sagemaker

ValueError: Source directory does not exist in the repo. (Looks like the link is broken).

I am getting the error above when trying to train a text generation model in SageMaker.

Please see the script and configuration that I am using:

import sagemaker
from sagemaker.huggingface import HuggingFace

# gets role for executing training job
role = sagemaker.get_execution_role()
hyperparameters = {
    'model_name_or_path': 'ktangri/gpt-neo-demo',
    'output_dir': '/opt/ml/model',
    'fp16': True,
    'train_file': '/opt/ml/input/data/train/train.csv',
    'validation_file': '/opt/ml/input/data/validation/validation.csv'
    # add your remaining hyperparameters
    # more info here https://github.com/huggingface/transformers/tree/v4.6.1/examples/language-modeling
}

# configuration for running training on smdistributed Data Parallel
distribution = {'smdistributed': {'dataparallel': {'enabled': True}}}

# git configuration to download our fine-tuning script
# v4.6.1 is referring to the transformers_version you use in the estimator.
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.6.1'}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='/examples/language-modeling',
    instance_type='ml.p3.16xlarge',
    instance_count=2,
    role=role,
    git_config=git_config,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    hyperparameters=hyperparameters
)

huggingface_estimator.fit(
    {'train': 's3://ch-questions-dataset-east1/train/train.csv',
     'validation': 's3://ch-questions-dataset-east1/validation/validation.csv'}
)

Thank you.

Thanks for pointing this out. With transformers 4.6.1 the examples/ structure changed, so source_dir is now:

source_dir='examples/pytorch/language-modeling',

We’ll fix the code snippet on the hub.
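
To see why the original value could not be found, here is a minimal sketch of checking a source_dir against a local clone. The helper name source_dir_exists and the temporary directory layout are made up for illustration; this is not the SageMaker SDK's own code, only the same kind of path check.

```python
import os
import tempfile


def source_dir_exists(repo_root: str, source_dir: str) -> bool:
    """Return True if source_dir resolves to a directory inside repo_root.

    Hypothetical helper, for illustration only.
    """
    # os.path.join discards repo_root when source_dir is absolute, so a value
    # with a leading slash is looked up at the filesystem root, not in the clone.
    candidate = os.path.join(repo_root, source_dir)
    return os.path.isdir(candidate)


# Simulate the v4.6.1 examples layout in a temporary directory instead of cloning.
repo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(repo_root, "examples", "pytorch", "language-modeling"))

old_ok = source_dir_exists(repo_root, "/examples/language-modeling")
new_ok = source_dir_exists(repo_root, "examples/pytorch/language-modeling")
print(old_ok, new_ok)
```

Running the same check against a real checkout of the v4.6.1 branch is a quick way to confirm a source_dir value before starting a training job.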

Thank you Philipp. I will make these changes. It would be great if you could create a notebook example or post for this type of task.

Regarding data setup: I reviewed the training script and, correct me if I am wrong, I did not see any requirement to add start-of-text or end-of-text prefixes to the data. Is a single column named text containing the input text enough?

Thank you again. Great job as always

Hey @Jorgeutd,

Do you mean adding a prefix to the data, something like this?

>>> def add_prefix(example):
...     example['sentence1'] = 'My sentence: ' + example['sentence1']
...     return example
...
>>> updated_dataset = small_dataset.map(add_prefix)
>>> updated_dataset['sentence1'][:5]
['My sentence: Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 "My sentence: Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
 'My sentence: They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .',
 'My sentence: Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .',
]

If you would like to do something like that, you could either copy and modify the script, or create an additional pre-processing script that saves your data as CSV with the prefix and then load it using run_clm.py.
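
The second option could look roughly like this. It is a minimal sketch using only the standard library, assuming the input CSV has a column named text as described earlier in the thread; the helper name add_prefix_to_csv, the file names, and the prefix string are hypothetical.

```python
import csv
import os
import tempfile

PREFIX = "My sentence: "  # hypothetical prefix, borrowed from the example above


def add_prefix_to_csv(src_path, dst_path, column="text", prefix=PREFIX):
    """Copy a CSV file, prepending prefix to every value in the given column."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row[column] = prefix + row[column]
            writer.writerow(row)


# Tiny demo on throwaway files; replace the paths with your own data.
workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "raw.csv")
train_path = os.path.join(workdir, "train.csv")
with open(raw_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["text"], ["first example"], ["second, with a comma"]])

add_prefix_to_csv(raw_path, train_path)
with open(train_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(rows[0]["text"])
```

The resulting train.csv can then be uploaded to S3 and passed to run_clm.py through train_file, with the training script itself left unchanged.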

Thank you Philipp. That is what I ended up doing. I am having a lot of issues when trying to deploy an endpoint with custom inference Python code, using:
model_data='s3://sagemaker-us-east-1-197614225699/huggingface-pytorch-training-2021-07-22-18-47-50-728/output/model.tar.gz', # S3 path where the *.tar.gz is saved

I keep getting multiple OSError: [Errno 28] No space left on device errors, which does not align with the space available on the instance. I think this is more on the Amazon side.

This is the error:

OSError Traceback (most recent call last)
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/utils.py in repack_model(inference_script, source_directory, dependencies, model_uri, repacked_model_uri, sagemaker_session, kms_key)
419 with tarfile.open(tmp_model_path, mode="w:gz") as t:
→ 420 t.add(model_dir, arcname=os.path.sep)
421

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/tarfile.py in add(self, name, arcname, recursive, exclude, filter)
1960 self.add(os.path.join(name, f), os.path.join(arcname, f),
→ 1961 recursive, exclude, filter=filter)
1962

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/tarfile.py in add(self, name, arcname, recursive, exclude, filter)
1960 self.add(os.path.join(name, f), os.path.join(arcname, f),
→ 1961 recursive, exclude, filter=filter)
1962

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/tarfile.py in add(self, name, arcname, recursive, exclude, filter)
1953 with bltn_open(name, “rb”) as f:
→ 1954 self.addfile(tarinfo, f)
1955

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/tarfile.py in addfile(self, tarinfo, fileobj)
1981 if fileobj is not None:
→ 1982 copyfileobj(fileobj, self.fileobj, tarinfo.size, bufsize=bufsize)
1983 blocks, remainder = divmod(tarinfo.size, BLOCKSIZE)

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/tarfile.py in copyfileobj(src, dst, length, exception, bufsize)
251 raise exception("unexpected end of data")
→ 252 dst.write(buf)
253

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/gzip.py in write(self, data)
263 if length > 0:
→ 264 self.fileobj.write(self.compress.compress(data))
265 self.size += length

OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

OSError Traceback (most recent call last)
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/gzip.py in close(self)
308 if self.mode == WRITE:
→ 309 fileobj.write(self.compress.flush())
310 write32u(fileobj, self.crc)

OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

OSError Traceback (most recent call last)
in
5 predictor = model_for_deployment.deploy(initial_instance_count=1,
6 instance_type="ml.g4dn.2xlarge",
----> 7 endpoint_name=endpoint_name
8 )

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, **kwargs)
709 self._base_name = "-".join((self._base_name, compiled_model_suffix))
710
→ 711 self._create_sagemaker_model(instance_type, accelerator_type, tags)
712 production_variant = sagemaker.production_variant(
713 self.name, instance_type, initial_instance_count, accelerator_type=accelerator_type

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/model.py in _create_sagemaker_model(self, instance_type, accelerator_type, tags)
263 /api/latest/reference/services/sagemaker.html#SageMaker.Client.add_tags
264 """
→ 265 container_def = self.prepare_container_def(instance_type, accelerator_type=accelerator_type)
266
267 self._ensure_base_name_if_needed(container_def["Image"])

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/huggingface/model.py in prepare_container_def(self, instance_type, accelerator_type)
269
270 deploy_key_prefix = model_code_key_prefix(self.key_prefix, self.name, deploy_image)
→ 271 self._upload_code(deploy_key_prefix, repack=True)
272 deploy_env = dict(self.env)
273 deploy_env.update(self._framework_env_vars())

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/model.py in _upload_code(self, key_prefix, repack)
1088 repacked_model_uri=repacked_model_data,
1089 sagemaker_session=self.sagemaker_session,
→ 1090 kms_key=self.model_kms_key,
1091 )
1092

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/utils.py in repack_model(inference_script, source_directory, dependencies, model_uri, repacked_model_uri, sagemaker_session, kms_key)
418 tmp_model_path = os.path.join(tmp, "temp-model.tar.gz")
419 with tarfile.open(tmp_model_path, mode="w:gz") as t:
→ 420 t.add(model_dir, arcname=os.path.sep)
421
422 _save_model(repacked_model_uri, tmp_model_path, sagemaker_session, kms_key=kms_key)

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/tarfile.py in __exit__(self, type, value, traceback)
2439 # it would try to write end-of-archive blocks and padding.
2440 if not self._extfileobj:
→ 2441 self.fileobj.close()
2442 self.closed = True
2443

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/gzip.py in close(self)
317 if myfileobj:
318 self.myfileobj = None
→ 319 myfileobj.close()
320
321 def flush(self,zlib_mode=zlib.Z_SYNC_FLUSH):

OSError: [Errno 28] No space left on device
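
The traceback fails inside repack_model while writing temp-model.tar.gz, which suggests the disk that fills up may be the one where deploy() is called (for example the notebook instance) rather than the endpoint instance. A minimal sketch for checking the scratch space there follows; the archive size and the 3x factor are rough assumptions, not values from the SageMaker SDK.

```python
import shutil
import tempfile


def free_gib(path: str) -> float:
    """Return the free space on the filesystem holding path, in GiB."""
    return shutil.disk_usage(path).free / 1024 ** 3


tmp_dir = tempfile.gettempdir()
archive_gib = 8.0  # hypothetical archive size; replace with your own
# Assumption: repacking needs room for the downloaded archive, the extracted
# files, and the new archive at the same time, so budget about 3x the size.
needed_gib = 3 * archive_gib
available = free_gib(tmp_dir)
print(f"{tmp_dir}: {available:.1f} GiB free, rough need {needed_gib:.1f} GiB")
if available < needed_gib:
    print("Likely too little scratch space to repack the model here.")
```

If the free space is below the estimate, a smaller archive or a larger volume on the machine running the SDK would be the first things to try.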

Hey @Jorgeutd,
could you provide some information about your model.tar.gz?

  • What is the size of the archive?
  • What is the size of the archive unzipped?
  • What does the file structure look like?
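
All three questions can be answered locally without extracting the archive. This is a minimal sketch using the standard tarfile module; describe_archive is a hypothetical helper name, and the demo archive below is a throwaway stand-in for a real model.tar.gz.

```python
import os
import tarfile
import tempfile


def describe_archive(path: str):
    """Return (compressed_bytes, unpacked_bytes, top_level_names) for a tar.gz."""
    compressed = os.path.getsize(path)
    unpacked = 0
    top_level = set()
    with tarfile.open(path, mode="r:gz") as tar:
        for member in tar:
            if member.isfile():
                unpacked += member.size
            # First path component, ignoring a leading "./" or "/".
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if parts:
                top_level.add(parts[0])
    return compressed, unpacked, sorted(top_level)


# Demo on a throwaway archive; point describe_archive at your own model.tar.gz.
workdir = tempfile.mkdtemp()
os.makedirs(os.path.join(workdir, "model", "checkpoint-500"))
for rel, size in [("config.json", 10), ("checkpoint-500/optimizer.pt", 1000)]:
    with open(os.path.join(workdir, "model", rel), "wb") as f:
        f.write(b"0" * size)
demo_path = os.path.join(workdir, "model.tar.gz")
with tarfile.open(demo_path, mode="w:gz") as tar:
    tar.add(os.path.join(workdir, "model"), arcname=".")

compressed, unpacked, top_level = describe_archive(demo_path)
print(compressed, unpacked, top_level)
```

Any checkpoint-* folders showing up in the top-level names are a hint that training checkpoints were packed along with the final model.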

cc @dan21c

Quoting philschmid: "could you provide some information about your model.tar.gz?"

  • What is the size of the archive?
    Size: 8.1 GB

  • What is the size of the archive unzipped?
    Not sure about this. I think the checkpoints contain the saved models and optimizer states, but I can delete the checkpoints.

  • What does the file structure look like?

At this point, I have tried different instance configurations to deploy with custom inference code, and nothing works.

Could you try to create the model.tar.gz by hand, including only the inference-relevant files and leaving out the checkpoints? Then test it again.
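
Creating the archive by hand could be sketched as follows. This assumes the intermediate checkpoints sit in checkpoint-<step> folders inside the model directory, which is where the Trainer normally writes them; pack_for_inference is a hypothetical helper name and the demo files are placeholders.

```python
import os
import tarfile
import tempfile


def pack_for_inference(model_dir: str, out_path: str) -> list:
    """Write a tar.gz of model_dir's top-level entries, skipping checkpoint-* dirs.

    out_path should live outside model_dir so the archive does not pack itself.
    """
    packed = []
    with tarfile.open(out_path, mode="w:gz") as tar:
        for name in sorted(os.listdir(model_dir)):
            full = os.path.join(model_dir, name)
            # Assumption: intermediate checkpoints live in "checkpoint-<step>" folders.
            if os.path.isdir(full) and name.startswith("checkpoint-"):
                continue
            tar.add(full, arcname=name)  # keep entries at the archive root
            packed.append(name)
    return packed


# Demo with placeholder files standing in for an extracted training output.
model_dir = tempfile.mkdtemp()
out_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(model_dir, "checkpoint-500"))
for rel in ["config.json", "pytorch_model.bin", "checkpoint-500/optimizer.pt"]:
    with open(os.path.join(model_dir, rel), "wb") as f:
        f.write(b"0")

out_path = os.path.join(out_dir, "model.tar.gz")
packed = pack_for_inference(model_dir, out_path)
with tarfile.open(out_path, mode="r:gz") as tar:
    names = sorted(tar.getnames())
print(names)
```

Only checkpoint-* folders are skipped, so a code/ folder with a custom inference script, if you keep one in the model directory, would still be included.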

We will look into it.