Running custom data files on run_summarization.py

Hi there,

I have been running a script to fine-tune a pretrained transformer on a summarization task. I am using custom data that I have uploaded to my S3 bucket, which is also the default bucket for this job.

I have been getting the error below and have not been able to figure out a solution. To check whether the custom dataset is the issue, I ran the exact same script on the xsum dataset, and I can confirm that the job works with xsum.

from sagemaker.huggingface import HuggingFace

hyperparameters={
    'model_name_or_path': 'google/pegasus-large',
    'train_file': "/opt/ml/input/data/train/final_aws_deepgram_train.csv",
    'test_file': "/opt/ml/input/data/test/final_aws_deepgram_test.csv",
    'validation_file': "/opt/ml/input/data/validation/final_aws_deepgram_validation.csv",
    'text_column': 'document',
    'summary_column': 'summary',
    'do_train': True,
    'do_eval': True,
    'fp16': True,
    'per_device_train_batch_size': 2,
    'per_device_eval_batch_size': 2,
    'evaluation_strategy': "steps",
    'eval_steps': 200,
    'weight_decay': 0.01,
    'learning_rate': 2e-5,
    'max_grad_norm': 1,
    'max_steps': 200,
    'max_source_length': 500,
    'max_target_length': 100,
    'load_best_model_at_end': True,
    'output_dir': '/opt/ml/model'
}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.6.1'}

# instance configurations
instance_type='ml.p3.2xlarge'
instance_count=1
volume_size=200

# metric definition to extract the results
metric_definitions=[
    {"Name": "train_runtime", "Regex": r"train_runtime.*=\D*(.*?)$"},
    {"Name": "train_samples_per_second", "Regex": r"train_samples_per_second.*=\D*(.*?)$"}
]
huggingface_estimator = HuggingFace(entry_point='run_summarization_original.py',
                                    source_dir='transformers/examples/pytorch/summarization',
                                    git_config=git_config,
                                    metric_definitions=metric_definitions,
                                    instance_type=instance_type,
                                    instance_count=instance_count,
                                    volume_size=volume_size,
                                    role=role,
                                    transformers_version='4.6.1',
                                    pytorch_version='1.7.1',
                                    py_version='py36',
                                    hyperparameters=hyperparameters)
# starting the training job
huggingface_estimator.fit(
  {'train': 's3://qfn-transcription/ujjawal_files/final_aws_deepgram_train.csv',
   'test': 's3://qfn-transcription/ujjawal_files/final_aws_deepgram_test.csv',
   'validation': 's3://qfn-transcription/ujjawal_files/final_aws_deepgram_validate.csv'}
)
2021-06-22 15:34:39 Starting - Starting the training job...
2021-06-22 15:35:03 Starting - Launching requested ML instancesProfilerReport-1624376073: InProgress
.........
2021-06-22 15:36:33 Starting - Preparing the instances for training.........
2021-06-22 15:38:06 Downloading - Downloading input data
2021-06-22 15:38:06 Training - Downloading the training image.....................
2021-06-22 15:41:34 Uploading - Uploading generated training model
2021-06-22 15:41:34 Failed - Training job failed
..
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-7-ca8819244de5> in <module>
      3   {'train': 's3://qfn-transcription/ujjawal_files/final_aws_deepgram_train.csv',
      4    'test': 's3://qfn-transcription/ujjawal_files/final_aws_deepgram_test.csv',
----> 5   'validation': 's3://qfn-transcription/ujjawal_files/final_aws_deepgram_validate.csv'}
      6 )

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    680         self.jobs.append(self.latest_training_job)
    681         if wait:
--> 682             self.latest_training_job.wait(logs=logs)
    683 
    684     def _compilation_job_name(self):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1623         # If logs are requested, call logs_for_jobs.
   1624         if logs != "None":
-> 1625             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1626         else:
   1627             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3683 
   3684         if wait:
-> 3685             self._check_job_status(job_name, description, "TrainingJobStatus")
   3686             if dot:
   3687                 print()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3243                 ),
   3244                 allowed_statuses=["Completed", "Stopped"],
-> 3245                 actual_status=status,
   3246             )
   3247 

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2021-06-22-15-34-33-634: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 run_summarization_original.py --do_eval True --do_train True --eval_steps 200 --evaluation_strategy steps --fp16 True --learning_rate 2e-05 --load_best_model_at_end True --max_grad_norm 1 --max_source_length 500 --max_steps 200 --max_target_length 100 --model_name_or_path google/pegasus-large --output_dir /opt/ml/model --per_device_eval_batch_size 2 --per_device_train_batch_size 2 --summary_column summary --test_file /opt/ml/input/data/test/final_aws_deepgram_test.csv --text_column document --train_file /opt/ml/input/data/train/final_aws_deepgram_train.csv --validation_file /opt/ml/input/data/validation/final_aws_deepgram_validation.csv --weight_decay 0.01"
Traceback (most recent call last):
  File "run_summarization_original.py", line 606, in <module>
    main()
  File "run_summarization_original.py", line 325, in main
    datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir)
  File "/opt/conda/lib/p

Thanks!

Hey @ujjirox,

Is the error log you attached the whole error log? There should be more of it, maybe in CloudWatch.
Additionally, this part of the output:

2021-06-22 15:34:39 Starting - Starting the training job...
2021-06-22 15:35:03 Starting - Launching requested ML instancesProfilerReport-1624376073: InProgress
.........
2021-06-22 15:36:33 Starting - Preparing the instances for training.........
2021-06-22 15:38:06 Downloading - Downloading input data
2021-06-22 15:38:06 Training - Downloading the training image.....................
2021-06-22 15:41:34 Uploading - Uploading generated training model
2021-06-22 15:41:34 Failed - Training job failed

→ shows that training ran for about 4 minutes and that SageMaker also tried to upload your model. Maybe there was an issue with uploading the model.

Hey Phil,

I tried going to CloudWatch, and it seems that no log stream is being created for this particular training job. Do you happen to know how I might obtain the whole error log? Sorry for the very beginner question.

Thanks.

You can go to the SageMaker Dashboard → Training → Training Jobs → select your job → there should be a "View logs" link.
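If the console link shows nothing, you can also try pulling the logs straight from CloudWatch. Here is a minimal boto3 sketch; it assumes the training logs land in the /aws/sagemaker/TrainingJobs log group, where the stream names are prefixed with the job name from your exception:

import boto3

logs = boto3.client("logs")

# training job name taken from the exception above
job_name = "huggingface-pytorch-training-2021-06-22-15-34-33-634"

# list the log streams SageMaker created for this job
streams = logs.describe_log_streams(
    logGroupName="/aws/sagemaker/TrainingJobs",
    logStreamNamePrefix=job_name,
)

# print every event from every stream
for stream in streams["logStreams"]:
    events = logs.get_log_events(
        logGroupName="/aws/sagemaker/TrainingJobs",
        logStreamName=stream["logStreamName"],
        startFromHead=True,
    )
    for event in events["events"]:
        print(event["message"])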

Hey Phil,

When I click on "View logs", it seems there are no logs associated with this training job.

There is a folder being saved into the qfn-transcription bucket, which is the default bucket. It seems to contain some logs, but not the full error log.

Thanks.

Okay, I have never seen this before. Are you in the same region?
Could you try rerunning the job?

Hmm. I just tried rerunning the job and no luck… Still nothing. Any suggestions on what I should try next?

@OlivierCR any idea why SageMaker is not creating logs?

Just want to add: the IAM policy does allow me to create log streams (logs:CreateLogStream). I have that permission enabled.
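For reference, here is roughly how that can be verified with the IAM policy simulator. This is only a sketch; it assumes role is the execution role ARN passed to the estimator, and that the caller is allowed to run iam:SimulatePrincipalPolicy:

import boto3

iam = boto3.client("iam")

# simulate the CloudWatch Logs actions against the execution role
result = iam.simulate_principal_policy(
    PolicySourceArn=role,  # the execution role ARN used in the estimator
    ActionNames=[
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
    ],
)

for evaluation in result["EvaluationResults"]:
    print(evaluation["EvalActionName"], "->", evaluation["EvalDecision"])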

The error seems to be in here, right?

datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir)

Is File "/opt/conda/lib/p the end of the error message?

I think Phil was correct when he said that is not the full error message. It seems that only the first 1024 characters are returned, which is why it cuts off like that. I also thought there might be an issue with the load_dataset function, but I am now thinking that might not be it, particularly because training does seem to be happening; the error seems to occur at the upload stage. It is all the more peculiar because I was able to run a training job on the xsum dataset with the exact same training configuration, so if there is a problem in the upload, it seems to be isolated to custom datasets.
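If it helps, the truncated message appears to be the FailureReason field that DescribeTrainingJob returns (it is capped at 1024 characters), which the exception then surfaces. A minimal sketch to read it directly, reusing the job name from the exception above:

import boto3

sm = boto3.client("sagemaker")

desc = sm.describe_training_job(
    TrainingJobName="huggingface-pytorch-training-2021-06-22-15-34-33-634"
)

print(desc["TrainingJobStatus"])  # e.g. "Failed"
print(desc["FailureReason"])      # the (truncated) reason shown in the exception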

@ujjirox did you manage to run that Python code locally, outside of SageMaker?
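Something like the following should exercise the same load_dataset call that the script runs. This is only a sketch; it assumes the three CSVs have been downloaded next to wherever you run it:

from datasets import load_dataset

# the same csv loading that run_summarization.py performs internally
data_files = {
    "train": "final_aws_deepgram_train.csv",
    "validation": "final_aws_deepgram_validation.csv",
    "test": "final_aws_deepgram_test.csv",
}

datasets = load_dataset("csv", data_files=data_files)
print(datasets)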

Hey, I was able to figure it out. Since the /aws/sagemaker/TrainingJobs log group had only just been created, I had to restart the SageMaker instance altogether for the logs to show up. This is the full error log:

p (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-06-22 18:10:59,048 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-06-22 18:10:59,071 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-06-22 18:11:02,099 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-06-22 18:11:02,525 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/opt/conda/bin/python3.6 -m pip install -r requirements.txt
Requirement already satisfied: datasets>=1.1.3 in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 1)) (1.6.2)
Requirement already satisfied: sentencepiece!=0.1.92 in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 2)) (0.1.91)
Requirement already satisfied: protobuf in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 3)) (3.17.1)
Collecting rouge-score
  Downloading rouge_score-0.0.4-py2.py3-none-any.whl (22 kB)
Collecting nltk
  Downloading nltk-3.6.2-py3-none-any.whl (1.5 MB)
Collecting py7zr
  Downloading py7zr-0.16.1-py3-none-any.whl (65 kB)
Requirement already satisfied: torch>=1.3 in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 7)) (1.7.1)
Requirement already satisfied: dill in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.3.3)
Requirement already satisfied: huggingface-hub<0.1.0 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.0.8)
Requirement already satisfied: dataclasses in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.8)
Requirement already satisfied: pandas in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (1.1.5)
Requirement already satisfied: tqdm<4.50.0,>=4.27 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (4.49.0)
Requirement already satisfied: multiprocess in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.70.11.1)
Requirement already satisfied: fsspec in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (2021.5.0)
Requirement already satisfied: xxhash in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (2.0.2)
Requirement already satisfied: requests>=2.19.0 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (2.25.1)
Requirement already satisfied: pyarrow>=1.0.0<4.0.0 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (4.0.0)
Requirement already satisfied: importlib-metadata in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (4.0.1)
Requirement already satisfied: packaging in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (20.9)
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (1.19.1)
Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.6/site-packages (from torch>=1.3->-r requirements.txt (line 7)) (3.10.0.0)
Requirement already satisfied: filelock in /opt/conda/lib/python3.6/site-packages (from huggingface-hub<0.1.0->datasets>=1.1.3->-r requirements.txt (line 1)) (3.0.12)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (2020.12.5)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (1.25.11)
Requirement already satisfied: chardet<5,>=3.0.2 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (2.10)
Requirement already satisfied: six>=1.9 in /opt/conda/lib/python3.6/site-packages (from protobuf->-r requirements.txt (line 3)) (1.16.0)
Collecting absl-py
  Downloading absl_py-0.13.0-py3-none-any.whl (132 kB)
Requirement already satisfied: joblib in /opt/conda/lib/python3.6/site-packages (from nltk->-r requirements.txt (line 5)) (1.0.1)
Requirement already satisfied: click in /opt/conda/lib/python3.6/site-packages (from nltk->-r requirements.txt (line 5)) (7.1.2)
Requirement already satisfied: regex in /opt/conda/lib/python3.6/site-packages (from nltk->-r requirements.txt (line 5)) (2021.4.4)
Collecting pyppmd>=0.14.0
  Downloading pyppmd-0.15.0-cp36-cp36m-manylinux2014_x86_64.whl (120 kB)
Collecting pyzstd<0.15.0,>=0.14.4
  Downloading pyzstd-0.14.4-cp36-cp36m-manylinux2014_x86_64.whl (2.2 MB)
Collecting multivolumefile>=0.2.3
  Downloading multivolumefile-0.2.3-py3-none-any.whl (17 kB)
Collecting brotli>=1.0.9
  Downloading Brotli-1.0.9-cp36-cp36m-manylinux1_x86_64.whl (357 kB)
Collecting texttable
  Downloading texttable-1.6.3-py2.py3-none-any.whl (10 kB)
Collecting bcj-cffi<0.6.0,>=0.5.1
  Downloading bcj_cffi-0.5.1-cp36-cp36m-manylinux2014_x86_64.whl (36 kB)
Collecting pycryptodomex>=3.6.6
  Downloading pycryptodomex-3.10.1-cp35-abi3-manylinux2010_x86_64.whl (1.9 MB)
Requirement already satisfied: cffi>=1.14.0 in /opt/conda/lib/python3.6/site-packages (from bcj-cffi<0.6.0,>=0.5.1->py7zr->-r requirements.txt (line 6)) (1.14.5)
Requirement already satisfied: pycparser in /opt/conda/lib/python3.6/site-packages (from cffi>=1.14.0->bcj-cffi<0.6.0,>=0.5.1->py7zr->-r requirements.txt (line 6)) (2.20)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.6/site-packages (from importlib-metadata->datasets>=1.1.3->-r requirements.txt (line 1)) (3.4.1)
Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.6/site-packages (from packaging->datasets>=1.1.3->-r requirements.txt (line 1)) (2.4.7)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.6/site-packages (from pandas->datasets>=1.1.3->-r requirements.txt (line 1)) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.6/site-packages (from pandas->datasets>=1.1.3->-r requirements.txt (line 1)) (2021.1)
Installing collected packages: texttable, pyzstd, pyppmd, pycryptodomex, nltk, multivolumefile, brotli, bcj-cffi, absl-py, rouge-score, py7zr
Successfully installed absl-py-0.13.0 bcj-cffi-0.5.1 brotli-1.0.9 multivolumefile-0.2.3 nltk-3.6.2 py7zr-0.16.1 pycryptodomex-3.10.1 pyppmd-0.15.0 pyzstd-0.14.4 rouge-score-0.0.4 texttable-1.6.3
WARNING: Running pip as root will break packages and permissions. You should install packages reliably by using venv: https://pip.pypa.io/warnings/venv

2021-06-22 18:11:08,362 sagemaker-training-toolkit INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "test": "/opt/ml/input/data/test",
        "validation": "/opt/ml/input/data/validation",
        "train": "/opt/ml/input/data/train"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "evaluation_strategy": "steps",
        "per_device_eval_batch_size": 2,
        "load_best_model_at_end": true,
        "max_steps": 200,
        "max_source_length": 500,
        "validation_file": "/opt/ml/input/data/validation/final_aws_deepgram_validation.csv",
        "text_column": "document",
        "do_eval": true,
        "output_dir": "/opt/ml/model",
        "eval_steps": 200,
        "max_grad_norm": 1,
        "fp16": true,
        "max_target_length": 100,
        "weight_decay": 0.01,
        "do_train": true,
        "test_file": "/opt/ml/input/data/test/final_aws_deepgram_test.csv",
        "train_file": "/opt/ml/input/data/train/final_aws_deepgram_train.csv",
        "per_device_train_batch_size": 2,
        "learning_rate": 2e-05,
        "model_name_or_path": "google/pegasus-large",
        "summary_column": "summary"
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "test": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        },
        "validation": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        },
        "train": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "huggingface-pytorch-training-2021-06-22-18-03-56-300",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://qfn-transcription/huggingface-pytorch-training-2021-06-22-18-03-56-300/source/sourcedir.tar.gz",
    "module_name": "run_summarization_original",
    "network_interface_name": "eth0",
    "num_cpus": 8,
    "num_gpus": 1,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "run_summarization_original.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"do_eval":true,"do_train":true,"eval_steps":200,"evaluation_strategy":"steps","fp16":true,"learning_rate":2e-05,"load_best_model_at_end":true,"max_grad_norm":1,"max_source_length":500,"max_steps":200,"max_target_length":100,"model_name_or_path":"google/pegasus-large","output_dir":"/opt/ml/model","per_device_eval_batch_size":2,"per_device_train_batch_size":2,"summary_column":"summary","test_file":"/opt/ml/input/data/test/final_aws_deepgram_test.csv","text_column":"document","train_file":"/opt/ml/input/data/train/final_aws_deepgram_train.csv","validation_file":"/opt/ml/input/data/validation/final_aws_deepgram_validation.csv","weight_decay":0.01}
SM_USER_ENTRY_POINT=run_summarization_original.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"validation":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["test","train","validation"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=run_summarization_original
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=8
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://qfn-transcription/huggingface-pytorch-training-2021-06-22-18-03-56-300/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"test":"/opt/ml/input/data/test","train":"/opt/ml/input/data/train","validation":"/opt/ml/input/data/validation"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"do_eval":true,"do_train":true,"eval_steps":200,"evaluation_strategy":"steps","fp16":true,"learning_rate":2e-05,"load_best_model_at_end":true,"max_grad_norm":1,"max_source_length":500,"max_steps":200,"max_target_length":100,"model_name_or_path":"google/pegasus-large","output_dir":"/opt/ml/model","per_device_eval_batch_size":2,"per_device_train_batch_size":2,"summary_column":"summary","test_file":"/opt/ml/input/data/test/final_aws_deepgram_test.csv","text_column":"document","train_file":"/opt/ml/input/data/train/final_aws_deepgram_train.csv","validation_file":"/opt/ml/input/data/validation/final_aws_deepgram_validation.csv","weight_decay":0.01},"input_config_dir":"/opt/ml/input/config","input_data_config":{"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"validation":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"huggingface-pytorch-training-2021-06-22-18-03-56-300","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://qfn-transcription/huggingface-pytorch-training-2021-06-22-18-03-56-300/source/sourcedir.tar.gz","module_name":"run_summarization_original","network_interface_name":"eth0","num_cpus":8,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"run_summarization_original.py"}
SM_USER_ARGS=["--do_eval","True","--do_train","True","--eval_steps","200","--evaluation_strategy","steps","--fp16","True","--learning_rate","2e-05","--load_best_model_at_end","True","--max_grad_norm","1","--max_source_length","500","--max_steps","200","--max_target_length","100","--model_name_or_path","google/pegasus-large","--output_dir","/opt/ml/model","--per_device_eval_batch_size","2","--per_device_train_batch_size","2","--summary_column","summary","--test_file","/opt/ml/input/data/test/final_aws_deepgram_test.csv","--text_column","document","--train_file","/opt/ml/input/data/train/final_aws_deepgram_train.csv","--validation_file","/opt/ml/input/data/validation/final_aws_deepgram_validation.csv","--weight_decay","0.01"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TEST=/opt/ml/input/data/test
SM_CHANNEL_VALIDATION=/opt/ml/input/data/validation
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_EVALUATION_STRATEGY=steps
SM_HP_PER_DEVICE_EVAL_BATCH_SIZE=2
SM_HP_LOAD_BEST_MODEL_AT_END=true
SM_HP_MAX_STEPS=200
SM_HP_MAX_SOURCE_LENGTH=500
SM_HP_VALIDATION_FILE=/opt/ml/input/data/validation/final_aws_deepgram_validation.csv
SM_HP_TEXT_COLUMN=document
SM_HP_DO_EVAL=true
SM_HP_OUTPUT_DIR=/opt/ml/model
SM_HP_EVAL_STEPS=200
SM_HP_MAX_GRAD_NORM=1
SM_HP_FP16=true
SM_HP_MAX_TARGET_LENGTH=100
SM_HP_WEIGHT_DECAY=0.01
SM_HP_DO_TRAIN=true
SM_HP_TEST_FILE=/opt/ml/input/data/test/final_aws_deepgram_test.csv
SM_HP_TRAIN_FILE=/opt/ml/input/data/train/final_aws_deepgram_train.csv
SM_HP_PER_DEVICE_TRAIN_BATCH_SIZE=2
SM_HP_LEARNING_RATE=2e-05
SM_HP_MODEL_NAME_OR_PATH=google/pegasus-large
SM_HP_SUMMARY_COLUMN=summary
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages

Invoking script with the following command:

/opt/conda/bin/python3.6 run_summarization_original.py --do_eval True --do_train True --eval_steps 200 --evaluation_strategy steps --fp16 True --learning_rate 2e-05 --load_best_model_at_end True --max_grad_norm 1 --max_source_length 500 --max_steps 200 --max_target_length 100 --model_name_or_path google/pegasus-large --output_dir /opt/ml/model --per_device_eval_batch_size 2 --per_device_train_batch_size 2 --summary_column summary --test_file /opt/ml/input/data/test/final_aws_deepgram_test.csv --text_column document --train_file /opt/ml/input/data/train/final_aws_deepgram_train.csv --validation_file /opt/ml/input/data/validation/final_aws_deepgram_validation.csv --weight_decay 0.01


06/22/2021 18:11:14 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: True
06/22/2021 18:11:14 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/opt/ml/model', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<IntervalStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=2, per_device_eval_batch_size=2, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.01, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=200, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=0, logging_dir='runs/Jun22_18-11-13_algo-1', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=500, save_strategy=<IntervalStrategy.STEPS: 'steps'>, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level='O1', fp16_backend='auto', fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=200, dataloader_num_workers=0, past_index=-1, run_name='/opt/ml/model', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=True, metric_for_best_model='loss', greater_is_better=False, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, length_column_name='length', report_to=[], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, mp_parameters='', sortish_sampler=False, predict_with_generate=False)
Traceback (most recent call last):
  File "run_summarization_original.py", line 606, in <module>
    main()
  File "run_summarization_original.py", line 325, in main
    datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir)
  File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 737, in load_dataset
    **config_kwargs,
  File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 237, in __init__
    **config_kwargs,
  File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 348, in _create_builder_config
    config_id = builder_config.create_config_id(config_kwargs, custom_features=custom_features)
  File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 153, in create_config_id
    m.update(str(os.path.getmtime(data_file)))
  File "/opt/conda/lib/python3.6/genericpath.py", line 55, in getmtime
    return os.stat(filename).st_mtime
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/validation/final_aws_deepgram_validation.csv'

2021-06-22 18:11:15,009 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 run_summarization_original.py --do_eval True --do_train True --eval_steps 200 --evaluation_strategy steps --fp16 True --learning_rate 2e-05 --load_best_model_at_end True --max_grad_norm 1 --max_source_length 500 --max_steps 200 --max_target_length 100 --model_name_or_path google/pegasus-large --output_dir /opt/ml/model --per_device_eval_batch_size 2 --per_device_train_batch_size 2 --summary_column summary --test_file /opt/ml/input/data/test/final_aws_deepgram_test.csv --text_column document --train_file /opt/ml/input/data/train/final_aws_deepgram_train.csv --validation_file /opt/ml/input/data/validation/final_aws_deepgram_validation.csv --weight_decay 0.01"
Traceback (most recent call last):
  File "run_summarization_original.py", line 606, in <module>
    main()
  File "run_summarization_original.py", line 325, in main
    datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir)
  File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 737, in load_dataset
    **config_kwargs,
  File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 237, in __init__
    **config_kwargs,
  File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 348, in _create_builder_config
    config_id = builder_config.create_config_id(config_kwargs, custom_features=custom_features)
  File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 153, in create_config_id
    m.update(str(os.path.getmtime(data_file)))
  File "/opt/conda/lib/python3.6/genericpath.py", line 55, in getmtime
    return os.stat(filename).st_mtime
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/validation/final_aws_deepgram_validation.csv'


I checked the previous post you made on this, and it does seem that this should be the right directory where the files are stored. Would you know why this might be happening, @OlivierCR? The S3 URIs I provided in the .fit() call are ones I checked by calling pd.read_csv() within the SageMaker instance, and they seemed to work as well. Thanks.
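For completeness, this is roughly how the objects under that prefix can be listed to double-check the exact key names (a minimal boto3 sketch; bucket and prefix taken from the URIs above):

import boto3

s3 = boto3.client("s3")

# list every object under the prefix used in the .fit() channels
response = s3.list_objects_v2(Bucket="qfn-transcription", Prefix="ujjawal_files/")

for obj in response.get("Contents", []):
    print(obj["Key"])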

To answer your question: I ran the code from within a SageMaker notebook instance.

Figured it out! Sorry for the confusion.

final_aws_deepgram_validation.csv vs. final_aws_deepgram_validate.csv? 🙂
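A quick sanity check like this would have caught it before the job launched. Just a sketch, reusing the hyperparameters dict and the channel URIs from your post; it compares the basename each channel uploads with the basename the script expects:

import os

expected = {
    "train": hyperparameters["train_file"],
    "test": hyperparameters["test_file"],
    "validation": hyperparameters["validation_file"],
}
channels = {
    "train": "s3://qfn-transcription/ujjawal_files/final_aws_deepgram_train.csv",
    "test": "s3://qfn-transcription/ujjawal_files/final_aws_deepgram_test.csv",
    "validation": "s3://qfn-transcription/ujjawal_files/final_aws_deepgram_validate.csv",
}

for channel, path in expected.items():
    local_name = os.path.basename(path)
    uploaded_name = os.path.basename(channels[channel])
    # this fails for "validation": validation.csv vs validate.csv
    assert local_name == uploaded_name, (
        f"{channel}: script expects {local_name!r} but channel uploads {uploaded_name!r}"
    )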

Yup. Really stupid on my part. Sorry.