Hey, I was able to figure it out. Since the /aws/sagemaker/TrainingJobs log group had only just been created, I had to restart the SageMaker instance altogether before the logs showed up.
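A quick aside in case anyone else is hunting for these logs: they end up in the /aws/sagemaker/TrainingJobs CloudWatch log group, one stream per training job. Here's a minimal boto3 sketch to dump them, assuming only the job name from the log below (pagination omitted for brevity):

```python
import boto3

logs = boto3.client("logs")

# The log group is the standard SageMaker one; the stream prefix is the
# training job name taken from the log below.
job_name = "huggingface-pytorch-training-2021-06-22-18-03-56-300"
streams = logs.describe_log_streams(
    logGroupName="/aws/sagemaker/TrainingJobs",
    logStreamNamePrefix=job_name,
)
for stream in streams["logStreams"]:
    events = logs.get_log_events(
        logGroupName="/aws/sagemaker/TrainingJobs",
        logStreamName=stream["logStreamName"],
        startFromHead=True,
    )
    for event in events["events"]:
        print(event["message"])
```

This is the full error log: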
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-06-22 18:10:59,048 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training
2021-06-22 18:10:59,071 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2021-06-22 18:11:02,099 sagemaker_pytorch_container.training INFO Invoking user training script.
2021-06-22 18:11:02,525 sagemaker-training-toolkit INFO Installing dependencies from requirements.txt:
/opt/conda/bin/python3.6 -m pip install -r requirements.txt
Requirement already satisfied: datasets>=1.1.3 in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 1)) (1.6.2)
Requirement already satisfied: sentencepiece!=0.1.92 in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 2)) (0.1.91)
Requirement already satisfied: protobuf in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 3)) (3.17.1)
Collecting rouge-score
Downloading rouge_score-0.0.4-py2.py3-none-any.whl (22 kB)
Collecting nltk
Downloading nltk-3.6.2-py3-none-any.whl (1.5 MB)
Collecting py7zr
Downloading py7zr-0.16.1-py3-none-any.whl (65 kB)
Requirement already satisfied: torch>=1.3 in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 7)) (1.7.1)
Requirement already satisfied: dill in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.3.3)
Requirement already satisfied: huggingface-hub<0.1.0 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.0.8)
Requirement already satisfied: dataclasses in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.8)
Requirement already satisfied: pandas in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (1.1.5)
Requirement already satisfied: tqdm<4.50.0,>=4.27 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (4.49.0)
Requirement already satisfied: multiprocess in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.70.11.1)
Requirement already satisfied: fsspec in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (2021.5.0)
Requirement already satisfied: xxhash in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (2.0.2)
Requirement already satisfied: requests>=2.19.0 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (2.25.1)
Requirement already satisfied: pyarrow>=1.0.0<4.0.0 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (4.0.0)
Requirement already satisfied: importlib-metadata in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (4.0.1)
Requirement already satisfied: packaging in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (20.9)
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (1.19.1)
Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.6/site-packages (from torch>=1.3->-r requirements.txt (line 7)) (3.10.0.0)
Requirement already satisfied: filelock in /opt/conda/lib/python3.6/site-packages (from huggingface-hub<0.1.0->datasets>=1.1.3->-r requirements.txt (line 1)) (3.0.12)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (2020.12.5)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (1.25.11)
Requirement already satisfied: chardet<5,>=3.0.2 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (2.10)
Requirement already satisfied: six>=1.9 in /opt/conda/lib/python3.6/site-packages (from protobuf->-r requirements.txt (line 3)) (1.16.0)
Collecting absl-py
Downloading absl_py-0.13.0-py3-none-any.whl (132 kB)
Requirement already satisfied: joblib in /opt/conda/lib/python3.6/site-packages (from nltk->-r requirements.txt (line 5)) (1.0.1)
Requirement already satisfied: click in /opt/conda/lib/python3.6/site-packages (from nltk->-r requirements.txt (line 5)) (7.1.2)
Requirement already satisfied: regex in /opt/conda/lib/python3.6/site-packages (from nltk->-r requirements.txt (line 5)) (2021.4.4)
Collecting pyppmd>=0.14.0
Downloading pyppmd-0.15.0-cp36-cp36m-manylinux2014_x86_64.whl (120 kB)
Collecting pyzstd<0.15.0,>=0.14.4
Downloading pyzstd-0.14.4-cp36-cp36m-manylinux2014_x86_64.whl (2.2 MB)
Collecting multivolumefile>=0.2.3
Downloading multivolumefile-0.2.3-py3-none-any.whl (17 kB)
Collecting brotli>=1.0.9
Downloading Brotli-1.0.9-cp36-cp36m-manylinux1_x86_64.whl (357 kB)
Collecting texttable
Downloading texttable-1.6.3-py2.py3-none-any.whl (10 kB)
Collecting bcj-cffi<0.6.0,>=0.5.1
Downloading bcj_cffi-0.5.1-cp36-cp36m-manylinux2014_x86_64.whl (36 kB)
Collecting pycryptodomex>=3.6.6
Downloading pycryptodomex-3.10.1-cp35-abi3-manylinux2010_x86_64.whl (1.9 MB)
Requirement already satisfied: cffi>=1.14.0 in /opt/conda/lib/python3.6/site-packages (from bcj-cffi<0.6.0,>=0.5.1->py7zr->-r requirements.txt (line 6)) (1.14.5)
Requirement already satisfied: pycparser in /opt/conda/lib/python3.6/site-packages (from cffi>=1.14.0->bcj-cffi<0.6.0,>=0.5.1->py7zr->-r requirements.txt (line 6)) (2.20)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.6/site-packages (from importlib-metadata->datasets>=1.1.3->-r requirements.txt (line 1)) (3.4.1)
Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.6/site-packages (from packaging->datasets>=1.1.3->-r requirements.txt (line 1)) (2.4.7)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.6/site-packages (from pandas->datasets>=1.1.3->-r requirements.txt (line 1)) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.6/site-packages (from pandas->datasets>=1.1.3->-r requirements.txt (line 1)) (2021.1)
Installing collected packages: texttable, pyzstd, pyppmd, pycryptodomex, nltk, multivolumefile, brotli, bcj-cffi, absl-py, rouge-score, py7zr
Successfully installed absl-py-0.13.0 bcj-cffi-0.5.1 brotli-1.0.9 multivolumefile-0.2.3 nltk-3.6.2 py7zr-0.16.1 pycryptodomex-3.10.1 pyppmd-0.15.0 pyzstd-0.14.4 rouge-score-0.0.4 texttable-1.6.3
WARNING: Running pip as root will break packages and permissions. You should install packages reliably by using venv: https://pip.pypa.io/warnings/venv
2021-06-22 18:11:08,362 sagemaker-training-toolkit INFO Invoking user script
Training Env:
{
"additional_framework_parameters": {},
"channel_input_dirs": {
"test": "/opt/ml/input/data/test",
"validation": "/opt/ml/input/data/validation",
"train": "/opt/ml/input/data/train"
},
"current_host": "algo-1",
"framework_module": "sagemaker_pytorch_container.training:main",
"hosts": [
"algo-1"
],
"hyperparameters": {
"evaluation_strategy": "steps",
"per_device_eval_batch_size": 2,
"load_best_model_at_end": true,
"max_steps": 200,
"max_source_length": 500,
"validation_file": "/opt/ml/input/data/validation/final_aws_deepgram_validation.csv",
"text_column": "document",
"do_eval": true,
"output_dir": "/opt/ml/model",
"eval_steps": 200,
"max_grad_norm": 1,
"fp16": true,
"max_target_length": 100,
"weight_decay": 0.01,
"do_train": true,
"test_file": "/opt/ml/input/data/test/final_aws_deepgram_test.csv",
"train_file": "/opt/ml/input/data/train/final_aws_deepgram_train.csv",
"per_device_train_batch_size": 2,
"learning_rate": 2e-05,
"model_name_or_path": "google/pegasus-large",
"summary_column": "summary"
},
"input_config_dir": "/opt/ml/input/config",
"input_data_config": {
"test": {
"TrainingInputMode": "File",
"S3DistributionType": "FullyReplicated",
"RecordWrapperType": "None"
},
"validation": {
"TrainingInputMode": "File",
"S3DistributionType": "FullyReplicated",
"RecordWrapperType": "None"
},
"train": {
"TrainingInputMode": "File",
"S3DistributionType": "FullyReplicated",
"RecordWrapperType": "None"
}
},
"input_dir": "/opt/ml/input",
"is_master": true,
"job_name": "huggingface-pytorch-training-2021-06-22-18-03-56-300",
"log_level": 20,
"master_hostname": "algo-1",
"model_dir": "/opt/ml/model",
"module_dir": "s3://qfn-transcription/huggingface-pytorch-training-2021-06-22-18-03-56-300/source/sourcedir.tar.gz",
"module_name": "run_summarization_original",
"network_interface_name": "eth0",
"num_cpus": 8,
"num_gpus": 1,
"output_data_dir": "/opt/ml/output/data",
"output_dir": "/opt/ml/output",
"output_intermediate_dir": "/opt/ml/output/intermediate",
"resource_config": {
"current_host": "algo-1",
"hosts": [
"algo-1"
],
"network_interface_name": "eth0"
},
"user_entry_point": "run_summarization_original.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"do_eval":true,"do_train":true,"eval_steps":200,"evaluation_strategy":"steps","fp16":true,"learning_rate":2e-05,"load_best_model_at_end":true,"max_grad_norm":1,"max_source_length":500,"max_steps":200,"max_target_length":100,"model_name_or_path":"google/pegasus-large","output_dir":"/opt/ml/model","per_device_eval_batch_size":2,"per_device_train_batch_size":2,"summary_column":"summary","test_file":"/opt/ml/input/data/test/final_aws_deepgram_test.csv","text_column":"document","train_file":"/opt/ml/input/data/train/final_aws_deepgram_train.csv","validation_file":"/opt/ml/input/data/validation/final_aws_deepgram_validation.csv","weight_decay":0.01}
SM_USER_ENTRY_POINT=run_summarization_original.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"validation":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["test","train","validation"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=run_summarization_original
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=8
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://qfn-transcription/huggingface-pytorch-training-2021-06-22-18-03-56-300/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"test":"/opt/ml/input/data/test","train":"/opt/ml/input/data/train","validation":"/opt/ml/input/data/validation"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"do_eval":true,"do_train":true,"eval_steps":200,"evaluation_strategy":"steps","fp16":true,"learning_rate":2e-05,"load_best_model_at_end":true,"max_grad_norm":1,"max_source_length":500,"max_steps":200,"max_target_length":100,"model_name_or_path":"google/pegasus-large","output_dir":"/opt/ml/model","per_device_eval_batch_size":2,"per_device_train_batch_size":2,"summary_column":"summary","test_file":"/opt/ml/input/data/test/final_aws_deepgram_test.csv","text_column":"document","train_file":"/opt/ml/input/data/train/final_aws_deepgram_train.csv","validation_file":"/opt/ml/input/data/validation/final_aws_deepgram_validation.csv","weight_decay":0.01},"input_config_dir":"/opt/ml/input/config","input_data_config":{"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"validation":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"huggingface-pytorch-training-2021-06-22-18-03-56-300","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://qfn-transcription/huggingface-pytorch-training-2021-06-22-18-03-56-300/source/sourcedir.tar.gz","module_name":"run_summarization_original","network_interface_name":"eth0","num_cpus":8,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"run_summarization_original.py"}
SM_USER_ARGS=["--do_eval","True","--do_train","True","--eval_steps","200","--evaluation_strategy","steps","--fp16","True","--learning_rate","2e-05","--load_best_model_at_end","True","--max_grad_norm","1","--max_source_length","500","--max_steps","200","--max_target_length","100","--model_name_or_path","google/pegasus-large","--output_dir","/opt/ml/model","--per_device_eval_batch_size","2","--per_device_train_batch_size","2","--summary_column","summary","--test_file","/opt/ml/input/data/test/final_aws_deepgram_test.csv","--text_column","document","--train_file","/opt/ml/input/data/train/final_aws_deepgram_train.csv","--validation_file","/opt/ml/input/data/validation/final_aws_deepgram_validation.csv","--weight_decay","0.01"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TEST=/opt/ml/input/data/test
SM_CHANNEL_VALIDATION=/opt/ml/input/data/validation
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_EVALUATION_STRATEGY=steps
SM_HP_PER_DEVICE_EVAL_BATCH_SIZE=2
SM_HP_LOAD_BEST_MODEL_AT_END=true
SM_HP_MAX_STEPS=200
SM_HP_MAX_SOURCE_LENGTH=500
SM_HP_VALIDATION_FILE=/opt/ml/input/data/validation/final_aws_deepgram_validation.csv
SM_HP_TEXT_COLUMN=document
SM_HP_DO_EVAL=true
SM_HP_OUTPUT_DIR=/opt/ml/model
SM_HP_EVAL_STEPS=200
SM_HP_MAX_GRAD_NORM=1
SM_HP_FP16=true
SM_HP_MAX_TARGET_LENGTH=100
SM_HP_WEIGHT_DECAY=0.01
SM_HP_DO_TRAIN=true
SM_HP_TEST_FILE=/opt/ml/input/data/test/final_aws_deepgram_test.csv
SM_HP_TRAIN_FILE=/opt/ml/input/data/train/final_aws_deepgram_train.csv
SM_HP_PER_DEVICE_TRAIN_BATCH_SIZE=2
SM_HP_LEARNING_RATE=2e-05
SM_HP_MODEL_NAME_OR_PATH=google/pegasus-large
SM_HP_SUMMARY_COLUMN=summary
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages
Invoking script with the following command:
/opt/conda/bin/python3.6 run_summarization_original.py --do_eval True --do_train True --eval_steps 200 --evaluation_strategy steps --fp16 True --learning_rate 2e-05 --load_best_model_at_end True --max_grad_norm 1 --max_source_length 500 --max_steps 200 --max_target_length 100 --model_name_or_path google/pegasus-large --output_dir /opt/ml/model --per_device_eval_batch_size 2 --per_device_train_batch_size 2 --summary_column summary --test_file /opt/ml/input/data/test/final_aws_deepgram_test.csv --text_column document --train_file /opt/ml/input/data/train/final_aws_deepgram_train.csv --validation_file /opt/ml/input/data/validation/final_aws_deepgram_validation.csv --weight_decay 0.01
06/22/2021 18:11:14 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: True
06/22/2021 18:11:14 - INFO - __main__ - Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/opt/ml/model', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<IntervalStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=2, per_device_eval_batch_size=2, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.01, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=200, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=0, logging_dir='runs/Jun22_18-11-13_algo-1', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=500, save_strategy=<IntervalStrategy.STEPS: 'steps'>, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level='O1', fp16_backend='auto', fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=200, dataloader_num_workers=0, past_index=-1, run_name='/opt/ml/model', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=True, metric_for_best_model='loss', greater_is_better=False, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, length_column_name='length', report_to=[], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, mp_parameters='', sortish_sampler=False, predict_with_generate=False)
Traceback (most recent call last):
File "run_summarization_original.py", line 606, in <module>
main()
File "run_summarization_original.py", line 325, in main
datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir)
File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 737, in load_dataset
**config_kwargs,
File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 237, in __init__
**config_kwargs,
File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 348, in _create_builder_config
config_id = builder_config.create_config_id(config_kwargs, custom_features=custom_features)
File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 153, in create_config_id
m.update(str(os.path.getmtime(data_file)))
File "/opt/conda/lib/python3.6/genericpath.py", line 55, in getmtime
return os.stat(filename).st_mtime
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/validation/final_aws_deepgram_validation.csv'
2021-06-22 18:11:15,009 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 run_summarization_original.py --do_eval True --do_train True --eval_steps 200 --evaluation_strategy steps --fp16 True --learning_rate 2e-05 --load_best_model_at_end True --max_grad_norm 1 --max_source_length 500 --max_steps 200 --max_target_length 100 --model_name_or_path google/pegasus-large --output_dir /opt/ml/model --per_device_eval_batch_size 2 --per_device_train_batch_size 2 --summary_column summary --test_file /opt/ml/input/data/test/final_aws_deepgram_test.csv --text_column document --train_file /opt/ml/input/data/train/final_aws_deepgram_train.csv --validation_file /opt/ml/input/data/validation/final_aws_deepgram_validation.csv --weight_decay 0.01"
Traceback (most recent call last):
File "run_summarization_original.py", line 606, in <module>
main()
File "run_summarization_original.py", line 325, in main
datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir)
File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 737, in load_dataset
**config_kwargs,
File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 237, in __init__
**config_kwargs,
File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 348, in _create_builder_config
config_id = builder_config.create_config_id(config_kwargs, custom_features=custom_features)
File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 153, in create_config_id
m.update(str(os.path.getmtime(data_file)))
File "/opt/conda/lib/python3.6/genericpath.py", line 55, in getmtime
return os.stat(filename).st_mtime
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/validation/final_aws_deepgram_validation.csv'
I checked the previous post you made on this, and it does seem that this should be the right directory where the files are stored. Would you know why this might be happening, @OlivierCR? The S3 links I passed to the .fit() call are ones I verified by calling pd.read_csv() on them from within the SageMaker instance, and those reads worked fine.
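For reference, here is roughly how I'm launching the job and how I checked the files beforehand. The bucket and CSV names match the log above, but the S3 key prefix and the estimator arguments are reconstructed from memory, so treat this as a sketch rather than a verbatim copy of my notebook:

```python
import pandas as pd
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# Bucket name is the one from the log; the "data/" prefix is a placeholder.
train_uri = "s3://qfn-transcription/data/final_aws_deepgram_train.csv"
validation_uri = "s3://qfn-transcription/data/final_aws_deepgram_validation.csv"
test_uri = "s3://qfn-transcription/data/final_aws_deepgram_test.csv"

# This is the check I mentioned: reading the same URIs from the notebook works.
pd.read_csv(validation_uri).head()

estimator = HuggingFace(
    entry_point="run_summarization_original.py",
    source_dir=".",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # assumption: some single-GPU instance (num_gpus: 1 in the log)
    transformers_version="4.6",     # assumption: the container matching py3.6 / torch 1.7.1 above
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters={
        "model_name_or_path": "google/pegasus-large",
        "train_file": "/opt/ml/input/data/train/final_aws_deepgram_train.csv",
        "validation_file": "/opt/ml/input/data/validation/final_aws_deepgram_validation.csv",
        "test_file": "/opt/ml/input/data/test/final_aws_deepgram_test.csv",
        # ...plus the rest of the hyperparameters shown in SM_HPS above
    },
)

# Each channel here should be downloaded to /opt/ml/input/data/<channel>/ inside
# the container, which is why the *_file hyperparameters point at those paths.
estimator.fit({"train": train_uri, "validation": validation_uri, "test": test_uri})
```

To narrow down where the file goes missing, the first thing I'll try is printing what actually landed in each channel directory at the top of the training script, before load_dataset() runs. A minimal debug sketch using the SM_CHANNEL_* variables from the log (my addition, not part of run_summarization_original.py):

```python
import os

# Debug aid: show what SageMaker actually materialized for each channel.
for name in ("TRAIN", "VALIDATION", "TEST"):
    channel_dir = os.environ.get(f"SM_CHANNEL_{name}")
    if channel_dir and os.path.isdir(channel_dir):
        print(name, "->", channel_dir, os.listdir(channel_dir))
    else:
        print(name, "-> channel directory missing:", channel_dir)
```

Thanks.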