Using custom CSV data with run_summarization.py in SageMaker

Hi -

I am trying to use the SageMaker HuggingFace estimator to fine-tune a model for summarization using the run_summarization.py entry point. I have created a SageMaker Studio notebook based on the code from the summarization example notebook provided in the SageMaker examples.

I would like to train the model on my own data, so I have added the following code to make the train and validation datasets I uploaded to S3 available to the estimator.

Define the S3 locations:

training_input_path = "s3://sagemaker-eu-central-1-88888888888/train_20210607.csv"
test_input_path = "s3://sagemaker-eu-central-1-88888888888/val_20210607.csv"

Define the file locations in the hyperparameters:

hyperparameters={
    ...,
    'train_file': '/opt/ml/input/data/train_20210607.csv',
    'validation_file': '/opt/ml/input/data/val_20210607.csv',
    'text_column': 'document',
    'summary_column': 'summary',
    ...
}

Ensure the data is loaded when starting the training job:
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

From the log, it looks like the data has been loaded into the expected directory:

SM_HP_VALIDATION_FILE=/opt/ml/input/data/val_20210607.csv
SM_HP_TRAIN_FILE=/opt/ml/input/data/train_20210607.csv

However, after the run_summarization.py script is run, I get the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/train_20210607.csv'

Apologies if I am missing something obvious, but it would be great if anyone could let me know how I should reference my data so that it can be used by the run_summarization.py script.

Thank you!

Ben

Hey @benG,

Do you have the full error log, i.e. where exactly the error is thrown in run_summarization.py?
Also, which transformers_version are you using, and what does your git_config look like?

Hi @philschmid -

Thank you for getting back to me so quickly. The full log is as follows:

bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-06-16 10:28:28,198 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-06-16 10:28:28,223 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-06-16 10:28:28,230 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-06-16 10:28:28,542 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/opt/conda/bin/python3.6 -m pip install -r requirements.txt
Requirement already satisfied: datasets>=1.1.3 in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 1)) (1.6.2)
Requirement already satisfied: sentencepiece!=0.1.92 in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 2)) (0.1.91)
Requirement already satisfied: protobuf in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 3)) (3.17.1)
Collecting sacrebleu>=1.4.12
  Downloading sacrebleu-1.5.1-py3-none-any.whl (54 kB)
Collecting rouge-score
  Downloading rouge_score-0.0.4-py2.py3-none-any.whl (22 kB)
Collecting nltk
  Downloading nltk-3.6.2-py3-none-any.whl (1.5 MB)
Requirement already satisfied: fsspec in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (2021.5.0)
Requirement already satisfied: pandas in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (1.1.5)
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (1.19.1)
Requirement already satisfied: tqdm<4.50.0,>=4.27 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (4.49.0)
Requirement already satisfied: multiprocess in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.70.11.1)
Requirement already satisfied: xxhash in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (2.0.2)
Requirement already satisfied: dill in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.3.3)
Requirement already satisfied: packaging in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (20.9)
Requirement already satisfied: huggingface-hub<0.1.0 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.0.8)
Requirement already satisfied: dataclasses in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.8)
Requirement already satisfied: importlib-metadata in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (4.4.0)
Requirement already satisfied: pyarrow>=1.0.0<4.0.0 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (4.0.1)
Requirement already satisfied: requests>=2.19.0 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (2.25.1)
Collecting portalocker==2.0.0
  Downloading portalocker-2.0.0-py2.py3-none-any.whl (11 kB)
Requirement already satisfied: filelock in /opt/conda/lib/python3.6/site-packages (from huggingface-hub<0.1.0->datasets>=1.1.3->-r requirements.txt (line 1)) (3.0.12)
Requirement already satisfied: chardet<5,>=3.0.2 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (1.25.11)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (2021.5.30)
Requirement already satisfied: six>=1.9 in /opt/conda/lib/python3.6/site-packages (from protobuf->-r requirements.txt (line 3)) (1.16.0)
Collecting absl-py
  Downloading absl_py-0.13.0-py3-none-any.whl (132 kB)
Requirement already satisfied: click in /opt/conda/lib/python3.6/site-packages (from nltk->-r requirements.txt (line 6)) (7.1.2)
Requirement already satisfied: regex in /opt/conda/lib/python3.6/site-packages (from nltk->-r requirements.txt (line 6)) (2021.4.4)
Requirement already satisfied: joblib in /opt/conda/lib/python3.6/site-packages (from nltk->-r requirements.txt (line 6)) (1.0.1)
Requirement already satisfied: typing-extensions>=3.6.4 in /opt/conda/lib/python3.6/site-packages (from importlib-metadata->datasets>=1.1.3->-r requirements.txt (line 1)) (3.10.0.0)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.6/site-packages (from importlib-metadata->datasets>=1.1.3->-r requirements.txt (line 1)) (3.4.1)
Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.6/site-packages (from packaging->datasets>=1.1.3->-r requirements.txt (line 1)) (2.4.7)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.6/site-packages (from pandas->datasets>=1.1.3->-r requirements.txt (line 1)) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.6/site-packages (from pandas->datasets>=1.1.3->-r requirements.txt (line 1)) (2021.1)
Installing collected packages: portalocker, nltk, absl-py, sacrebleu, rouge-score
Successfully installed absl-py-0.13.0 nltk-3.6.2 portalocker-2.0.0 rouge-score-0.0.4 sacrebleu-1.5.1
WARNING: Running pip as root will break packages and permissions. You should install packages reliably by using venv: https://pip.pypa.io/warnings/venv

2021-06-16 10:28:32,546 sagemaker-training-toolkit INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "test": "/opt/ml/input/data/test",
        "train": "/opt/ml/input/data/train"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "per_device_eval_batch_size": 1,
        "seed": 8,
        "validation_file": "/opt/ml/input/data/val_20210607.csv",
        "do_train": true,
        "text_column": "document",
        "num_train_epochs": 2,
        "do_eval": true,
        "train_file": "/opt/ml/input/data/train_20210607.csv",
        "warmup_steps": 500,
        "save_steps": 500,
        "output_dir": "/opt/ml/model",
        "eval_steps": 500,
        "per_device_train_batch_size": 1,
        "learning_rate": 5e-05,
        "logging_steps": 500,
        "model_name_or_path": "patrickvonplaten/longformer2roberta-cnn_dailymail-fp16",
        "summary_column": "summary",
        "fp16": true
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "test": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        },
        "train": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "job-20210616-01-2021-06-16-10-02-17-542",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-eu-central-1-8888888888/job-20210616-01-2021-06-16-10-02-17-542/source/sourcedir.tar.gz",
    "module_name": "run_summarization",
    "network_interface_name": "eth0",
    "num_cpus": 8,
    "num_gpus": 1,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "run_summarization.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"do_eval":true,"do_train":true,"eval_steps":500,"fp16":true,"learning_rate":5e-05,"logging_steps":500,"model_name_or_path":"patrickvonplaten/longformer2roberta-cnn_dailymail-fp16","num_train_epochs":2,"output_dir":"/opt/ml/model","per_device_eval_batch_size":1,"per_device_train_batch_size":1,"save_steps":500,"seed":8,"summary_column":"summary","text_column":"document","train_file":"/opt/ml/input/data/train_20210607.csv","validation_file":"/opt/ml/input/data/val_20210607.csv","warmup_steps":500}
SM_USER_ENTRY_POINT=run_summarization.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["test","train"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=run_summarization
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=8
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-eu-central-1-88888888888/job-20210616-01-2021-06-16-10-02-17-542/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"test":"/opt/ml/input/data/test","train":"/opt/ml/input/data/train"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"do_eval":true,"do_train":true,"eval_steps":500,"fp16":true,"learning_rate":5e-05,"logging_steps":500,"model_name_or_path":"patrickvonplaten/longformer2roberta-cnn_dailymail-fp16","num_train_epochs":2,"output_dir":"/opt/ml/model","per_device_eval_batch_size":1,"per_device_train_batch_size":1,"save_steps":500,"seed":8,"summary_column":"summary","text_column":"document","train_file":"/opt/ml/input/data/train_20210607.csv","validation_file":"/opt/ml/input/data/val_20210607.csv","warmup_steps":500},"input_config_dir":"/opt/ml/input/config","input_data_config":{"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"job-20210616-01-2021-06-16-10-02-17-542","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-eu-central-1-88888888888/job-20210616-01-2021-06-16-10-02-17-542/source/sourcedir.tar.gz","module_name":"run_summarization","network_interface_name":"eth0","num_cpus":8,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"run_summarization.py"}
SM_USER_ARGS=["--do_eval","True","--do_train","True","--eval_steps","500","--fp16","True","--learning_rate","5e-05","--logging_steps","500","--model_name_or_path","patrickvonplaten/longformer2roberta-cnn_dailymail-fp16","--num_train_epochs","2","--output_dir","/opt/ml/model","--per_device_eval_batch_size","1","--per_device_train_batch_size","1","--save_steps","500","--seed","8","--summary_column","summary","--text_column","document","--train_file","/opt/ml/input/data/train_20210607.csv","--validation_file","/opt/ml/input/data/val_20210607.csv","--warmup_steps","500"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TEST=/opt/ml/input/data/test
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_PER_DEVICE_EVAL_BATCH_SIZE=1
SM_HP_SEED=8
SM_HP_VALIDATION_FILE=/opt/ml/input/data/val_20210607.csv
SM_HP_DO_TRAIN=true
SM_HP_TEXT_COLUMN=document
SM_HP_NUM_TRAIN_EPOCHS=2
SM_HP_DO_EVAL=true
SM_HP_TRAIN_FILE=/opt/ml/input/data/train_20210607.csv
SM_HP_WARMUP_STEPS=500
SM_HP_SAVE_STEPS=500
SM_HP_OUTPUT_DIR=/opt/ml/model
SM_HP_EVAL_STEPS=500
SM_HP_PER_DEVICE_TRAIN_BATCH_SIZE=1
SM_HP_LEARNING_RATE=5e-05
SM_HP_LOGGING_STEPS=500
SM_HP_MODEL_NAME_OR_PATH=patrickvonplaten/longformer2roberta-cnn_dailymail-fp16
SM_HP_SUMMARY_COLUMN=summary
SM_HP_FP16=true
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages

Invoking script with the following command:

/opt/conda/bin/python3.6 run_summarization.py --do_eval True --do_train True --eval_steps 500 --fp16 True --learning_rate 5e-05 --logging_steps 500 --model_name_or_path patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 --num_train_epochs 2 --output_dir /opt/ml/model --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --save_steps 500 --seed 8 --summary_column summary --text_column document --train_file /opt/ml/input/data/train_20210607.csv --validation_file /opt/ml/input/data/val_20210607.csv --warmup_steps 500


06/16/2021 10:28:37 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: True
06/16/2021 10:28:37 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/opt/ml/model', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<IntervalStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=1, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=2.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=500, logging_dir='runs/Jun16_10-28-37_algo-1', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=500, save_strategy=<IntervalStrategy.STEPS: 'steps'>, save_steps=500, save_total_limit=None, no_cuda=False, seed=8, fp16=True, fp16_opt_level='O1', fp16_backend='auto', fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name='/opt/ml/model', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, length_column_name='length', report_to=[], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, mp_parameters='', sortish_sampler=False, predict_with_generate=False)
Traceback (most recent call last):
  File "run_summarization.py", line 595, in <module>
    main()
  File "run_summarization.py", line 326, in main
    datasets = load_dataset(extension, data_files=data_files)
  File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 737, in load_dataset
    **config_kwargs,
  File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 237, in __init__
    **config_kwargs,
  File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 348, in _create_builder_config
    config_id = builder_config.create_config_id(config_kwargs, custom_features=custom_features)
  File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 153, in create_config_id
    m.update(str(os.path.getmtime(data_file)))
  File "/opt/conda/lib/python3.6/genericpath.py", line 55, in getmtime
    return os.stat(filename).st_mtime
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/train_20210607.csv'

2021-06-16 10:28:38,206 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 run_summarization.py --do_eval True --do_train True --eval_steps 500 --fp16 True --learning_rate 5e-05 --logging_steps 500 --model_name_or_path patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 --num_train_epochs 2 --output_dir /opt/ml/model --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --save_steps 500 --seed 8 --summary_column summary --text_column document --train_file /opt/ml/input/data/train_20210607.csv --validation_file /opt/ml/input/data/val_20210607.csv --warmup_steps 500"
Traceback (most recent call last):
  File "run_summarization.py", line 595, in <module>
    main()
  File "run_summarization.py", line 326, in main
    datasets = load_dataset(extension, data_files=data_files)
  File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 737, in load_dataset
    **config_kwargs,
  File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 237, in __init__
    **config_kwargs,
  File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 348, in _create_builder_config
    config_id = builder_config.create_config_id(config_kwargs, custom_features=custom_features)
  File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 153, in create_config_id
    m.update(str(os.path.getmtime(data_file)))
  File "/opt/conda/lib/python3.6/genericpath.py", line 55, in getmtime
    return os.stat(filename).st_mtime
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/train_20210607.csv'

2021-06-16 10:29:02 Uploading - Uploading generated training model
2021-06-16 10:29:02 Failed - Training job failed
ProfilerReport-1623837737: Stopping

The git_config is as follows:

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.4.2'}

And the estimator is defined as follows:

huggingface_estimator = HuggingFace(
    entry_point='run_summarization.py',
    source_dir='./examples/seq2seq',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    base_job_name=job_name,
    checkpoint_s3_uri=checkpoint_s3_uri,
    use_spot_instances=True,
    max_wait=7200,  # should be equal to or greater than max_run, in seconds
    max_run=3600,   # expected max run in seconds [try 10 hours to start]
    transformers_version='4.6',
    pytorch_version='1.6',
    py_version='py36',
    hyperparameters=hyperparameters)

Thank you again!

Ben

I could reproduce your problem; it comes from the hyperparameter definition. Inside the training job, the files are saved to the following directories:

SM_CHANNEL_TEST=/opt/ml/input/data/test
SM_CHANNEL_TRAIN=/opt/ml/input/data/train

so the hyperparameters should be:

'train_file': '/opt/ml/input/data/train/train_20210607.csv'
'validation_file': '/opt/ml/input/data/test/val_20210607.csv'

The environment variables SM_HP_VALIDATION_FILE and SM_HP_TRAIN_FILE represent the values from the hyperparameters dict, not where the files are actually stored.
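Putting the fix together, a corrected hyperparameters block might look like this (a sketch; the file names, column names, and the train/test channel names are taken from the original post's fit() call):

```python
# The keys passed to estimator.fit({'train': ..., 'test': ...}) become
# subdirectories of /opt/ml/input/data/ inside the training container,
# so the channel name must appear in each file path.
hyperparameters = {
    'train_file': '/opt/ml/input/data/train/train_20210607.csv',
    'validation_file': '/opt/ml/input/data/test/val_20210607.csv',
    'text_column': 'document',
    'summary_column': 'summary',
}
```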

@benG Philipp is correct: the keys you use in the input dictionary {'key1': 's3://...', ..., 'keyN': 's3://...'} become local folder names in the SageMaker training instance, respectively

/opt/ml/input/data/key1/
...
/opt/ml/input/data/keyN/

so it seems you just missed adding those key names (train and test) when referencing the data within the SageMaker training instance.
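To avoid hard-coding the channel directories, the paths can also be derived from the SM_CHANNEL_* environment variables that SageMaker exports inside the container, one per key in the fit() dict. A minimal sketch (the fallback defaults are only there so the snippet runs outside a training container):

```python
import os

# SageMaker sets SM_CHANNEL_<KEY> for each key passed to estimator.fit(),
# e.g. fit({'train': ..., 'test': ...}) yields SM_CHANNEL_TRAIN and SM_CHANNEL_TEST.
train_dir = os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train')
test_dir = os.environ.get('SM_CHANNEL_TEST', '/opt/ml/input/data/test')

# Build the full file paths the script needs.
train_file = os.path.join(train_dir, 'train_20210607.csv')
validation_file = os.path.join(test_dir, 'val_20210607.csv')
```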

Reference: the SageMaker Training documentation, "How Amazon SageMaker Provides Training Information".
