That worked! Thank you @marshmellow77 and @philschmid
To reiterate the solution for clarity on this thread: I had not defined train_file and validation_file correctly in my hyperparameters. Checking the CloudWatch logs, I could see that
SM_CHANNEL_TEST=/opt/ml/input/data/test SM_CHANNEL_TRAIN=/opt/ml/input/data/training
where the folder names on the training instance (i.e. /opt/ml/input/data/test and /opt/ml/input/data/training) come from the keys used when calling huggingface_estimator.fit.
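In other words, each key passed to fit() becomes an input channel: SageMaker copies that channel's S3 data into /opt/ml/input/data/<channel> and exposes the path through an SM_CHANNEL_* environment variable. As a minimal sketch (not needed for run_clm.py, which takes explicit file paths, but handy for debugging), an entry-point script can discover its channels like this:

import os

# Every fit() channel shows up as an SM_CHANNEL_* environment variable
# whose value is the local directory the S3 data was copied into.
channels = {k[len('SM_CHANNEL_'):].lower(): v
            for k, v in os.environ.items() if k.startswith('SM_CHANNEL_')}
print(channels)  # e.g. {'train': '/opt/ml/input/data/training', 'test': '/opt/ml/input/data/test'}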
# Here are the paths to my training and test datasets saved in S3
training_input_path = 's3://1111111111111-dev-gpt2-datasets/opt/ml/input/ft_input_data_sunday.txt'
test_input_path = 's3://1111111111111-dev-gpt2-datasets/opt/ml/input/ft_input_data_sunday_eval.txt'
hyperparameters = {
    'model_name_or_path': 'gpt2',
    'output_dir': '/opt/ml/model',
    # the directory must match the channel name used in fit() below, and the
    # file name must match the object uploaded to S3
    'train_file': '/opt/ml/input/data/training/ft_input_data_sunday.txt',
    'validation_file': '/opt/ml/input/data/test/ft_input_data_sunday_eval.txt',
    'do_train': True,
    'do_eval': True,
    'per_device_eval_batch_size': 2,
    'per_device_train_batch_size': 2,
    'gradient_accumulation_steps': 8
}
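# Note: SageMaker passes each entry of this dict to the entry point as a
# command-line argument, so the training job is roughly equivalent to running
#   python run_clm.py --model_name_or_path gpt2 --train_file ... --do_train ...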
# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.17.0'}
# create the Hugging Face estimator (`role` and `output_bucket` are defined earlier)
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./examples/pytorch/language-modeling',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyperparameters,
    output_path=output_bucket,
    base_job_name='GPT2-v1'
)
# start the training job; the keys here ('training', 'test') determine
# the channel folder names under /opt/ml/input/data/
huggingface_estimator.fit(inputs={'training': training_input_path,
                                  'test': test_input_path})
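One last gotcha worth checking: SageMaker copies each S3 object into its channel directory under the object's own file name, so the basenames in train_file and validation_file must match what is actually in the bucket. A quick sanity check, assuming the bucket and prefix from the paths above:

import boto3

# list the uploaded dataset objects to confirm their file names match
# the train_file / validation_file paths in the hyperparameters
s3 = boto3.client('s3')
resp = s3.list_objects_v2(Bucket='1111111111111-dev-gpt2-datasets',
                          Prefix='opt/ml/input/')
for obj in resp.get('Contents', []):
    print(obj['Key'])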