Fine Tuning GPT-2 - Training job only using test sample size of 5

That has worked! Thank you @marshmellow77 and @philschmid

Just to reiterate the solution for clarity on this thread:

I had not defined train_file and validation_file correctly in my hyperparameters. By checking the CloudWatch logs I could see that

SM_CHANNEL_TEST=/opt/ml/input/data/test SM_CHANNEL_TRAIN=/opt/ml/input/data/training

where the folder names inside the training container (i.e. /opt/ml/…/…/test and /opt/ml/…/…/training) come from the keys used when calling huggingface_estimator.fit
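As a minimal sketch of that mapping (channel names taken from this thread; the S3 URIs below are placeholders): each key in the inputs dict passed to fit() becomes a channel, and SageMaker copies that channel's data into /opt/ml/input/data/&lt;channel_name&gt; inside the container.

```python
# Each key of the fit() inputs dict is a channel name; SageMaker makes the
# channel's data available at /opt/ml/input/data/<channel_name>.
fit_inputs = {
    'training': 's3://<bucket>/path/to/train.txt',  # placeholder S3 URIs
    'test': 's3://<bucket>/path/to/eval.txt',
}

container_dirs = {name: f'/opt/ml/input/data/{name}' for name in fit_inputs}
print(container_dirs)
# {'training': '/opt/ml/input/data/training', 'test': '/opt/ml/input/data/test'}
```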

# Here are the paths to my training and test datasets saved in S3
training_input_path = 's3://1111111111111-dev-gpt2-datasets/opt/ml/input/ft_input_data_sunday.txt'
test_input_path = 's3://1111111111111-dev-gpt2-datasets/opt/ml/input/ft_input_data_sunday_eval.txt'

hyperparameters = {
    'model_name_or_path': 'gpt2',
    'output_dir': '/opt/ml/model',
    # the file names must match the object names uploaded to S3 above
    'train_file': '/opt/ml/input/data/training/ft_input_data_sunday.txt',
    'validation_file': '/opt/ml/input/data/test/ft_input_data_sunday_eval.txt',
    'do_train': True,
    'do_eval': True,
    'per_device_eval_batch_size': 2,
    'per_device_train_batch_size': 2,
    'gradient_accumulation_steps': 8}
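Since the file name inside the container must match the object name uploaded to S3, one way to keep the two from drifting apart is to build both paths from the same variables. A minimal sketch, where channel_path is a hypothetical helper (not part of the SageMaker SDK):

```python
import os

# Hypothetical helper: a file delivered through a SageMaker channel lands at
# /opt/ml/input/data/<channel>/<filename>, where <filename> is the S3 object name.
def channel_path(channel, filename):
    return os.path.join('/opt/ml/input/data', channel, filename)

train_filename = 'ft_input_data_sunday.txt'       # object names used in this thread
eval_filename = 'ft_input_data_sunday_eval.txt'

hyperparameters = {
    'model_name_or_path': 'gpt2',
    'output_dir': '/opt/ml/model',
    'train_file': channel_path('training', train_filename),
    'validation_file': channel_path('test', eval_filename),
    'do_train': True,
    'do_eval': True,
}
```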

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.17.0'}

# create the Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./examples/pytorch/language-modeling',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyperparameters,
    output_path=output_bucket,
    base_job_name='GPT2-v1'
)

# start the training job
huggingface_estimator.fit(inputs={'training': training_input_path,
                                  'test': test_input_path})