Fine-tuning GPT-2 - Training job only using a sample size of 5

Hello,

I have been trying to fine-tune GPT-2 for causal language modelling. I have a sample dataset of 320 examples, of which 300 are used for training and 20 for evaluation.

Once the training job completes, the training metrics state that only 5 training samples were used. I need to use a batch size of 2, as I run into CUDA memory issues otherwise. Are there any other parameters I should change to increase the number of samples used?

Here is my code:

hyperparameters = {
    'model_name_or_path': 'gpt2',
    'output_dir': '/opt/ml/model',
    'train_file': 'https://dev-gptj-training.notebook.eu-west-1.sagemaker.aws/edit/input_data/raw_data/ft_input_data.txt',
    'validation_file': 'https://dev-gptj-training.notebook.eu-west-1.sagemaker.aws/edit/input_data/raw_data/ft_input_data_eval.txt',
    'do_train': True,
    'do_eval': True,
    'per_device_eval_batch_size': 2,
    'per_device_train_batch_size': 2,
    'gradient_accumulation_steps': 8
}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.17.0'}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./examples/pytorch/language-modeling',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyperparameters,
    output_path=output_bucket,
    base_job_name='GPT2-v1'
)

# starting the train job
huggingface_estimator.fit(inputs={'training': 's3://1111111111111-dev-gpt2-datasets/gpt-2/datasets/ft_input_data_sunday.txt',
                                  'test': 's3://1111111111111-dev-gpt2-datasets/gpt-2/datasets/ft_input_data_sunday_eval.txt'})

and here are the training job logs:

timestamp,message
1675636103415,"[INFO|tokenization_utils_base.py:1786] 2023-02-05 22:28:22,822 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer_config.json from cache at None"
1675636103415,"[INFO|tokenization_utils_base.py:1786] 2023-02-05 22:28:22,822 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer_config.json from cache at None"
1675636103415,"[INFO|configuration_utils.py:648] 2023-02-05 22:28:23,111 >> loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51"
1675636103415,"[INFO|configuration_utils.py:648] 2023-02-05 22:28:23,111 >> loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51"
1675636103415,"[INFO|configuration_utils.py:684] 2023-02-05 22:28:23,112 >> Model config GPT2Config {
  ""_name_or_path"": ""gpt2"",
  ""activation_function"": ""gelu_new"",
  ""architectures"": [
    ""GPT2LMHeadModel""
  ],
  ""attn_pdrop"": 0.1,
  ""bos_token_id"": 50256,
  ""embd_pdrop"": 0.1,
  ""eos_token_id"": 50256,
  ""initializer_range"": 0.02,
  ""layer_norm_epsilon"": 1e-05,
  ""model_type"": ""gpt2"",
  ""n_ctx"": 1024,
  ""n_embd"": 768,
  ""n_head"": 12,
  ""n_inner"": null,
  ""n_layer"": 12,
  ""n_positions"": 1024,
  ""reorder_and_upcast_attn"": false,
  ""resid_pdrop"": 0.1,
  ""scale_attn_by_inverse_layer_idx"": false,
  ""scale_attn_weights"": true,
  ""summary_activation"": null,
  ""summary_first_dropout"": 0.1,
  ""summary_proj_to_labels"": true,
  ""summary_type"": ""cls_index"",
  ""summary_use_proj"": true,
  ""task_specific_params"": {
    ""text-generation"": {
      ""do_sample"": true,
      ""max_length"": 50
    }
  },
  ""transformers_version"": ""4.17.0"",
  ""use_cache"": true,
  ""vocab_size"": 50257"
1675636103415,}
1675636103416,"[INFO|configuration_utils.py:684] 2023-02-05 22:28:23,112 >> Model config GPT2Config {
  ""_name_or_path"": ""gpt2"",
  ""activation_function"": ""gelu_new"",
  ""architectures"": [
    ""GPT2LMHeadModel""
  ],
  ""attn_pdrop"": 0.1,
  ""bos_token_id"": 50256,
  ""embd_pdrop"": 0.1,
  ""eos_token_id"": 50256,
  ""initializer_range"": 0.02,
  ""layer_norm_epsilon"": 1e-05,
  ""model_type"": ""gpt2"",
  ""n_ctx"": 1024,
  ""n_embd"": 768,
  ""n_head"": 12,
  ""n_inner"": null,
  ""n_layer"": 12,
  ""n_positions"": 1024,
  ""reorder_and_upcast_attn"": false,
  ""resid_pdrop"": 0.1,
  ""scale_attn_by_inverse_layer_idx"": false,
  ""scale_attn_weights"": true,
  ""summary_activation"": null,
  ""summary_first_dropout"": 0.1,
  ""summary_proj_to_labels"": true,
  ""summary_type"": ""cls_index"",
  ""summary_use_proj"": true,
  ""task_specific_params"": {
    ""text-generation"": {
      ""do_sample"": true,
      ""max_length"": 50
    }
  },
  ""transformers_version"": ""4.17.0"",
  ""use_cache"": true,
  ""vocab_size"": 50257"
1675636103416,}
1675636104416,"[INFO|file_utils.py:2215] 2023-02-05 22:28:23,492 >> https://huggingface.co/gpt2/resolve/main/pytorch_model.bin not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpixj9yloj"
1675636104416,"[INFO|file_utils.py:2215] 2023-02-05 22:28:23,492 >> https://huggingface.co/gpt2/resolve/main/pytorch_model.bin not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpixj9yloj"
1675636104416,"Downloading:   0%|          | 0.00/523M [00:00<?, ?B/s]"
1675636104416,"Downloading:   1%|          | 5.59M/523M [00:00<00:09, 58.6MB/s]"
1675636104416,"Downloading:   2%|▏         | 11.2M/523M [00:00<00:09, 56.3MB/s]"
1675636104416,"Downloading:   3%|β–Ž         | 16.6M/523M [00:00<00:09, 56.1MB/s]"
1675636104417,"Downloading:   4%|▍         | 22.0M/523M [00:00<00:09, 56.5MB/s]"
1675636104417,"Downloading:   5%|β–Œ         | 27.5M/523M [00:00<00:09, 56.7MB/s]"
1675636104417,"Downloading:   6%|β–‹         | 32.9M/523M [00:00<00:09, 56.9MB/s]"
1675636104417,"Downloading:   7%|β–‹         | 38.5M/523M [00:00<00:08, 57.5MB/s]"
1675636104417,"Downloading:   8%|β–Š         | 44.0M/523M [00:00<00:08, 56.6MB/s]"
1675636105417,"Downloading:  10%|β–‰         | 49.7M/523M [00:00<00:08, 57.7MB/s]"
1675636105417,"Downloading:  11%|β–ˆ         | 55.3M/523M [00:01<00:08, 57.0MB/s]"
1675636105417,"Downloading:  12%|β–ˆβ–        | 60.7M/523M [00:01<00:08, 56.4MB/s]"
1675636105417,"Downloading:  13%|β–ˆβ–Ž        | 66.1M/523M [00:01<00:08, 53.3MB/s]"
1675636105417,"Downloading:  14%|β–ˆβ–Ž        | 71.2M/523M [00:01<00:11, 40.3MB/s]"
1675636105417,"Downloading:  14%|β–ˆβ–        | 75.5M/523M [00:01<00:11, 39.1MB/s]"
1675636105417,"Downloading:  15%|β–ˆβ–Œ        | 79.8M/523M [00:01<00:11, 40.3MB/s]"
1675636105417,"Downloading:  16%|β–ˆβ–Œ        | 83.8M/523M [00:01<00:12, 36.3MB/s]"
1675636106418,"Downloading:  17%|β–ˆβ–‹        | 87.5M/523M [00:01<00:12, 36.4MB/s]"
1675636106418,"Downloading:  17%|β–ˆβ–‹        | 91.1M/523M [00:02<00:12, 36.8MB/s]"
1675636106418,"Downloading:  18%|β–ˆβ–Š        | 95.9M/523M [00:02<00:11, 40.4MB/s]"
1675636106418,"Downloading:  19%|β–ˆβ–‰        | 99.9M/523M [00:02<00:12, 36.5MB/s]"
1675636106418,"Downloading:  20%|β–ˆβ–ˆ        | 106M/523M [00:02<00:09, 44.4MB/s]"
1675636106418,"Downloading:  22%|β–ˆβ–ˆβ–       | 113M/523M [00:02<00:08, 50.7MB/s]"
1675636106418,"Downloading:  23%|β–ˆβ–ˆβ–Ž       | 120M/523M [00:02<00:07, 56.5MB/s]"
1675636106418,"Downloading:  24%|β–ˆβ–ˆβ–       | 125M/523M [00:02<00:08, 51.1MB/s]"
1675636106418,"Downloading:  25%|β–ˆβ–ˆβ–       | 130M/523M [00:02<00:08, 51.2MB/s]"
1675636107418,"Downloading:  26%|β–ˆβ–ˆβ–Œ       | 135M/523M [00:02<00:07, 51.6MB/s]"
1675636107418,"Downloading:  27%|β–ˆβ–ˆβ–‹       | 141M/523M [00:03<00:07, 55.2MB/s]"
1675636107419,"Downloading:  28%|β–ˆβ–ˆβ–Š       | 148M/523M [00:03<00:06, 59.3MB/s]"
1675636107419,"Downloading:  29%|β–ˆβ–ˆβ–‰       | 154M/523M [00:03<00:06, 60.0MB/s]"
1675636107419,"Downloading:  31%|β–ˆβ–ˆβ–ˆ       | 161M/523M [00:03<00:05, 63.7MB/s]"
1675636107419,"Downloading:  32%|β–ˆβ–ˆβ–ˆβ–      | 168M/523M [00:03<00:05, 65.8MB/s]"
1675636107419,"Downloading:  33%|β–ˆβ–ˆβ–ˆβ–Ž      | 174M/523M [00:03<00:05, 63.8MB/s]"
1675636107419,"Downloading:  34%|β–ˆβ–ˆβ–ˆβ–      | 180M/523M [00:03<00:05, 63.9MB/s]"
1675636107419,"Downloading:  36%|β–ˆβ–ˆβ–ˆβ–Œ      | 187M/523M [00:03<00:05, 65.9MB/s]"
1675636107419,"Downloading:  37%|β–ˆβ–ˆβ–ˆβ–‹      | 193M/523M [00:03<00:05, 62.7MB/s]"
1675636108419,"Downloading:  38%|β–ˆβ–ˆβ–ˆβ–Š      | 199M/523M [00:03<00:05, 62.6MB/s]"
1675636108419,"Downloading:  39%|β–ˆβ–ˆβ–ˆβ–‰      | 206M/523M [00:04<00:05, 64.0MB/s]"
1675636108419,"Downloading:  41%|β–ˆβ–ˆβ–ˆβ–ˆ      | 212M/523M [00:04<00:06, 50.6MB/s]"
1675636108419,"Downloading:  42%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 217M/523M [00:04<00:06, 50.0MB/s]"
1675636108419,"Downloading:  43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 223M/523M [00:04<00:05, 53.3MB/s]"
1675636108419,"Downloading:  44%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 229M/523M [00:04<00:05, 54.9MB/s]"
1675636108419,"Downloading:  45%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 236M/523M [00:04<00:05, 59.5MB/s]"
1675636108419,"Downloading:  46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 242M/523M [00:04<00:04, 60.6MB/s]"
1675636108419,"Downloading:  47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 248M/523M [00:04<00:04, 61.1MB/s]"
1675636109420,"Downloading:  48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 254M/523M [00:04<00:04, 60.7MB/s]"
1675636109420,"Downloading:  50%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 259M/523M [00:05<00:04, 56.5MB/s]"
1675636109420,"Downloading:  51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 266M/523M [00:05<00:04, 59.5MB/s]"
1675636109420,"Downloading:  52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 272M/523M [00:05<00:04, 62.2MB/s]"
1675636109420,"Downloading:  53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 278M/523M [00:05<00:04, 62.7MB/s]"
1675636109420,"Downloading:  55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 285M/523M [00:05<00:03, 64.3MB/s]"
1675636109420,"Downloading:  56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 292M/523M [00:05<00:03, 65.9MB/s]"
1675636109420,"Downloading:  57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 298M/523M [00:05<00:03, 65.9MB/s]"
1675636109420,"Downloading:  58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 304M/523M [00:05<00:03, 65.0MB/s]"
1675636110421,"Downloading:  59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 310M/523M [00:05<00:03, 61.3MB/s]"
1675636110421,"Downloading:  61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 316M/523M [00:06<00:03, 57.6MB/s]"
1675636110421,"Downloading:  62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 322M/523M [00:06<00:03, 57.9MB/s]"
1675636110421,"Downloading:  63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 328M/523M [00:06<00:04, 48.1MB/s]"
1675636110421,"Downloading:  64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 334M/523M [00:06<00:03, 53.2MB/s]"
1675636110421,"Downloading:  65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 341M/523M [00:06<00:03, 58.2MB/s]"
1675636110421,"Downloading:  66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 347M/523M [00:06<00:03, 51.3MB/s]"
1675636110421,"Downloading:  68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 354M/523M [00:06<00:03, 56.5MB/s]"
1675636110421,"Downloading:  69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 360M/523M [00:06<00:02, 59.9MB/s]"
1675636111421,"Downloading:  70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 367M/523M [00:06<00:02, 61.7MB/s]"
1675636111422,"Downloading:  71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 373M/523M [00:07<00:02, 62.6MB/s]"
1675636111422,"Downloading:  73%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž  | 379M/523M [00:07<00:02, 64.5MB/s]"
1675636111422,"Downloading:  74%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 386M/523M [00:07<00:02, 64.7MB/s]"
1675636111422,"Downloading:  75%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 392M/523M [00:07<00:02, 64.4MB/s]"
1675636111422,"Downloading:  76%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ  | 398M/523M [00:07<00:02, 65.2MB/s]"
1675636111422,"Downloading:  77%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹  | 405M/523M [00:07<00:01, 64.0MB/s]"
1675636111422,"Downloading:  79%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š  | 411M/523M [00:07<00:01, 60.4MB/s]"
1675636111422,"Downloading:  80%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰  | 417M/523M [00:07<00:01, 62.8MB/s]"
1675636112422,"Downloading:  81%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  | 424M/523M [00:07<00:01, 63.8MB/s]"
1675636112422,"Downloading:  82%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 430M/523M [00:08<00:01, 63.6MB/s]"
1675636112422,"Downloading:  83%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 436M/523M [00:08<00:01, 60.5MB/s]"
1675636112422,"Downloading:  85%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 443M/523M [00:08<00:01, 64.8MB/s]"
1675636112422,"Downloading:  86%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 449M/523M [00:08<00:01, 64.9MB/s]"
1675636112422,"Downloading:  87%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 455M/523M [00:08<00:01, 56.8MB/s]"
1675636112422,"Downloading:  88%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 461M/523M [00:08<00:01, 58.2MB/s]"
1675636112422,"Downloading:  89%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 467M/523M [00:08<00:01, 49.9MB/s]"
1675636112422,"Downloading:  91%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 473M/523M [00:08<00:00, 53.5MB/s]"
1675636113423,"Downloading:  92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 479M/523M [00:08<00:00, 57.0MB/s]"
1675636113423,"Downloading:  93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 486M/523M [00:09<00:00, 61.3MB/s]"
1675636113423,"Downloading:  94%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 493M/523M [00:09<00:00, 63.5MB/s]"
1675636113423,"Downloading:  96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 500M/523M [00:09<00:00, 65.2MB/s]"
1675636113423,"Downloading:  97%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 506M/523M [00:09<00:00, 66.7MB/s]"
1675636113423,"Downloading:  98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 514M/523M [00:09<00:00, 69.7MB/s]"
1675636113423,"Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 521M/523M [00:09<00:00, 70.7MB/s]"
1675636113423,"Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 523M/523M [00:09<00:00, 57.3MB/s]"
1675636113423,"[INFO|file_utils.py:2219] 2023-02-05 22:28:33,090 >> storing https://huggingface.co/gpt2/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/752929ace039baa8ef70fe21cdf9ab9445773d20e733cf693d667982e210837e.323c769945a351daa25546176f8208b3004b6f563438a7603e7932bae9025925"
1675636113423,"[INFO|file_utils.py:2219] 2023-02-05 22:28:33,090 >> storing https://huggingface.co/gpt2/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/752929ace039baa8ef70fe21cdf9ab9445773d20e733cf693d667982e210837e.323c769945a351daa25546176f8208b3004b6f563438a7603e7932bae9025925"
1675636113423,"[INFO|file_utils.py:2227] 2023-02-05 22:28:33,091 >> creating metadata file for /root/.cache/huggingface/transformers/752929ace039baa8ef70fe21cdf9ab9445773d20e733cf693d667982e210837e.323c769945a351daa25546176f8208b3004b6f563438a7603e7932bae9025925"
1675636113423,"[INFO|file_utils.py:2227] 2023-02-05 22:28:33,091 >> creating metadata file for /root/.cache/huggingface/transformers/752929ace039baa8ef70fe21cdf9ab9445773d20e733cf693d667982e210837e.323c769945a351daa25546176f8208b3004b6f563438a7603e7932bae9025925"
1675636113423,"[INFO|modeling_utils.py:1431] 2023-02-05 22:28:33,091 >> loading weights file https://huggingface.co/gpt2/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/752929ace039baa8ef70fe21cdf9ab9445773d20e733cf693d667982e210837e.323c769945a351daa25546176f8208b3004b6f563438a7603e7932bae9025925"
1675636113423,"[INFO|modeling_utils.py:1431] 2023-02-05 22:28:33,091 >> loading weights file https://huggingface.co/gpt2/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/752929ace039baa8ef70fe21cdf9ab9445773d20e733cf693d667982e210837e.323c769945a351daa25546176f8208b3004b6f563438a7603e7932bae9025925"
1675636115424,"[INFO|modeling_utils.py:1702] 2023-02-05 22:28:35,010 >> All model checkpoint weights were used when initializing GPT2LMHeadModel."
1675636115424,"[INFO|modeling_utils.py:1710] 2023-02-05 22:28:35,010 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2."
1675636115424,"If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training."
1675636115424,"[INFO|modeling_utils.py:1702] 2023-02-05 22:28:35,010 >> All model checkpoint weights were used when initializing GPT2LMHeadModel."
1675636115424,"[INFO|modeling_utils.py:1710] 2023-02-05 22:28:35,010 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2."
1675636115424,"If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training."
1675636115424,"02/05/2023 22:28:35 - WARNING - datasets.fingerprint - Parameter 'function'=<function main.<locals>.tokenize_function at 0x7f1e0ad49ee0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed."
1675636115425,"Running tokenizer on dataset:   0%|          | 0/1 [00:00<?, ?ba/s]"
1675636115425,"[WARNING|tokenization_utils_base.py:3397] 2023-02-05 22:28:35,087 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1226 > 1024). Running this sequence through the model will result in indexing errors"
1675636115425,"[WARNING|tokenization_utils_base.py:3397] 2023-02-05 22:28:35,087 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1226 > 1024). Running this sequence through the model will result in indexing errors"
1675636115425,"[WARNING|run_clm.py:378] 2023-02-05 22:28:35,087 >> ^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits before being passed to the model."
1675636115425,"[WARNING|run_clm.py:378] 2023-02-05 22:28:35,087 >> ^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits before being passed to the model."
1675636115425,02/05/2023 22:28:35 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/text/default-5872c4bdb0144370/0.0.0/08f6fb1dd2dab0a18ea441c359e1d63794ea8cb53e7863e6edf8fc5655e47ec4/cache-1c80317fa3b1799d.arrow
1675636115425,"Running tokenizer on dataset: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 13.01ba/s]"
1675636115425,"02/05/2023 22:28:35 - INFO - datasets.fingerprint - Parameter 'function'=<function main.<locals>.tokenize_function at 0x7f1e0ad3ea60> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead."
1675636115425,"Running tokenizer on dataset:   0%|          | 0/1 [00:00<?, ?ba/s]"
1675636115425,02/05/2023 22:28:35 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/text/default-5872c4bdb0144370/0.0.0/08f6fb1dd2dab0a18ea441c359e1d63794ea8cb53e7863e6edf8fc5655e47ec4/cache-bdd640fb06671ad1.arrow
1675636115425,"Running tokenizer on dataset: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 106.55ba/s]"
1675636115425,"02/05/2023 22:28:35 - INFO - datasets.fingerprint - Parameter 'function'=<function main.<locals>.group_texts at 0x7f1e0ad49ee0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead."
1675636115425,"Grouping texts in chunks of 1024:   0%|          | 0/1 [00:00<?, ?ba/s]"
1675636115425,02/05/2023 22:28:35 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/text/default-5872c4bdb0144370/0.0.0/08f6fb1dd2dab0a18ea441c359e1d63794ea8cb53e7863e6edf8fc5655e47ec4/cache-3eb13b9046685257.arrow
1675636115425,"Grouping texts in chunks of 1024: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 98.70ba/s]"
1675636115425,"02/05/2023 22:28:35 - INFO - datasets.fingerprint - Parameter 'function'=<function main.<locals>.group_texts at 0x7f1e0ad49ee0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead."
1675636115425,"Grouping texts in chunks of 1024:   0%|          | 0/1 [00:00<?, ?ba/s]"
1675636115425,02/05/2023 22:28:35 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/text/default-5872c4bdb0144370/0.0.0/08f6fb1dd2dab0a18ea441c359e1d63794ea8cb53e7863e6edf8fc5655e47ec4/cache-23b8c1e9392456de.arrow
1675636115425,"Grouping texts in chunks of 1024: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 100.88ba/s]"
1675636116425,"02/05/2023 22:28:35 - INFO - datasets.utils.file_utils - https://raw.githubusercontent.com/huggingface/datasets/1.18.4/metrics/accuracy/accuracy.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/tmpat4_lqt5"
1675636116425,"Downloading:   0%|          | 0.00/1.41k [00:00<?, ?B/s]"
1675636116425,"Downloading: 3.19kB [00:00, 2.15MB/s]"
1675636116426,02/05/2023 22:28:35 - INFO - datasets.utils.file_utils - storing https://raw.githubusercontent.com/huggingface/datasets/1.18.4/metrics/accuracy/accuracy.py in cache at /root/.cache/huggingface/datasets/downloads/18ec2a1ed9dbcfd6ecff70a4f0d0d33fd5cc40c51c3c816376dc3d0b3e30219f.6913c0dc30de3cef9d6bc88cc182661800cb937f0fe5b01ffa731617105a32ac.py
1675636116426,02/05/2023 22:28:35 - INFO - datasets.utils.file_utils - creating metadata file for /root/.cache/huggingface/datasets/downloads/18ec2a1ed9dbcfd6ecff70a4f0d0d33fd5cc40c51c3c816376dc3d0b3e30219f.6913c0dc30de3cef9d6bc88cc182661800cb937f0fe5b01ffa731617105a32ac.py
1675636120427,"/opt/conda/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn("
1675636120427,"[INFO|trainer.py:1279] 2023-02-05 22:28:40,375 >> ***** Running training *****"
1675636120427,"[INFO|trainer.py:1279] 2023-02-05 22:28:40,375 >> ***** Running training *****"
1675636120427,"[INFO|trainer.py:1280] 2023-02-05 22:28:40,375 >>   Num examples = 5"
1675636120427,"[INFO|trainer.py:1281] 2023-02-05 22:28:40,375 >>   Num Epochs = 3"
1675636120427,"[INFO|trainer.py:1282] 2023-02-05 22:28:40,375 >>   Instantaneous batch size per device = 2"
1675636120427,"[INFO|trainer.py:1280] 2023-02-05 22:28:40,375 >>   Num examples = 5"
1675636120427,"[INFO|trainer.py:1281] 2023-02-05 22:28:40,375 >>   Num Epochs = 3"
1675636120427,"[INFO|trainer.py:1282] 2023-02-05 22:28:40,375 >>   Instantaneous batch size per device = 2"
1675636120427,"[INFO|trainer.py:1283] 2023-02-05 22:28:40,375 >>   Total train batch size (w. parallel, distributed & accumulation) = 16"
1675636120427,"[INFO|trainer.py:1284] 2023-02-05 22:28:40,375 >>   Gradient Accumulation steps = 8"
1675636120427,"[INFO|trainer.py:1285] 2023-02-05 22:28:40,375 >>   Total optimization steps = 3"
1675636120427,"[INFO|trainer.py:1283] 2023-02-05 22:28:40,375 >>   Total train batch size (w. parallel, distributed & accumulation) = 16"
1675636120427,"[INFO|trainer.py:1284] 2023-02-05 22:28:40,375 >>   Gradient Accumulation steps = 8"
1675636120427,"[INFO|trainer.py:1285] 2023-02-05 22:28:40,375 >>   Total optimization steps = 3"
1675636120427,"0%|          | 0/3 [00:00<?, ?it/s]"
1675636121428,[2023-02-05 22:28:40.545 algo-1:49 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
1675636121428,[2023-02-05 22:28:40.716 algo-1:49 INFO profiler_config_parser.py:111] User has disabled profiler.
1675636121428,[2023-02-05 22:28:40.717 algo-1:49 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
1675636121428,[2023-02-05 22:28:40.718 algo-1:49 INFO hook.py:201] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
1675636121428,[2023-02-05 22:28:40.718 algo-1:49 INFO hook.py:254] Saving to /opt/ml/output/tensors
1675636121428,[2023-02-05 22:28:40.718 algo-1:49 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
1675636122428,"33%|β–ˆβ–ˆβ–ˆβ–Ž      | 1/3 [00:01<00:03,  1.56s/it]"
1675636123429,"67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 2/3 [00:02<00:00,  1.03it/s]"
1675636123429,"100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:02<00:00,  1.28it/s]"
1675636123429,"[INFO|trainer.py:1508] 2023-02-05 22:28:43,047 >> "
1675636123429,Training completed. Do not forget to share your model on huggingface.co/models =)
1675636123429,"[INFO|trainer.py:1508] 2023-02-05 22:28:43,047 >> "
1675636123429,Training completed. Do not forget to share your model on huggingface.co/models =)
1675636123429,"{'train_runtime': 2.6716, 'train_samples_per_second': 5.615, 'train_steps_per_second': 1.123, 'train_loss': 1.36589781443278, 'epoch': 3.0}"
1675636123429,"100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:02<00:00,  1.28it/s]"
1675636123429,"100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:02<00:00,  1.12it/s]"
1675636123429,"[INFO|trainer.py:2139] 2023-02-05 22:28:43,048 >> Saving model checkpoint to /opt/ml/model"
1675636123429,"[INFO|trainer.py:2139] 2023-02-05 22:28:43,048 >> Saving model checkpoint to /opt/ml/model"
1675636123429,"[INFO|configuration_utils.py:439] 2023-02-05 22:28:43,049 >> Configuration saved in /opt/ml/model/config.json"
1675636123429,"[INFO|configuration_utils.py:439] 2023-02-05 22:28:43,049 >> Configuration saved in /opt/ml/model/config.json"
1675636124429,"[INFO|modeling_utils.py:1084] 2023-02-05 22:28:43,964 >> Model weights saved in /opt/ml/model/pytorch_model.bin"
1675636124430,"[INFO|modeling_utils.py:1084] 2023-02-05 22:28:43,964 >> Model weights saved in /opt/ml/model/pytorch_model.bin"
1675636124430,"[INFO|tokenization_utils_base.py:2094] 2023-02-05 22:28:43,965 >> tokenizer config file saved in /opt/ml/model/tokenizer_config.json"
1675636124430,"[INFO|tokenization_utils_base.py:2094] 2023-02-05 22:28:43,965 >> tokenizer config file saved in /opt/ml/model/tokenizer_config.json"
1675636124430,"[INFO|tokenization_utils_base.py:2100] 2023-02-05 22:28:43,965 >> Special tokens file saved in /opt/ml/model/special_tokens_map.json"
1675636124430,"[INFO|tokenization_utils_base.py:2100] 2023-02-05 22:28:43,965 >> Special tokens file saved in /opt/ml/model/special_tokens_map.json"
1675636124430,***** train metrics *****
1675636124430,"epoch                    =        3.0
  train_loss               =     1.3659
  train_runtime            = 0:00:02.67
  train_samples            =          5
  train_samples_per_second =      5.615
  train_steps_per_second   =      1.123"
1675636124430,02/05/2023 22:28:44 - INFO - __main__ - *** Evaluate ***
1675636124430,"[INFO|trainer.py:2389] 2023-02-05 22:28:44,077 >> ***** Running Evaluation *****"
1675636124430,"[INFO|trainer.py:2389] 2023-02-05 22:28:44,077 >> ***** Running Evaluation *****"
1675636124430,"[INFO|trainer.py:2391] 2023-02-05 22:28:44,077 >>   Num examples = 5"
1675636124430,"[INFO|trainer.py:2394] 2023-02-05 22:28:44,077 >>   Batch size = 2"
1675636124430,"[INFO|trainer.py:2391] 2023-02-05 22:28:44,077 >>   Num examples = 5"
1675636124430,"[INFO|trainer.py:2394] 2023-02-05 22:28:44,077 >>   Batch size = 2"
1675636124430,"0%|          | 0/3 [00:00<?, ?it/s]"
1675636124430,"100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00, 26.31it/s]"
1675636124430,02/05/2023 22:28:44 - INFO - datasets.metric - Removing /root/.cache/huggingface/metrics/accuracy/default/default_experiment-1-0.arrow
1675636124430,"100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00, 20.81it/s]"
1675636124430,***** eval metrics *****
1675636124430,"epoch                   =        3.0
  eval_accuracy           =     0.6016
  eval_loss               =     2.4612
  eval_runtime            = 0:00:00.21
  eval_samples            =          5
  eval_samples_per_second =     22.761
  eval_steps_per_second   =     13.657
  perplexity              =    11.7194"
1675636125431,"[INFO|modelcard.py:460] 2023-02-05 22:28:44,666 >> Dropping the following result as it does not have all the necessary fields:"
1675636125431,"{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}, 'metrics': [{'name': 'Accuracy', 'type': 'accuracy', 'value': 0.601564027370479}]}"
1675636125431,"[INFO|modelcard.py:460] 2023-02-05 22:28:44,666 >> Dropping the following result as it does not have all the necessary fields:"
1675636125431,"{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}, 'metrics': [{'name': 'Accuracy', 'type': 'accuracy', 'value': 0.601564027370479}]}"
1675636125431,"2023-02-05 22:28:45,239 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code."
1675636125431,"2023-02-05 22:28:45,239 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 0 from exiting process."
1675636125431,"2023-02-05 22:28:45,240 sagemaker-training-toolkit INFO     Reporting training SUCCESS"

I am pretty sure the managed training job cannot access these files, since those are URLs protected behind a notebook. You need to pass the training files into the managed job as done here: https://github.com/huggingface/notebooks/blob/main/sagemaker/01_getting_started_pytorch/sagemaker-notebook.ipynb and then provide the "path" as the train_file hyperparameter, e.g. /opt/ml/input/data/train/dataset.txt
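For illustration, here is a minimal sketch of that pattern (the bucket name, file names and the 'train'/'test' channel keys are placeholders, and it reuses the estimator settings from the original post). Each channel passed to .fit() is mounted inside the container under /opt/ml/input/data/<channel>, so the train_file/validation_file hyperparameters point at those local paths rather than at URLs:

import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()  # SageMaker execution role used by the training job

# the two text files are uploaded to S3 beforehand and passed in as named input channels
inputs = {
    'train': 's3://my-bucket/gpt-2/datasets/ft_input_data.txt',
    'test': 's3://my-bucket/gpt-2/datasets/ft_input_data_eval.txt'
}

# each channel is mounted at /opt/ml/input/data/<channel> inside the training container,
# so train_file/validation_file reference those local paths instead of URLs
hyperparameters = {
    'model_name_or_path': 'gpt2',
    'output_dir': '/opt/ml/model',
    'train_file': '/opt/ml/input/data/train/ft_input_data.txt',
    'validation_file': '/opt/ml/input/data/test/ft_input_data_eval.txt',
    'do_train': True,
    'do_eval': True
}

git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.17.0'}

huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./examples/pytorch/language-modeling',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyperparameters
)

huggingface_estimator.fit(inputs=inputs)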


OK, thanks - I have had a look and changed my train_file and validation_file to the S3 locations. I am now running into this error when I run the code (I have changed the bucket name when posting here):

InvalidSchema: No connection adapters were found for 's3://1111111111111-dev-gpt2-datasets/gpt-2/datasets/ft_input_data_sunday.txt'

I am quite confused by this. Here is how I am uploading my dataset to S3 - is it because I am not using the datasets.filesystems method of saving my data that it isn't working?

# Create an S3 client
s3 = boto3.client('s3')

# Define the S3 bucket and prefix where you want to save the dataset
s3_bucket_name = "1111111111111-dev-gpt2-datasets"
s3_bucket_path ='gpt-2/datasets/'

s3.upload_file('input_data/raw_data/ft_input_data.txt', s3_bucket_name, s3_bucket_path+'ft_input_data.txt')
s3.upload_file('input_data/raw_data/ft_input_data_eval.txt', s3_bucket_name, s3_bucket_path+'ft_input_data_eval.txt')
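
For comparison, the same upload could also be sketched with the SageMaker SDK's upload helper, which returns the s3:// URIs that are later passed to .fit() (this sketch reuses s3_bucket_name and the local file paths from the snippet above):

import sagemaker

sess = sagemaker.Session()

# upload_data copies the local file to S3 and returns its full s3:// URI
training_input_path = sess.upload_data(
    path='input_data/raw_data/ft_input_data.txt',
    bucket=s3_bucket_name,
    key_prefix='gpt-2/datasets'
)
test_input_path = sess.upload_data(
    path='input_data/raw_data/ft_input_data_eval.txt',
    bucket=s3_bucket_name,
    key_prefix='gpt-2/datasets'
)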

hyperparameters = {
    'model_name_or_path': 'gpt2',
    'output_dir': '/opt/ml/model',
    'train_file': 's3://1111111111111-dev-gpt2-datasets/gpt-2/datasets/ft_input_data.txt',
    'validation_file': 's3://1111111111111-dev-gpt2-datasets/gpt-2/datasets/ft_input_data.txt',
    'do_train': True,
    'do_eval': True,
    'per_device_eval_batch_size': 2,
    'per_device_train_batch_size': 2,
    'gradient_accumulation_steps': 8
}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.17.0'}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./examples/pytorch/language-modeling',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyperparameters,
    output_path=output_bucket,
    base_job_name='GPT2-v1'
)

# starting the train job
huggingface_estimator.fit(inputs={'training': 's3://1111111111111-dev-gpt2-datasets/gpt-2/datasets/ft_input_data.txt',
                                  'test': 's3://1111111111111-dev-gpt2-datasets/gpt-2/datasets/ft_input_data_eval.txt'})

Hi Eilidh

This is indeed very confusing at first, but it stems from the fact that the S3 locations will be mapped to local folders in the file system of the training job instance. @philschmid already solved this problem for another user here. You just need to adapt your hyperparameters for the training and validation files :slight_smile:

Cheers
Heiko

That has worked! Thank you @marshmellow77 and @philschmid

Just to reiterate the solution for clarity on this thread:

I had not defined train_file and validation_file correctly in my hyperparameters. By checking the CloudWatch logs I could see that

SM_CHANNEL_TEST=/opt/ml/input/data/test SM_CHANNEL_TRAIN=/opt/ml/input/data/training

where the folder names on the training instance (i.e. /opt/ml/…/…/test and /opt/ml/…/…/training) come from the keys used when calling huggingface_estimator.fit.
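
As an aside, here is a small sketch of how those mounted channel folders can be inspected from inside the training container; the SM_CHANNEL_* variable names mirror the channel keys, and the listed file names are just examples:

import os

# SageMaker sets one SM_CHANNEL_<KEY> variable per channel key passed to .fit(),
# pointing at the folder where that channel's data is mounted
train_dir = os.environ.get('SM_CHANNEL_TRAINING', '/opt/ml/input/data/training')
test_dir = os.environ.get('SM_CHANNEL_TEST', '/opt/ml/input/data/test')

print(os.listdir(train_dir))  # e.g. ['ft_input_data.txt']
print(os.listdir(test_dir))   # e.g. ['ft_input_data_eval.txt']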

# Here are the paths to my training and test datasets saved in S3
training_input_path = 's3://1111111111111-dev-gpt2-datasets/opt/ml/input/ft_input_data_sunday.txt'
test_input_path = 's3://1111111111111-dev-gpt2-datasets/opt/ml/input/ft_input_data_sunday_eval.txt'

hyperparameters = {
    'model_name_or_path': 'gpt2',
    'output_dir': '/opt/ml/model',
    'train_file': '/opt/ml/input/data/training/ft_input_data.txt',
    'validation_file': '/opt/ml/input/data/test/ft_input_data_eval.txt',
    'do_train': True,
    'do_eval': True,
    'per_device_eval_batch_size': 2,
    'per_device_train_batch_size': 2,
    'gradient_accumulation_steps': 8
}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.17.0'}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./examples/pytorch/language-modeling',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyperparameters,
    output_path=output_bucket,
    base_job_name='GPT2-v1'
)

# starting the train job
huggingface_estimator.fit(inputs={'training': training_input_path,
                                  'test': test_input_path})