SageMaker GPT-J train file error

import sagemaker
from sagemaker.huggingface import HuggingFace

# gets role for executing training job
role = sagemaker.get_execution_role()
hyperparameters = {
    'epochs': 1,
    'train_batch_size': 128,
    'model_name_or_path': 'EleutherAI/gpt-j-6B',
    'output_dir': '/opt/ml/model'
    # add your remaining hyperparameters
    # more info here https://github.com/huggingface/transformers/tree/v4.6.1/examples/pytorch/language-modeling
}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.6.1'}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./examples/pytorch/language-modeling',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    hyperparameters=hyperparameters
)

# starting the train job
huggingface_estimator.fit({'training': 's3://domain-gen-data/domain-gen-training.jsonl'})

Above is the code and below is the error:

2021-08-31 07:31:41 Starting - Starting the training job...
2021-08-31 07:32:07 Starting - Launching requested ML instancesProfilerReport-1630395096: InProgress
......
2021-08-31 07:33:08 Starting - Preparing the instances for training......
2021-08-31 07:34:08 Downloading - Downloading input data...
2021-08-31 07:34:28 Training - Downloading the training image..................
2021-08-31 07:37:33 Training - Training image download completed. Training in progress.bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-08-31 07:37:34,584 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-08-31 07:37:34,615 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-08-31 07:37:36,036 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-08-31 07:37:36,481 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/opt/conda/bin/python3.6 -m pip install -r requirements.txt
Requirement already satisfied: datasets>=1.1.3 in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 1)) (1.6.2)
Requirement already satisfied: sentencepiece!=0.1.92 in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 2)) (0.1.91)
Requirement already satisfied: protobuf in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 3)) (3.17.1)
Requirement already satisfied: multiprocess in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.70.11.1)
Requirement already satisfied: pyarrow>=1.0.0<4.0.0 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (4.0.0)
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (1.19.1)
Requirement already satisfied: xxhash in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (2.0.2)
Requirement already satisfied: tqdm<4.50.0,>=4.27 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (4.49.0)
Requirement already satisfied: huggingface-hub<0.1.0 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.0.8)
Requirement already satisfied: packaging in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (20.9)
Requirement already satisfied: importlib-metadata in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (4.0.1)
Requirement already satisfied: fsspec in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (2021.5.0)
Requirement already satisfied: requests>=2.19.0 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (2.25.1)
Requirement already satisfied: dataclasses in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.8)
Requirement already satisfied: pandas in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (1.1.5)
Requirement already satisfied: dill in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.3.3)
Requirement already satisfied: filelock in /opt/conda/lib/python3.6/site-packages (from huggingface-hub<0.1.0->datasets>=1.1.3->-r requirements.txt (line 1)) (3.0.12)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (2020.12.5)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (1.25.11)
Requirement already satisfied: chardet<5,>=3.0.2 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (2.10)
Requirement already satisfied: six>=1.9 in /opt/conda/lib/python3.6/site-packages (from protobuf->-r requirements.txt (line 3)) (1.16.0)
Requirement already satisfied: typing-extensions>=3.6.4 in /opt/conda/lib/python3.6/site-packages (from importlib-metadata->datasets>=1.1.3->-r requirements.txt (line 1)) (3.10.0.0)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.6/site-packages (from importlib-metadata->datasets>=1.1.3->-r requirements.txt (line 1)) (3.4.1)
Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.6/site-packages (from packaging->datasets>=1.1.3->-r requirements.txt (line 1)) (2.4.7)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.6/site-packages (from pandas->datasets>=1.1.3->-r requirements.txt (line 1)) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.6/site-packages (from pandas->datasets>=1.1.3->-r requirements.txt (line 1)) (2021.1)
WARNING: Running pip as root will break packages and permissions. You should install packages reliably by using venv: https://pip.pypa.io/warnings/venv

2021-08-31 07:37:39,041 sagemaker-training-toolkit INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "train_batch_size": 128,
        "output_dir": "/opt/ml/model",
        "epochs": 1,
        "model_name_or_path": "EleutherAI/gpt-j-6B"
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "huggingface-pytorch-training-2021-08-31-07-31-36-059",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-east-1-765248384165/huggingface-pytorch-training-2021-08-31-07-31-36-059/source/sourcedir.tar.gz",
    "module_name": "run_clm",
    "network_interface_name": "eth0",
    "num_cpus": 8,
    "num_gpus": 1,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "run_clm.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"epochs":1,"model_name_or_path":"EleutherAI/gpt-j-6B","output_dir":"/opt/ml/model","train_batch_size":128}
SM_USER_ENTRY_POINT=run_clm.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=run_clm
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=8
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-east-1-765248384165/huggingface-pytorch-training-2021-08-31-07-31-36-059/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"epochs":1,"model_name_or_path":"EleutherAI/gpt-j-6B","output_dir":"/opt/ml/model","train_batch_size":128},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"huggingface-pytorch-training-2021-08-31-07-31-36-059","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-1-765248384165/huggingface-pytorch-training-2021-08-31-07-31-36-059/source/sourcedir.tar.gz","module_name":"run_clm","network_interface_name":"eth0","num_cpus":8,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"run_clm.py"}
SM_USER_ARGS=["--epochs","1","--model_name_or_path","EleutherAI/gpt-j-6B","--output_dir","/opt/ml/model","--train_batch_size","128"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_TRAIN_BATCH_SIZE=128
SM_HP_OUTPUT_DIR=/opt/ml/model
SM_HP_EPOCHS=1
SM_HP_MODEL_NAME_OR_PATH=EleutherAI/gpt-j-6B
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages

Invoking script with the following command:

/opt/conda/bin/python3.6 run_clm.py --epochs 1 --model_name_or_path EleutherAI/gpt-j-6B --output_dir /opt/ml/model --train_batch_size 128


Traceback (most recent call last):
  File "run_clm.py", line 468, in <module>
    main()
  File "run_clm.py", line 182, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/opt/conda/lib/python3.6/site-packages/transformers/hf_argparser.py", line 187, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 12, in __init__
  File "run_clm.py", line 161, in __post_init__
    raise ValueError("Need either a dataset name or a training/validation file.")
ValueError: Need either a dataset name or a training/validation file.

2021-08-31 07:37:44,472 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 run_clm.py --epochs 1 --model_name_or_path EleutherAI/gpt-j-6B --output_dir /opt/ml/model --train_batch_size 128"
Traceback (most recent call last):
  File "run_clm.py", line 468, in <module>
    main()
  File "run_clm.py", line 182, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/opt/conda/lib/python3.6/site-packages/transformers/hf_argparser.py", line 187, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 12, in __init__
  File "run_clm.py", line 161, in __post_init__
    raise ValueError("Need either a dataset name or a training/validation file.")
ValueError: Need either a dataset name or a training/validation file.

2021-08-31 07:37:49 Uploading - Uploading generated training model
2021-08-31 07:37:49 Failed - Training job failed
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-5-33c7f1decb60> in <module>
     31 
     32 # starting the train job
---> 33 huggingface_estimator.fit({'training': 's3://domain-gen-data/domain-gen-training.jsonl'})

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    680         self.jobs.append(self.latest_training_job)
    681         if wait:
--> 682             self.latest_training_job.wait(logs=logs)
    683 
    684     def _compilation_job_name(self):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1625         # If logs are requested, call logs_for_jobs.
   1626         if logs != "None":
-> 1627             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1628         else:
   1629             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3731 
   3732         if wait:
-> 3733             self._check_job_status(job_name, description, "TrainingJobStatus")
   3734             if dot:
   3735                 print()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3291                 ),
   3292                 allowed_statuses=["Completed", "Stopped"],
-> 3293                 actual_status=status,
   3294             )
   3295 

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2021-08-31-07-31-36-059: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 run_clm.py --epochs 1 --model_name_or_path EleutherAI/gpt-j-6B --output_dir /opt/ml/model --train_batch_size 128"
Traceback (most recent call last):
  File "run_clm.py", line 468, in <module>
    main()
  File "run_clm.py", line 182, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/opt/conda/lib/python3.6/site-packages/transformers/hf_argparser.py", line 187, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 12, in __init__
  File "run_clm.py", line 161, in __post_init__
    raise ValueError("Need either a dataset name or a training/validation file.")
ValueError: Need either a dataset name or a training/validation file.

Hello @danurahul,

Thanks for opening the thread. EleutherAI/gpt-j-6B is not yet trainable with Amazon SageMaker, since the GPT-J PR has not yet been merged into transformers. Once it is merged, we either need to update the DLC or you have to include the new version of transformers in the requirements.txt.

In addition, GPT-J-6B is 22GB in size and won’t fit on a single ml.p3.2xlarge instance. Once the PR is merged, it would be possible to train it with distributed training, see Run training on Amazon SageMaker.

Also, even after adjusting the two points above, your script would still not work, since you are missing a few hyperparameters. The most crucial one is train_file, which should point to your input file; in your case that would be:
/opt/ml/input/data/training/domain-gen-training.jsonl
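
For illustration, a minimal sketch of how the hyperparameters could then look (values are placeholders; note that run_clm.py expects the TrainingArguments names such as num_train_epochs and per_device_train_batch_size rather than epochs and train_batch_size):

hyperparameters = {
    'model_name_or_path': 'EleutherAI/gpt-j-6B',
    'output_dir': '/opt/ml/model',
    # run_clm.py needs either a dataset name or an explicit train_file;
    # the "training" channel is mounted at /opt/ml/input/data/training
    'train_file': '/opt/ml/input/data/training/domain-gen-training.jsonl',
    'do_train': True,
    'num_train_epochs': 1,             # placeholder value
    'per_device_train_batch_size': 1,  # placeholder value
}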

Getting an out-of-memory error with distilgpt2 as well as gpt-neo-125M on an ml.p3.2xlarge AWS SageMaker instance.
code:

import sagemaker
from sagemaker.huggingface import HuggingFace
# gets role for executing training job
role = sagemaker.get_execution_role()
hyperparameters = {
    'per_device_train_batch_size': 16,
    'train_file': '/opt/ml/input/data/training/domain-gen-training.csv',
    'model_name_or_path': 'distilgpt2',
    'output_dir': '/opt/ml/model',
    'do_train': True,
    'do_eval': False,
    #'data_dir':'/opt/ml/input/data/training'
}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.6.1'}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./examples/pytorch/language-modeling',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    hyperparameters=hyperparameters,
)

# starting the train job
huggingface_estimator.fit({'training': 's3://domain-gen-data/domain-gen-training.csv'})

error:

[INFO|trainer.py:1156] 2021-08-31 10:43:00,451 >> ***** Running training *****
[INFO|trainer.py:1157] 2021-08-31 10:43:00,452 >>   Num examples = 13083
[INFO|trainer.py:1158] 2021-08-31 10:43:00,452 >>   Num Epochs = 3
[INFO|trainer.py:1159] 2021-08-31 10:43:00,452 >>   Instantaneous batch size per device = 32
[INFO|trainer.py:1160] 2021-08-31 10:43:00,452 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1161] 2021-08-31 10:43:00,452 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1162] 2021-08-31 10:43:00,453 >>   Total optimization steps = 1227
#015  0%|          | 0/1227 [00:00<?, ?it/s]Traceback (most recent call last):
  File "run_clm.py", line 468, in <module>
    main()
  File "run_clm.py", line 421, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1272, in train
    tr_loss += self.training_step(model, inputs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1734, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1766, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 756, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 954, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 756, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 797, in forward
    output_attentions=output_attentions,
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 756, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 354, in forward
    feed_forward_hidden_states = self.mlp(hidden_states)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 756, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 282, in forward
    hidden_states = self.act(hidden_states)
  File "/opt/conda/lib/python3.6/site-packages/transformers/activations.py", line 42, in gelu_new
[INFO|trainer.py:1156] 2021-08-31 10:43:00,451 >> ***** Running training *****
[INFO|trainer.py:1157] 2021-08-31 10:43:00,452 >>   Num examples = 13083
[INFO|trainer.py:1158] 2021-08-31 10:43:00,452 >>   Num Epochs = 3
[INFO|trainer.py:1159] 2021-08-31 10:43:00,452 >>   Instantaneous batch size per device = 32
[INFO|trainer.py:1160] 2021-08-31 10:43:00,452 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1161] 2021-08-31 10:43:00,452 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1162] 2021-08-31 10:43:00,453 >>   Total optimization steps = 1227
#015  0%|          | 0/1227 [00:00<?, ?it/s]Traceback (most recent call last):
  File "run_clm.py", line 468, in <module>
    main()
  File "run_clm.py", line 421, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1272, in train
    tr_loss += self.training_step(model, inputs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1734, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1766, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 756, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 954, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 756, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 797, in forward
    output_attentions=output_attentions,
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 756, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 354, in forward
    feed_forward_hidden_states = self.mlp(hidden_states)
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 15.78 GiB total capacity; 14.66 GiB already allocated; 156.75 MiB free; 14.70 GiB reserved in total by PyTorch)
#015  0%|          | 0/1227 [00:02<?, ?it/s]
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 756, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 282, in forward
    hidden_states = self.act(hidden_states)
  File "/opt/conda/lib/python3.6/site-packages/transformers/activations.py", line 42, in gelu_new
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 15.78 GiB total capacity; 14.66 GiB already allocated; 156.75 MiB free; 14.70 GiB reserved in total by PyTorch)
#015  0%|          | 0/1227 [00:02<?, ?it/s]


2021-08-31 10:43:11 Uploading - Uploading generated training model
2021-08-31 10:43:11 Failed - Training job failed
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-4-80ef9c507021> in <module>
     31 
     32 # starting the train job
---> 33 huggingface_estimator.fit({'training': 's3://domain-gen-data/domain-gen-training.csv'})

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    680         self.jobs.append(self.latest_training_job)
    681         if wait:
--> 682             self.latest_training_job.wait(logs=logs)
    683 
    684     def _compilation_job_name(self):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1625         # If logs are requested, call logs_for_jobs.
   1626         if logs != "None":
-> 1627             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1628         else:
   1629             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3731 
   3732         if wait:
-> 3733             self._check_job_status(job_name, description, "TrainingJobStatus")
   3734             if dot:
   3735                 print()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3291                 ),
   3292                 allowed_statuses=["Completed", "Stopped"],
-> 3293                 actual_status=status,
   3294             )
   3295 

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2021-08-31-10-34-40-424: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 run_clm.py --do_eval False --do_train True --model_name_or_path distilgpt2 --output_dir /opt/ml/model --per_device_train_batch_size 32 --train_file /opt/ml/input/data/training/domain-gen-training.csv"

0 tables [00:00, ? tables/s]
6 tables [00:00, 53.43 tables/s]
15 tables [00:00, 59.52 tables/s]
24 tables [00:00, 64.73 tables/s]
33 tables [00:00, 68.73 tables/s]
42 tables [00:00, 72.00 tables/s]
50 tables [00:00, 73.93 tables/s]
59 tables [00:00, 75.86 tables/s]
                                 
[INFO|file_utils.py:1532] 2021-08-31 10:41:10,579 >> https://huggingface.co/distilgpt2/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpx0ysz91j

Downloading:   0%|          | 0.00/762 [00:00<?, ?B/s]
Downloading: 100%|██████████| 762/762 [00:00<00:00, 677kB/s]
[INFO|file_utils.py:1536] 2021-08-31 10:41:10,600 >> storing https://huggingface.co/

Hey @danurahul,

Can you please provide the full logs?
If it is a CUDA OOM, then your per_device_train_batch_size is too big for the GPU. How big is your dataset?
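
If it does turn out to be a GPU memory issue, a rough sketch of hyperparameters that usually lowers the memory footprint of run_clm.py (the values are only a starting point, not a tuned recipe):

hyperparameters = {
    'model_name_or_path': 'distilgpt2',
    'train_file': '/opt/ml/input/data/training/domain-gen-training.csv',
    'output_dir': '/opt/ml/model',
    'do_train': True,
    # a smaller per-device batch plus gradient accumulation keeps the
    # effective batch size while reducing peak GPU memory
    'per_device_train_batch_size': 2,
    'gradient_accumulation_steps': 8,
    # mixed precision roughly halves activation memory on a V100
    'fp16': True,
}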

The dataset is only 53 MB.

Okay, strange. Could you please share the whole CloudWatch output?

Here it is:

timestamp message
1630418777743 Generating a 2048 bit RSA private key
1630418777743 …+++
1630418777743 …+++
1630418777743 writing new private key to ‘/home/ec2-user/.jupyter/notebookkey.key’
1630418777743 -----
1630418777743 + echo ‘Self-signed certificate generated.’
1630418777743 + echo ‘Launching Jupyter server with csrf check enabled…’
1630418777743 + exec su -s /bin/sh -l -c 'source activate JupyterSystemEnv && exec "$0" "$@"' ec2-user -- jupyter notebook --notebook-dir=/home/ec2-user/SageMaker/ --ip=0.0.0.0 --NotebookApp.token=[REDACTED]
1630418777743 [I 14:05:09.190 NotebookApp] Using EnvironmentKernelSpecManager…
1630418777743 [I 14:05:09.190 NotebookApp] Started periodic updates of the kernel list (every 3 minutes).
1630418777743 [I 14:05:11.691 NotebookApp] Writing notebook server cookie secret to /home/ec2-user/.local/share/jupyter/runtime/notebook_cookie_secret
1630418777743 [W 14:05:14.575 NotebookApp] All authentication is disabled. Anyone who can connect to this server will be able to run code.
1630418777743 [I 14:05:19.422 NotebookApp] JupyterLab extension loaded from /home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/jupyterlab
1630418777743 [I 14:05:19.422 NotebookApp] JupyterLab application directory is /home/ec2-user/anaconda3/envs/JupyterSystemEnv/share/jupyter/lab
1630418777743 [I 14:05:21.724 NotebookApp] [nb_conda] enabled
1630418777743 [I 14:05:28.092 NotebookApp] sparkmagic extension enabled!
1630418777743 [I 14:05:28.092 NotebookApp] Serving notebooks from local directory: /home/ec2-user/SageMaker
1630418777743 [I 14:05:28.092 NotebookApp] The Jupyter Notebook is running at:
1630418777743 [I 14:05:28.092 NotebookApp] https://(ip-172-16-6-49 or 127.0.0.1):8443/
1630418777743 [I 14:05:28.092 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
1630418777743 [W 14:05:28.130 NotebookApp] No web browser found: could not locate runnable browser.
1630418777743 [I 14:05:28.132 NotebookApp] Starting initial scan of virtual environments…
1630418783041 [I 14:06:06.178 NotebookApp] Found new kernels in environments: conda_chainer_p27, conda_amazonei_mxnet_p36, conda_chainer_p36, conda_amazonei_pytorch_latest_p36, conda_mxnet_latest_p37, conda_tensorflow_p36, conda_python3, conda_amazonei_tensorflow2_p36, conda_amazonei_tensorflow_p27, conda_pytorch_p27, conda_mxnet_p36, conda_amazonei_tensorflow_p36, conda_tensorflow2_p36, conda_mxnet_p27, conda_tensorflow_p27, conda_amazonei_mxnet_p27, conda_pytorch_latest_p36, conda_amazonei_tensorflow2_p27, conda_python2, conda_pytorch_p36
1630418834520 [I 14:07:14.275 NotebookApp] Saving file at /GPT-j.ipynb
1630418838743 [W 14:07:14.287 NotebookApp] Notebook GPT-j.ipynb is not trusted
1630418890743 [I 14:08:06.513 NotebookApp] Build is up to date
1630419004359 [I 14:10:04.244 NotebookApp] Writing notebook-signing key to /home/ec2-user/.local/share/jupyter/notebook_secret
1630419007361 [W 14:10:04.245 NotebookApp] Notebook GPT-j.ipynb is not trusted
1630419009418 [I 14:10:07.252 NotebookApp] Kernel started: bb7b7103-1590-439c-bca6-9a3540c05147
1630419009418 [W 14:10:09.138 NotebookApp] No session ID specified
1630419009418 [W 14:10:09.158 NotebookApp] No session ID specified
1630419009418 [W 14:10:09.177 NotebookApp] No session ID specified
1630419009612 [W 14:10:09.251 NotebookApp] No session ID specified
1630419009862 [W 14:10:09.575 NotebookApp] No session ID specified
1630419009862 [W 14:10:09.750 NotebookApp] No session ID specified
1630419009862 [W 14:10:09.763 NotebookApp] No session ID specified
1630419014743 [W 14:10:09.771 NotebookApp] No session ID specified
1630419031743 Cloning into ‘/tmp/tmpsf8lobqj’…
1630419033124 Note: checking out ‘v4.6.1’.
1630419033124 You are in ‘detached HEAD’ state. You can look around, make experimental
1630419033124 changes and commit them, and you can discard any commits you make in this
1630419033124 state without impacting any branches by performing another checkout.
1630419033124 If you want to create a new branch to retain commits you create, you may
1630419033124 do so (now or later) by using -b with the checkout command again. Example: git checkout -b
1630419037743 HEAD is now at fb27b276e Release: v4.6.1
1630419125423 [I 14:12:05.321 NotebookApp] Saving file at /GPT-j.ipynb
1630419129743 [W 14:12:05.321 NotebookApp] Notebook GPT-j.ipynb is not trusted
1630419246237 [I 14:14:06.126 NotebookApp] Saving file at /GPT-j.ipynb
1630419250743 [W 14:14:06.126 NotebookApp] Notebook GPT-j.ipynb is not trusted

I was referring to the Cloudwatch logs of the SageMaker Training Job, not of your notebook.

Sorry about that. Where can I find them? There are only jupyter.log files.

Go to the AWS management console → Amazon SageMaker → Training → Training Jobs → open the job → scroll down → view logs.
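
If you prefer to pull them programmatically, a small boto3 sketch along these lines should also work (it assumes the default /aws/sagemaker/TrainingJobs log group and only fetches the first page of events):

import boto3

logs = boto3.client('logs')
job_name = 'huggingface-pytorch-training-2021-08-31-10-34-40-424'  # your failed job

# training job logs live in the /aws/sagemaker/TrainingJobs log group,
# in streams prefixed with the job name
response = logs.filter_log_events(
    logGroupName='/aws/sagemaker/TrainingJobs',
    logStreamNamePrefix=job_name,
)
for event in response['events']:
    print(event['message'])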

timestamp,message
1630410832704,"[INFO|tokenization_utils_base.py:1930] 2021-08-31 11:53:52,698 >> Special tokens file saved in /opt/ml/model/checkpoint-8500/special_tokens_map.json"
1630410914726,"{‘loss’: 2.9294, ‘learning_rate’: 3.853473973859207e-05, ‘epoch’: 0.69}"
1630410914726,"[INFO|trainer.py:1885] 2021-08-31 11:55:13,941 >> Saving model checkpoint to /opt/ml/model/checkpoint-9000"
1630410914726,"[INFO|configuration_utils.py:351] 2021-08-31 11:55:13,943 >> Configuration saved in /opt/ml/model/checkpoint-9000/config.json"
1630410915726,"[INFO|modeling_utils.py:889] 2021-08-31 11:55:14,871 >> Model weights saved in /opt/ml/model/checkpoint-9000/pytorch_model.bin"
1630410915726,"[INFO|tokenization_utils_base.py:1924] 2021-08-31 11:55:14,872 >> tokenizer config file saved in /opt/ml/model/checkpoint-9000/tokenizer_config.json"
1630410915726,"[INFO|tokenization_utils_base.py:1930] 2021-08-31 11:55:14,873 >> Special tokens file saved in /opt/ml/model/checkpoint-9000/special_tokens_map.json"
1630410995789,"{‘loss’: 2.9154, ‘learning_rate’: 3.789778083518052e-05, ‘epoch’: 0.73}"
1630410995789,"[INFO|trainer.py:1885] 2021-08-31 11:56:35,764 >> Saving model checkpoint to /opt/ml/model/checkpoint-9500"
1630410995789,"[INFO|configuration_utils.py:351] 2021-08-31 11:56:35,766 >> Configuration saved in /opt/ml/model/checkpoint-9500/config.json"
1630410996790,"[INFO|modeling_utils.py:889] 2021-08-31 11:56:36,712 >> Model weights saved in /opt/ml/model/checkpoint-9500/pytorch_model.bin"
1630410996790,"[INFO|tokenization_utils_base.py:1924] 2021-08-31 11:56:36,713 >> tokenizer config file saved in /opt/ml/model/checkpoint-9500/tokenizer_config.json"
1630410996790,"[INFO|tokenization_utils_base.py:1930] 2021-08-31 11:56:36,714 >> Special tokens file saved in /opt/ml/model/checkpoint-9500/special_tokens_map.json"
1630411078812,"{‘loss’: 2.9035, ‘learning_rate’: 3.726082193176896e-05, ‘epoch’: 0.76}"
1630411078813,"[INFO|trainer.py:1885] 2021-08-31 11:57:57,906 >> Saving model checkpoint to /opt/ml/model/checkpoint-10000"
1630411078813,"[INFO|configuration_utils.py:351] 2021-08-31 11:57:57,908 >> Configuration saved in /opt/ml/model/checkpoint-10000/config.json"
1630411079813,"[INFO|modeling_utils.py:889] 2021-08-31 11:57:58,838 >> Model weights saved in /opt/ml/model/checkpoint-10000/pytorch_model.bin"
1630411079813,"[INFO|tokenization_utils_base.py:1924] 2021-08-31 11:57:58,839 >> tokenizer config file saved in /opt/ml/model/checkpoint-10000/tokenizer_config.json"
1630411079813,"[INFO|tokenization_utils_base.py:1930] 2021-08-31 11:57:58,840 >> Special tokens file saved in /opt/ml/model/checkpoint-10000/special_tokens_map.json"
1630411160835,"{‘loss’: 2.8886, ‘learning_rate’: 3.6623863028357415e-05, ‘epoch’: 0.8}"
1630411160835,"[INFO|trainer.py:1885] 2021-08-31 11:59:20,191 >> Saving model checkpoint to /opt/ml/model/checkpoint-10500"
1630411160835,"[INFO|configuration_utils.py:351] 2021-08-31 11:59:20,192 >> Configuration saved in /opt/ml/model/checkpoint-10500/config.json"
1630411161835,"[INFO|modeling_utils.py:889] 2021-08-31 11:59:21,143 >> Model weights saved in /opt/ml/model/checkpoint-10500/pytorch_model.bin"
1630411161836,"[INFO|tokenization_utils_base.py:1924] 2021-08-31 11:59:21,143 >> tokenizer config file saved in /opt/ml/model/checkpoint-10500/tokenizer_config.json"
1630411161836,"[INFO|tokenization_utils_base.py:1930] 2021-08-31 11:59:21,144 >> Special tokens file saved in /opt/ml/model/checkpoint-10500/special_tokens_map.json"
1630411242864,"{‘loss’: 2.8759, ‘learning_rate’: 3.598690412494586e-05, ‘epoch’: 0.84}"
1630411242864,"[INFO|trainer.py:1885] 2021-08-31 12:00:42,066 >> Saving model checkpoint to /opt/ml/model/checkpoint-11000"
1630411242864,"[INFO|configuration_utils.py:351] 2021-08-31 12:00:42,067 >> Configuration saved in /opt/ml/model/checkpoint-11000/config.json"
1630411243865,"[INFO|modeling_utils.py:889] 2021-08-31 12:00:42,995 >> Model weights saved in /opt/ml/model/checkpoint-11000/pytorch_model.bin"
1630411243865,"[INFO|tokenization_utils_base.py:1924] 2021-08-31 12:00:42,996 >> tokenizer config file saved in /opt/ml/model/checkpoint-11000/tokenizer_config.json"
1630411243865,"[INFO|tokenization_utils_base.py:1930] 2021-08-31 12:00:42,996 >> Special tokens file saved in /opt/ml/model/checkpoint-11000/special_tokens_map.json"
1630411324887,"{‘loss’: 2.8684, ‘learning_rate’: 3.5349945221534306e-05, ‘epoch’: 0.88}"
1630411324887,"[INFO|trainer.py:1885] 2021-08-31 12:02:04,269 >> Saving model checkpoint to /opt/ml/model/checkpoint-11500"
1630411324887,"[INFO|configuration_utils.py:351] 2021-08-31 12:02:04,270 >> Configuration saved in /opt/ml/model/checkpoint-11500/config.json"
1630411325887,"[INFO|modeling_utils.py:889] 2021-08-31 12:02:05,227 >> Model weights saved in /opt/ml/model/checkpoint-11500/pytorch_model.bin"
1630411325887,"[INFO|tokenization_utils_base.py:1924] 2021-08-31 12:02:05,228 >> tokenizer config file saved in /opt/ml/model/checkpoint-11500/tokenizer_config.json"
1630411325887,"[INFO|tokenization_utils_base.py:1930] 2021-08-31 12:02:05,229 >> Special tokens file saved in /opt/ml/model/checkpoint-11500/special_tokens_map.json"
1630411405909,"{‘loss’: 2.8542, ‘learning_rate’: 3.471298631812276e-05, ‘epoch’: 0.92}"
1630411405909,"[INFO|trainer.py:1885] 2021-08-31 12:03:25,866 >> Saving model checkpoint to /opt/ml/model/checkpoint-12000"
1630411405909,"[INFO|configuration_utils.py:351] 2021-08-31 12:03:25,867 >> Configuration saved in /opt/ml/model/checkpoint-12000/config.json"
1630411406909,"[INFO|modeling_utils.py:889] 2021-08-31 12:03:26,821 >> Model weights saved in /opt/ml/model/checkpoint-12000/pytorch_model.bin"
1630411406909,"[INFO|tokenization_utils_base.py:1924] 2021-08-31 12:03:26,823 >> tokenizer config file saved in /opt/ml/model/checkpoint-12000/tokenizer_config.json"
1630411406909,"[INFO|tokenization_utils_base.py:1930] 2021-08-31 12:03:26,823 >> Special tokens file saved in /opt/ml/model/checkpoint-12000/special_tokens_map.json"
1630411488932,"{‘loss’: 2.8424, ‘learning_rate’: 3.4076027414711205e-05, ‘epoch’: 0.96}"
1630411488932,"[INFO|trainer.py:1885] 2021-08-31 12:04:47,986 >> Saving model checkpoint to /opt/ml/model/checkpoint-12500"
1630411488932,"[INFO|configuration_utils.py:351] 2021-08-31 12:04:47,987 >> Configuration saved in /opt/ml/model/checkpoint-12500/config.json"
1630411488932,"[INFO|modeling_utils.py:889] 2021-08-31 12:04:48,930 >> Model weights saved in /opt/ml/model/checkpoint-12500/pytorch_model.bin"
1630411488932,"[INFO|tokenization_utils_base.py:1924] 2021-08-31 12:04:48,931 >> tokenizer config file saved in /opt/ml/model/checkpoint-12500/tokenizer_config.json"
1630411488932,"[INFO|tokenization_utils_base.py:1930] 2021-08-31 12:04:48,932 >> Special tokens file saved in /opt/ml/model/checkpoint-12500/special_tokens_map.json"
1630411571119,"{‘loss’: 2.8325, ‘learning_rate’: 3.343906851129966e-05, ‘epoch’: 0.99}"
1630411571119,"[INFO|trainer.py:1885] 2021-08-31 12:06:10,377 >> Saving model checkpoint to /opt/ml/model/checkpoint-13000"
1630411571119,"[INFO|configuration_utils.py:351] 2021-08-31 12:06:10,378 >> Configuration saved in /opt/ml/model/checkpoint-13000/config.json"
1630411572119,"[INFO|modeling_utils.py:889] 2021-08-31 12:06:11,310 >> Model weights saved in /opt/ml/model/checkpoint-13000/pytorch_model.bin"
1630411572119,"[INFO|tokenization_utils_base.py:1924] 2021-08-31 12:06:11,311 >> tokenizer config file saved in /opt/ml/model/checkpoint-13000/tokenizer_config.json"
1630411572119,"[INFO|tokenization_utils_base.py:1930] 2021-08-31 12:06:11,311 >> Special tokens file saved in /opt/ml/model/checkpoint-13000/special_tokens_map.json"
1630411653265,"{‘loss’: 2.7944, ‘learning_rate’: 3.2802109607888096e-05, ‘epoch’: 1.03}"
1630411653265,"[INFO|trainer.py:1885] 2021-08-31 12:07:32,619 >> Saving model checkpoint to /opt/ml/model/checkpoint-13500"
1630411653265,"[INFO|configuration_utils.py:351] 2021-08-31 12:07:32,620 >> Configuration saved in /opt/ml/model/checkpoint-13500/config.json"
1630411654265,"[INFO|modeling_utils.py:889] 2021-08-31 12:07:33,546 >> Model weights saved in /opt/ml/model/checkpoint-13500/pytorch_model.bin"
1630411654266,"[INFO|tokenization_utils_base.py:1924] 2021-08-31 12:07:33,547 >> tokenizer config file saved in /opt/ml/model/checkpoint-13500/tokenizer_config.json"
1630411654266,"[INFO|tokenization_utils_base.py:1930] 2021-08-31 12:07:33,547 >> Special tokens file saved in /opt/ml/model/checkpoint-13500/special_tokens_map.json"
1630411735423,"{‘loss’: 2.7854, ‘learning_rate’: 3.216515070447655e-05, ‘epoch’: 1.07}"
1630411735423,"[INFO|trainer.py:1885] 2021-08-31 12:08:54,630 >> Saving model checkpoint to /opt/ml/model/checkpoint-14000"
1630411735423,"[INFO|configuration_utils.py:351] 2021-08-31 12:08:54,631 >> Configuration saved in /opt/ml/model/checkpoint-14000/config.json"
1630411736424,"[INFO|modeling_utils.py:889] 2021-08-31 12:08:55,572 >> Model weights saved in /opt/ml/model/checkpoint-14000/pytorch_model.bin"
1630411736424,"[INFO|tokenization_utils_base.py:1924] 2021-08-31 12:08:55,573 >> tokenizer config file saved in /opt/ml/model/checkpoint-14000/tokenizer_config.json"
1630411736424,"[INFO|tokenization_utils_base.py:1930] 2021-08-31 12:08:55,573 >> Special tokens file saved in /opt/ml/model/checkpoint-14000/special_tokens_map.json"
1630411817445,"{‘loss’: 2.7737, ‘learning_rate’: 3.1528191801065e-05, ‘epoch’: 1.11}"
1630411817445,"[INFO|trainer.py:1885] 2021-08-31 12:10:16,697 >> Saving model checkpoint to /opt/ml/model/checkpoint-14500"
1630411817445,"[INFO|configuration_utils.py:351] 2021-08-31 12:10:16,699 >> Configuration saved in /opt/ml/model/checkpoint-14500/config.json"
1630411818445,"[INFO|modeling_utils.py:889] 2021-08-31 12:10:17,641 >> Model weights saved in /opt/ml/model/checkpoint-14500/pytorch_model.bin"
1630411818445,"[INFO|tokenization_utils_base.py:1924] 2021-08-31 12:10:17,641 >> tokenizer config file saved in /opt/ml/model/checkpoint-14500/tokenizer_config.json"
1630411818445,"[INFO|tokenization_utils_base.py:1930] 2021-08-31 12:10:17,642 >> Special tokens file saved in /opt/ml/model/checkpoint-14500/special_tokens_map.json"
1630411899466,"{‘loss’: 2.7652, ‘learning_rate’: 3.0891232897653446e-05, ‘epoch’: 1.15}"
1630411899467,"[INFO|trainer.py:1885] 2021-08-31 12:11:38,697 >> Saving model checkpoint to /opt/ml/model/checkpoint-15000"
1630411899467,"[INFO|configuration_utils.py:351] 2021-08-31 12:11:38,698 >> Configuration saved in /opt/ml/model/checkpoint-15000/config.json"
1630411900467,"[INFO|modeling_utils.py:889] 2021-08-31 12:11:39,673 >> Model weights saved in /opt/ml/model/checkpoint-15000/pytorch_model.bin"
1630411900467,"[INFO|tokenization_utils_base.py:1924] 2021-08-31 12:11:39,674 >> tokenizer config file saved in /opt/ml/model/checkpoint-15000/tokenizer_config.json"
1630411900467,"[INFO|tokenization_utils_base.py:1930] 2021-08-31 12:11:39,674 >> Special tokens file saved in /opt/ml/model/checkpoint-15000/special_tokens_map.json"
1630411981504,"{‘loss’: 2.7593, ‘learning_rate’: 3.0254273994241895e-05, ‘epoch’: 1.18}"
1630411981504,"[INFO|trainer.py:1885] 2021-08-31 12:13:00,859 >> Saving model checkpoint to /opt/ml/model/checkpoint-15500"
1630411981504,"[INFO|configuration_utils.py:351] 2021-08-31 12:13:00,860 >> Configuration saved in /opt/ml/model/checkpoint-15500/config.json"
1630411982505,"[INFO|modeling_utils.py:889] 2021-08-31 12:13:01,856 >> Model weights saved in /opt/ml/model/checkpoint-15500/pytorch_model.bin"
1630411982505,"[INFO|tokenization_utils_base.py:1924] 2021-08-31 12:13:01,857 >> tokenizer config file saved in /opt/ml/model/checkpoint-15500/tokenizer_config.json"
1630411982505,"[INFO|tokenization_utils_base.py:1930] 2021-08-31 12:13:01,858 >> Special tokens file saved in /opt/ml/model/checkpoint-15500/special_tokens_map.json"
1630412063526,"{‘loss’: 2.7469, ‘learning_rate’: 2.9617315090830338e-05, ‘epoch’: 1.22}"
1630412063526,"[INFO|trainer.py:1885] 2021-08-31 12:14:22,985 >> Saving model checkpoint to /opt/ml/model/checkpoint-16000"
1630412063526,"[INFO|configuration_utils.py:351] 2021-08-31 12:14:22,986 >> Configuration saved in /opt/ml/model/checkpoint-16000/config.json"
1630412064526,"[INFO|modeling_utils.py:889] 2021-08-31 12:14:23,943 >> Model weights saved in /opt/ml/model/checkpoint-16000/pytorch_model.bin"
1630412064526,"[INFO|tokenization_utils_base.py:1924] 2021-08-31 12:14:23,944 >> tokenizer config file saved in /opt/ml/model/checkpoint-16000/tokenizer_config.json"
1630412064526,"[INFO|tokenization_utils_base.py:1930] 2021-08-31 12:14:23,944 >> Special tokens file saved in /opt/ml/model/checkpoint-16000/special_tokens_map.json"
1630412066527,"#0150 tables [00:00, ? tables/s]#0155 tables [00:00, 47.79 tables/s]#01513 tables [00:00, 54.28 tables/s]#01521 tables [00:00, 60.07 tables/s]#01529 tables [00:00, 64.80 tables/s]#01537 tables [00:00, 68.13 tables/s]#01545 tables [00:00, 71.24 tables/s]#01553 tables [00:00, 73.64 tables/s]#015 #015[INFO|file_utils.py:1532] 2021-08-31 11:28:48,440 >> https://huggingface.co/distilgpt2/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpjs_xy61y"
1630412066527,"#015Downloading: 0%| | 0.00/762 [00:00<?, ?B/s]#015Downloading: 100%|██████████| 762/762 [00:00<00:00, 997kB/s]"
1630412066527,"[INFO|file_utils.py:1536] 2021-08-31 11:28:48,460 >> storing https://huggingface.co/distilgpt2/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/f985248d2791fcff97732e4ee263617adec1edb5429a2b8421734c6d14e39bee.422318838d1ec4e061efb4ea29671cb2a044e244dc69229682bebd7cacc81631"
1630412066527,"[INFO|file_utils.py:1544] 2021-08-31 11:28:48,460 >> creating metadata file for /root/.cache/huggingface/transformers/f985248d2791fcff97732e4ee263617adec1edb5429a2b8421734c6d14e39bee.422318838d1ec4e061efb4ea29671cb2a044e244dc69229682bebd7cacc81631"
1630412066527,"[INFO|configuration_utils.py:517] 2021-08-31 11:28:48,460 >> loading configuration file https://huggingface.co/distilgpt2/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/f985248d2791fcff97732e4ee263617adec1edb5429a2b8421734c6d14e39bee.422318838d1ec4e061efb4ea29671cb2a044e244dc69229682bebd7cacc81631"
1630412066527,"[INFO|configuration_utils.py:553] 2021-08-31 11:28:48,461 >> Model config GPT2Config {
“”_num_labels"": 1,
““activation_function””: ““gelu_new””,
““architectures””: [
““GPT2LMHeadModel””
],
““attn_pdrop””: 0.1,
““bos_token_id””: 50256,
““embd_pdrop””: 0.1,
““eos_token_id””: 50256,
““gradient_checkpointing””: false,
““id2label””: {
““0"”: ““LABEL_0"”
},
““initializer_range””: 0.02,
““label2id””: {
““LABEL_0"”: 0
},
““layer_norm_epsilon””: 1e-05,
““model_type””: ““gpt2"”,
““n_ctx””: 1024,
““n_embd””: 768,
““n_head””: 12,
““n_inner””: null,
““n_layer””: 6,
““n_positions””: 1024,
““resid_pdrop””: 0.1,
““scale_attn_weights””: true,
““summary_activation””: null,
““summary_first_dropout””: 0.1,
““summary_proj_to_labels””: true,
““summary_type””: ““cls_index””,
““summary_use_proj””: true,
““task_specific_params””: {
““text-generation””: {
““do_sample””: true,
““max_length””: 50
}
},
““transformers_version””: ““4.6.1"”,
““use_cache””: true,
““vocab_size””: 50257”
1630412066527,”}
"
1630412066527,”[INFO|configuration_utils.py:517] 2021-08-31 11:28:48,481 >> loading configuration file https://huggingface.co/distilgpt2/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/f985248d2791fcff97732e4ee263617adec1edb5429a2b8421734c6d14e39bee.422318838d1ec4e061efb4ea29671cb2a044e244dc69229682bebd7cacc81631”
1630412066527,”[INFO|configuration_utils.py:553] 2021-08-31 11:28:48,482 >> Model config GPT2Config {
“”_num_labels"": 1,
““activation_function””: ““gelu_new””,
““architectures””: [
““GPT2LMHeadModel””
],
““attn_pdrop””: 0.1,
““bos_token_id””: 50256,
““embd_pdrop””: 0.1,
““eos_token_id””: 50256,
““gradient_checkpointing””: false,
““id2label””: {
"“0"”: ““LABEL_0"””

Able to train distilgpt2 with a batch_size of 2,
but at the end I am getting this error:

2021-08-31 15:12:16 Failed - Training job failed
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-4-c8ca6964838c> in <module>
     32 
     33 # starting the train job
---> 34 huggingface_estimator.fit({'training': 's3://domain-gen-data/domain-gen-training.csv'})

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    680         self.jobs.append(self.latest_training_job)
    681         if wait:
--> 682             self.latest_training_job.wait(logs=logs)
    683 
    684     def _compilation_job_name(self):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1625         # If logs are requested, call logs_for_jobs.
   1626         if logs != "None":
-> 1627             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1628         else:
   1629             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3731 
   3732         if wait:
-> 3733             self._check_job_status(job_name, description, "TrainingJobStatus")
   3734             if dot:
   3735                 print()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3291                 ),
   3292                 allowed_statuses=["Completed", "Stopped"],
-> 3293                 actual_status=status,
   3294             )
   3295 

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2021-08-31-14-22-25-283: Failed. Reason: ClientError: Artifact upload failed:Insufficient disk space

Is it referring to the disk space of the notebook or S3?

Adding more disk space to your training job should do the trick, according to this Stack Overflow thread.
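
For example, the HuggingFace estimator accepts a volume_size argument (EBS storage in GB); a sketch, where the size itself is an assumption you should adapt to the number and size of your checkpoints:

huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./examples/pytorch/language-modeling',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    volume_size=200,  # GB of EBS storage for everything written under /opt/ml (assumed value)
    role=role,
    git_config=git_config,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    hyperparameters=hyperparameters,
)

Limiting how many checkpoints are kept (e.g. save_total_limit in the hyperparameters) also helps, since every checkpoint-* directory written to /opt/ml/model counts against that volume.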

Increasing volume_size helps a lot.
Can you suggest a suitable instance type and configuration for GPT-J for future reference?

Since GPT-J requires at least 24GB just to load the model, you definitely need to use model parallelism. Therefore I suggest you take a look at notebooks/sagemaker-notebook.ipynb at master · huggingface/notebooks · GitHub.

I am going to create an example, once the new DLCs are out, that shows how someone can fine-tune GPT-J with model parallelism.
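
For orientation, a rough sketch of what a model-parallel training configuration could look like with the SageMaker HuggingFace estimator; the instance type, partition count and other parameters below are illustrative assumptions, not a tested GPT-J recipe:

from sagemaker.huggingface import HuggingFace

# SageMaker model parallelism is configured through the distribution argument
smp_options = {
    'enabled': True,
    'parameters': {
        'partitions': 4,          # split the model across 4 GPUs (assumption)
        'microbatches': 4,
        'pipeline': 'interleaved',
        'optimize': 'speed',
        'ddp': True,
    },
}
mpi_options = {'enabled': True, 'processes_per_host': 8}

huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./examples/pytorch/language-modeling',
    instance_type='ml.p3.16xlarge',  # multi-GPU instance required for model parallelism
    instance_count=1,
    role=role,                       # role, git_config and hyperparameters as defined earlier
    git_config=git_config,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    hyperparameters=hyperparameters,
    distribution={'smdistributed': {'modelparallel': smp_options}, 'mpi': mpi_options},
)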


When is this coming @philschmid? Thank you.

Depending on how long the release of the transformers 4.10 supported DLC takes, I think I can share something 2-3 weeks from now.


What’s the rough timeline for DLC upgrades? The original GPT-Neo implementation in transformers 4.5 had some major inefficiencies that make it tough to train and infer with, so a version upgraded to the release that fixed that (I believe 4.11) would be very useful.

Hey @charlesatftl,

We have already released DLCs for transformers 4.11 with PT1.9. Sadly, there is a bug in the python-sdk, so the image_uri cannot be generated automatically. Issue: aws/sagemaker-python-sdk#2700
Until this bug is solved, you can use the image_uri directly.

from sagemaker.huggingface import HuggingFaceModel

# create Hugging Face Model Class (hub and role as defined in your setup)
huggingface_model = HuggingFaceModel(
    #transformers_version='4.11',
    #pytorch_version='1.9',
    #py_version='py38',
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.9.0-transformers4.11.0-gpu-py38-cu111-ubuntu20.04",
    env=hub,
    role=role, 
)
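
The same workaround should apply on the training side, for example (the training image tag here is an assumption, please verify it against the available DLC images for your region):

from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./examples/pytorch/language-modeling',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    # pass the training DLC directly until the image_uri generation bug is fixed
    image_uri='763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:1.9.0-transformers4.11.0-gpu-py38-cu111-ubuntu20.04',
    hyperparameters=hyperparameters,
)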