This is the output I get when starting a training job with the Hugging Face estimator:
2021-11-03 15:08:16 Starting - Starting the training job...
2021-11-03 15:08:42 Starting - Launching requested ML instances
ProfilerReport-1635952096: InProgress
...
2021-11-03 15:09:13 Starting - Preparing the instances for training............
2021-11-03 15:11:02 Downloading - Downloading input data
2021-11-03 15:11:02 Training - Downloading the training image..................
2021-11-03 15:14:16 Uploading - Uploading generated training model
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-11-03 15:14:11,348 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training
2021-11-03 15:14:11,372 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2021-11-03 15:14:12,797 sagemaker_pytorch_container.training INFO Invoking user training script.
2021-11-03 15:14:13,102 sagemaker-training-toolkit ERROR Reporting training FAILURE
2021-11-03 15:14:13,102 sagemaker-training-toolkit ERROR framework error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/trainer.py", line 85, in train
    entrypoint()
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_container/training.py", line 121, in main
    train(environment.Environment())
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_container/training.py", line 73, in train
    runner_type=runner_type)
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/entry_point.py", line 92, in run
    files.download_and_extract(uri=uri, path=environment.code_dir)
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/files.py", line 131, in download_and_extract
    s3_download(uri, dst)
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/files.py", line 167, in s3_download
    s3.Bucket(bucket).download_file(key, dst)
  File "/opt/conda/lib/python3.6/site-packages/boto3/s3/inject.py", line 246, in bucket_download_file
    ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
  File "/opt/conda/lib/python3.6/site-packages/boto3/s3/inject.py", line 172, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/opt/conda/lib/python3.6/site-packages/boto3/s3/transfer.py", line 307, in download_file
    future.result()
  File "/opt/conda/lib/python3.6/site-packages/s3transfer/futures.py", line 106, in result
    return self._coordinator.result()
  File "/opt/conda/lib/python3.6/site-packages/s3transfer/futures.py", line 265, in result
    raise self._exception
  File "/opt/conda/lib/python3.6/site-packages/s3transfer/tasks.py", line 255, in _main
    self._submit(transfer_future=transfer_future, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/s3transfer/download.py", line 343, in _submit
    **transfer_future.meta.call_args.extra_args
  File "/opt/conda/lib/python3.6/site-packages/botocore/client.py", line 386, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/botocore/client.py", line 705, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
An error occurred (404) when calling the HeadObject operation: Not Found
2021-11-03 15:14:45 Failed - Training job failed
ProfilerReport-1635952096: Stopping
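
The traceback above fails inside `s3_download` while the container is fetching the code archive, and the 404 comes from `HeadObject`, so one thing worth checking is whether that object exists at the URI the job was given. Below is a minimal sketch of such a check, not the toolkit's own code. It assumes boto3 is installed and that credentials with read access to the bucket are configured. The helper names `split_s3_uri` and `object_exists` are hypothetical, and the log does not show the URI, so the one in the usage note is a placeholder.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def object_exists(uri, s3_client=None):
    """Return True if HeadObject succeeds, False on a 404; re-raise other errors."""
    if s3_client is None:
        import boto3  # imported lazily so split_s3_uri works without boto3

        s3_client = boto3.client("s3")
    bucket, key = split_s3_uri(uri)
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
        return True
    except Exception as err:
        # botocore's ClientError exposes the parsed error on `.response`;
        # a missing object on HeadObject is reported with code "404".
        code = getattr(err, "response", {}).get("Error", {}).get("Code")
        if code in ("404", "NoSuchKey", "NotFound"):
            return False
        raise
```

For example, `object_exists("s3://my-bucket/my-job/source/sourcedir.tar.gz")` (a made-up URI) would return `False` if the archive was never uploaded or was uploaded under a different key. Errors other than "not found", such as access denied, are re-raised so they are not mistaken for a missing object.
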
UnexpectedStatusException Traceback (most recent call last)
<ipython-input-118-7a4d3426c635> in <module>
6
7 # starting the train job with our uploaded datasets as input
----> 8 huggingface_estimator.fit(training_data, wait=True)
/opt/conda/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
690 self.jobs.append(self.latest_training_job)
691 if wait:
--> 692 self.latest_training_job.wait(logs=logs)
693
694 def _compilation_job_name(self):
/opt/conda/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
1650 # If logs are requested, call logs_for_jobs.
1651 if logs != "None":
-> 1652 self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
1653 else:
1654 self.sagemaker_session.wait_for_job(self.job_name)
/opt/conda/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
3776
3777 if wait:
-> 3778 self._check_job_status(job_name, description, "TrainingJobStatus")
3779 if dot:
3780 print()
/opt/conda/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
3333 ),
3334 allowed_statuses=["Completed", "Stopped"],
-> 3335 actual_status=status,
3336 )
3337
UnexpectedStatusException: Error for Training job operations-email-classification-2021-11-2021-11-03-15-08-16-273: Failed. Reason: AlgorithmError: framework error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/trainer.py", line 85, in train
    entrypoint()
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_container/training.py", line 121, in main
    train(environment.Environment())
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_container/training.py", line 73, in train
    runner_type=runner_type)
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/entry_point.py", line 92, in run
    files.download_and_extract(uri=uri, path=environment.code_dir)
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/files.py", line 131, in download_and_extract
    s3_download(uri, dst)
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/files.py", line 167, in s3_download
    s3.Bucket(bucket).download_file(key, dst)
  File "/opt/conda/lib/python3.6/site-packages/boto3/s3/inject.py", line 246, in bucket_download_file
    Extr