After 0.07 epochs, the training job stops with the following error:
[INFO|trainer.py:1885] 2021-11-19 18:43:48,502 >> Saving model checkpoint to /opt/ml/model/checkpoint-3500
[INFO|configuration_utils.py:351] 2021-11-19 18:43:48,503 >> Configuration saved in /opt/ml/model/checkpoint-3500/config.json
Downloading: 0%| | 0.00/7.78k [00:00<?, ?B/s]
Downloading: 28.8kB [00:00, 16.0MB/s]
Downloading: 0%| | 0.00/4.47k [00:00<?, ?B/s]
Downloading: 28.7kB [00:00, 17.6MB/s]
2021-11-19 18:44:10 Uploading - Uploading generated training model
2021-11-19 18:44:10 Failed - Training job failed
ProfilerReport-1637341687: Stopping
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-13-6a9d8eb3a402> in <module>
     33
     34 # starting the train job
---> 35 huggingface_estimator.fit()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    690         self.jobs.append(self.latest_training_job)
    691         if wait:
--> 692             self.latest_training_job.wait(logs=logs)
    693
    694     def _compilation_job_name(self):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1650         # If logs are requested, call logs_for_jobs.
   1651         if logs != "None":
-> 1652             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1653         else:
   1654             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3776
   3777         if wait:
-> 3778             self._check_job_status(job_name, description, "TrainingJobStatus")
   3779             if dot:
   3780                 print()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3333                 ),
   3334                 allowed_statuses=["Completed", "Stopped"],
-> 3335                 actual_status=status,
   3336             )
   3337

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2021-11-19-17-08-07-355: Failed. Reason: ClientError: Artifact upload failed:Error 5: Received a failed archive status.
Thank you for your help.