ClientError: Artifact upload failed:Error 5

After 0.07 epochs, the training job stops and gives the following error:

[INFO|trainer.py:1885] 2021-11-19 18:43:48,502 >> Saving model checkpoint to /opt/ml/model/checkpoint-3500
[INFO|configuration_utils.py:351] 2021-11-19 18:43:48,503 >> Configuration saved in /opt/ml/model/checkpoint-3500/config.json
Downloading:   0%|          | 0.00/7.78k [00:00<?, ?B/s]
Downloading: 28.8kB [00:00, 16.0MB/s]
Downloading:   0%|          | 0.00/4.47k [00:00<?, ?B/s]
Downloading: 28.7kB [00:00, 17.6MB/s]

2021-11-19 18:44:10 Uploading - Uploading generated training model
2021-11-19 18:44:10 Failed - Training job failed
ProfilerReport-1637341687: Stopping
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-13-6a9d8eb3a402> in <module>
     33 
     34 # starting the train job
---> 35 huggingface_estimator.fit()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    690         self.jobs.append(self.latest_training_job)
    691         if wait:
--> 692             self.latest_training_job.wait(logs=logs)
    693 
    694     def _compilation_job_name(self):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1650         # If logs are requested, call logs_for_jobs.
   1651         if logs != "None":
-> 1652             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1653         else:
   1654             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3776 
   3777         if wait:
-> 3778             self._check_job_status(job_name, description, "TrainingJobStatus")
   3779             if dot:
   3780                 print()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3333                 ),
   3334                 allowed_statuses=["Completed", "Stopped"],
-> 3335                 actual_status=status,
   3336             )
   3337 

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2021-11-19-17-08-07-355: Failed. Reason: ClientError: Artifact upload failed:Error 5: Received a failed archive status.

Thank you for your help. :smile:

What did you save to /opt/ml/model?

The model, with its checkpoints as folders.

Do you know how big it was? This normally works.

Not really, but it can easily exceed 100 GB.
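
To put a number on the size question, one option is to measure the directory from inside the training script just before it exits. The snippet below is only a sketch using the Python standard library; the helper names `dir_size_bytes` and `subfolder_sizes` are made up for illustration and are not part of SageMaker or transformers. It assumes everything the job writes lives under /opt/ml/model, as the log above suggests.

```python
import os


def dir_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked files are not counted twice.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def subfolder_sizes(root):
    """Map each direct subfolder of root (e.g. checkpoint-3500) to its size in bytes.

    Returns an empty dict if root does not exist.
    """
    if not os.path.isdir(root):
        return {}
    sizes = {}
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.path.isdir(path):
            sizes[entry] = dir_size_bytes(path)
    return sizes
```

Calling `subfolder_sizes("/opt/ml/model")` at the end of training and printing the result would show in the job logs how much each checkpoint folder contributes, and `dir_size_bytes("/opt/ml/model")` gives the total that SageMaker then tries to archive and upload.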