ClientError: Artifact upload failed:Error 5

After 0.07 epochs the training job stops and gives the following error:

[INFO|trainer.py:1885] 2021-11-19 18:43:48,502 >> Saving model checkpoint to /opt/ml/model/checkpoint-3500
[INFO|configuration_utils.py:351] 2021-11-19 18:43:48,503 >> Configuration saved in /opt/ml/model/checkpoint-3500/config.json
Downloading: 28.8kB [00:00, 16.0MB/s]
Downloading: 28.7kB [00:00, 17.6MB/s]

2021-11-19 18:44:10 Uploading - Uploading generated training model
2021-11-19 18:44:10 Failed - Training job failed
ProfilerReport-1637341687: Stopping
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-13-6a9d8eb3a402> in <module>
     33 
     34 # starting the train job
---> 35 huggingface_estimator.fit()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    690         self.jobs.append(self.latest_training_job)
    691         if wait:
--> 692             self.latest_training_job.wait(logs=logs)
    693 
    694     def _compilation_job_name(self):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1650         # If logs are requested, call logs_for_jobs.
   1651         if logs != "None":
-> 1652             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1653         else:
   1654             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3776 
   3777         if wait:
-> 3778             self._check_job_status(job_name, description, "TrainingJobStatus")
   3779             if dot:
   3780                 print()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3333                 ),
   3334                 allowed_statuses=["Completed", "Stopped"],
-> 3335                 actual_status=status,
   3336             )
   3337 

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2021-11-19-17-08-07-355: Failed. Reason: ClientError: Artifact upload failed:Error 5: Received a failed archive status.

Thank you for your help. :smile:

What did you save to /opt/ml/model?

The model, with its checkpoints as folders.

Do you know how big it was? This normally works.

Not really, but it can easily go to 100 GB plus.

You could use checkpointing with SageMaker, which automatically syncs checkpoints to S3 as they are created, and only save the final model (without checkpoints) to /opt/ml/model at the end. That keeps the end-of-job artifact small. A sketch follows below.
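In case it helps, here is a minimal sketch of what that could look like with the HuggingFace estimator. The bucket name, entry point, container versions and hyperparameters are placeholders, not taken from your setup; the key parts are checkpoint_s3_uri / checkpoint_local_path and pointing the Trainer's output_dir at the checkpoint path instead of /opt/ml/model:

import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

huggingface_estimator = HuggingFace(
    entry_point="train.py",           # placeholder training script
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.12",      # illustrative versions, pick ones matching your DLC
    pytorch_version="1.9",
    py_version="py38",
    # Checkpoints written to checkpoint_local_path are synced to S3 as they
    # are created, instead of being tarred into the final model artifact.
    checkpoint_s3_uri="s3://your-bucket/checkpoints",  # placeholder bucket
    checkpoint_local_path="/opt/ml/checkpoints",
    hyperparameters={
        "epochs": 3,
        # Have the Trainer write its checkpoints to the checkpoint path,
        # and save only the final model to /opt/ml/model in your script.
        "output_dir": "/opt/ml/checkpoints",
    },
)

huggingface_estimator.fit()

With this setup the archive uploaded from /opt/ml/model at the end of the job only contains the final model, which avoids trying to tar and upload 100 GB+ of checkpoints.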


OK, thanks.